tdebatty / java-LSH

A Java implementation of Locality Sensitive Hashing (LSH)
Other
294 stars 84 forks source link

Examples show boolean vectors, what about string vectors? #27

Open stenpiren opened 4 years ago

stenpiren commented 4 years ago

Hi, I was wondering how to use this library for comparing two different Strings that are tokenized into a string vector each. The examples only show boolean vectors which are just "post-transformation". As a newbie and to make great use of the library, it would be great to have the transformation part covered in the examples.

linusjf commented 4 years ago

Excellent point. It would help to have examples converting strings and documents to vectors using something like word2vec but more lightweight.

Something like this but with a fully coded example:

https://link.medium.com/xF9m59LMr1

nanshakov commented 4 years ago

I use this approach:


import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.stream.IntStream;

import info.debatty.java.lsh.MinHash;

/**
 * Hello world!
 */
public class App {

    //https://github.com/tdebatty/java-LSH
    public static void main(String[] args) {
        int signature_size = 128;
        int dictionary_size = 256;
        int initial_seed = 1234567890;

        MinHash mh = new MinHash(signature_size, dictionary_size, initial_seed);
        String pass1 = "Nikita";
        String pass2 = "Nikika";
        String pass3 = "Nikika1995strong";
        String pass4 = "qweertyuiop";

        var sig1 = mh.signature(fromString(pass1));
        var sig2 = mh.signature(fromString(pass2));
        var sig3 = mh.signature(fromString(pass3));
        var sig4 = mh.signature(fromString(pass4));

        System.out.println("sig1-sig2 : " + mh.similarity(sig1, sig2));
        System.out.println("sig1-sig3 : " + mh.similarity(sig1, sig3));
        System.out.println("sig1-sig4 : " + mh.similarity(sig1, sig4));

        println(sig1);
        println(sig2);
        println(sig3);
        println(sig4);
    }

    public static boolean[] fromString(String str) {
        var bytes = str.getBytes();
        var result = new boolean[256];
        int index = 0;
        for (int i = 0; i < str.length(); i++, index += 8) {
            var bits = byteToBoolArr(bytes[i]);
            System.arraycopy(bits, 0, result, index, bits.length);
        }
        return result;
    }

    public static boolean[] byteToBoolArr(byte b) {
        var boolArr = new boolean[8];
        for (int i = 0; i < 8; i++) {
            boolArr[i] = (b & (byte) (128 / Math.pow(2, i))) != 0;
        }
        return boolArr;
    }

    static void println(final int[] array) {
        System.out.println(convertToBase64(array));
    }

    public static String convertToBase64(int[] ints) {
        ByteBuffer buf = ByteBuffer.allocate(ints.length);
        IntStream.of(ints).forEach(i -> buf.put((byte)i));
        return Base64.getEncoder().encodeToString(buf.array());
    }
}
Jurian commented 2 years ago

This is a very interesting method and I would like to use it in my project. At the moment I am using ngrams to create boolean vectors, perhaps this works better and faster.

I do have a few questions/observations:

  1. You copy 8 bits into the result array for every character in the string, this means the strings have a maximum length?
  2. I assume the signature size and the number 128 in byteToBoolArr() are the same variable?
  3. When does it make sense to change the signature size?

Cheers

PS. I slightly changed your code to make more use of (Java) constants:

protected static final int DIMENSIONS = 512;
protected static final int SIGNATURE_SIZE = 128;

private boolean[] byteToBoolArr(byte b) {
    boolean[] boolArr = new boolean[Byte.SIZE];
    for (int i = 0; i < Byte.SIZE; i++) {
        boolArr[i] = (b & (byte) (SIGNATURE_SIZE / Math.pow(2, i))) != 0;
    }
    return boolArr;
}

private boolean[] fromString(String token) {
    if(token.length() / Byte.SIZE > DIMENSIONS) {
        throw new IllegalArgumentException(
                "Input string too large to be encoded. " +
                "Would need at least " + (token.length()*Byte.SIZE) + " dimensions.");
    }
    byte[] bytes = token.getBytes();
    boolean[] result = new boolean[DIMENSIONS];
    for (int c_i = 0, b_i = 0; c_i < bytes.length; c_i++, b_i += Byte.SIZE) {
        System.arraycopy(byteToBoolArr(bytes[c_i]), 0, result, b_i, Byte.SIZE);
    }
    return result;
}
jianshu93 commented 1 month ago

Hi all,

I have the same question here, why do not we use some random hash functions to hash strings into some u64 and consider it as bit vectors/binary vectors, this is how it was used in practice since I can imagine for truely random hash functions, same strings will be hashed to the same binary vectors. Check the Rust version of SRP-LSH here: https://github.com/serega/gaoya

Any idea how to modified it?

Thanks,

Jianshu