Open stenpiren opened 4 years ago
Excellent point. It would help to have examples converting strings and documents to vectors using something like word2vec but more lightweight.
Something like this but with a fully coded example:
I use this approach:
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.stream.IntStream;
import info.debatty.java.lsh.MinHash;
/**
* Hello world!
*/
public class App {
//https://github.com/tdebatty/java-LSH
public static void main(String[] args) {
int signature_size = 128;
int dictionary_size = 256;
int initial_seed = 1234567890;
MinHash mh = new MinHash(signature_size, dictionary_size, initial_seed);
String pass1 = "Nikita";
String pass2 = "Nikika";
String pass3 = "Nikika1995strong";
String pass4 = "qweertyuiop";
var sig1 = mh.signature(fromString(pass1));
var sig2 = mh.signature(fromString(pass2));
var sig3 = mh.signature(fromString(pass3));
var sig4 = mh.signature(fromString(pass4));
System.out.println("sig1-sig2 : " + mh.similarity(sig1, sig2));
System.out.println("sig1-sig3 : " + mh.similarity(sig1, sig3));
System.out.println("sig1-sig4 : " + mh.similarity(sig1, sig4));
println(sig1);
println(sig2);
println(sig3);
println(sig4);
}
public static boolean[] fromString(String str) {
var bytes = str.getBytes();
var result = new boolean[256];
int index = 0;
for (int i = 0; i < str.length(); i++, index += 8) {
var bits = byteToBoolArr(bytes[i]);
System.arraycopy(bits, 0, result, index, bits.length);
}
return result;
}
public static boolean[] byteToBoolArr(byte b) {
var boolArr = new boolean[8];
for (int i = 0; i < 8; i++) {
boolArr[i] = (b & (byte) (128 / Math.pow(2, i))) != 0;
}
return boolArr;
}
static void println(final int[] array) {
System.out.println(convertToBase64(array));
}
public static String convertToBase64(int[] ints) {
ByteBuffer buf = ByteBuffer.allocate(ints.length);
IntStream.of(ints).forEach(i -> buf.put((byte)i));
return Base64.getEncoder().encodeToString(buf.array());
}
}
This is a very interesting method and I would like to use it in my project. At the moment I am using ngrams to create boolean vectors, perhaps this works better and faster.
I do have a few questions/observations:
Cheers
PS. I slightly changed your code to make more use of (Java) constants:
protected static final int DIMENSIONS = 512;
protected static final int SIGNATURE_SIZE = 128;
private boolean[] byteToBoolArr(byte b) {
boolean[] boolArr = new boolean[Byte.SIZE];
for (int i = 0; i < Byte.SIZE; i++) {
boolArr[i] = (b & (byte) (SIGNATURE_SIZE / Math.pow(2, i))) != 0;
}
return boolArr;
}
private boolean[] fromString(String token) {
if(token.length() / Byte.SIZE > DIMENSIONS) {
throw new IllegalArgumentException(
"Input string too large to be encoded. " +
"Would need at least " + (token.length()*Byte.SIZE) + " dimensions.");
}
byte[] bytes = token.getBytes();
boolean[] result = new boolean[DIMENSIONS];
for (int c_i = 0, b_i = 0; c_i < bytes.length; c_i++, b_i += Byte.SIZE) {
System.arraycopy(byteToBoolArr(bytes[c_i]), 0, result, b_i, Byte.SIZE);
}
return result;
}
Hi all,
I have the same question here, why do not we use some random hash functions to hash strings into some u64 and consider it as bit vectors/binary vectors, this is how it was used in practice since I can imagine for truely random hash functions, same strings will be hashed to the same binary vectors. Check the Rust version of SRP-LSH here: https://github.com/serega/gaoya
Any idea how to modified it?
Thanks,
Jianshu
Hi, I was wondering how to use this library for comparing two different Strings that are tokenized into a string vector each. The examples only show boolean vectors which are just "post-transformation". As a newbie and to make great use of the library, it would be great to have the transformation part covered in the examples.