tdebatty / spark-knn-graphs

Spark algorithms for building k-nn graphs
MIT License
41 stars 15 forks

Make NNCTPH take in StringProfile or SparseIntegerVector? #4

Open thiakx opened 8 years ago

thiakx commented 8 years ago

Hi. I was able to deploy LSHSuperBitNNDescentTextExample successfully on our Spark cluster. I really like the idea of pre-calculating the stringProfiles via ks.getProfile, and performance is good.

I am now testing NNCTPHExample and trying to feed NNCTPH the pre-calculated stringProfiles. Unfortunately, it seems that the NNCTPH constructor and .setSimilarity only accept String. Could NNCTPH be made to accept StringProfile or SparseIntegerVector? It is a lot slower than LSHSuperBitNNDescentTextExample, and I suspect it has to recalculate the profiles at every comparison. I also replaced Jaro-Winkler with the cheaper Jaccard index, which improved performance slightly.
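To illustrate the suspected cost, here is a minimal, self-contained sketch in plain Java. It does not use spark-knn-graphs or ks.getProfile; the class `ShingleJaccard` and its methods are hypothetical names, and the "profile" is simply the set of k-character shingles. The string-based overload rebuilds both profiles on every call, while the profile-based overload reuses work done once per string.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of spark-knn-graphs.
class ShingleJaccard {

    // Build the set of k-character shingles of a string (its "profile").
    static Set<String> profile(String s, int k) {
        Set<String> shingles = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            shingles.add(s.substring(i, i + k));
        }
        return shingles;
    }

    // Fast path: Jaccard index on two precomputed profiles,
    // i.e. size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty profiles are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    // Slow path: both profiles are rebuilt on every single comparison.
    static double jaccard(String a, String b, int k) {
        return jaccard(profile(a, k), profile(b, k));
    }

    public static void main(String[] args) {
        // Profiles computed once, then reused for any number of comparisons.
        Set<String> p1 = profile("abcd", 2);
        Set<String> p2 = profile("bcde", 2);
        System.out.println(jaccard(p1, p2));
        // Same result, but the shingling work is repeated on each call.
        System.out.println(jaccard("abcd", "bcde", 2));
    }
}
```

With n strings and many pairwise comparisons per string, the slow path shingles each string once per comparison instead of once overall, which would be consistent with the slowdown described above. Whether this is what actually happens inside NNCTPH is an assumption.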

tdebatty commented 8 years ago

Hello,

Sorry for the late answer :-/

Your idea is good, but NNCTPH is currently not compatible with this approach: NNCTPH requires a plain String as input so that it can compute a hash and bin the data into different buckets, whereas you would like to compute similarity between the profile representations of these strings.

One solution would be to refactor NNCTPH so that it takes an interface as input (instead of the Node class). I will run some tests and keep you informed...
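A rough sketch of what such an interface-based input could look like, in plain Java with no dependency on spark-knn-graphs. Every name here (`InterfaceInputSketch`, `Hashable`, `ProfiledString`, `bucketOf`, `similarity`) is hypothetical, and the bucketing uses `String.hashCode()` purely as a placeholder rather than the hashing NNCTPH actually performs. The point is only that one object can expose the raw string for bucketing and a profile, computed once, for similarity.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; none of these names exist in spark-knn-graphs.
class InterfaceInputSketch {

    // What a bucketing step needs from its input: a raw string to hash.
    interface Hashable {
        String hashInput();
    }

    // Holds the raw string (for bucketing) and a shingle profile
    // computed once in the constructor (for similarity).
    static final class ProfiledString implements Hashable {
        final String value;
        final Set<String> profile;

        ProfiledString(String value, int k) {
            this.value = value;
            this.profile = new HashSet<>();
            for (int i = 0; i + k <= value.length(); i++) {
                profile.add(value.substring(i, i + k));
            }
        }

        @Override
        public String hashInput() {
            return value;
        }
    }

    // Placeholder bucketing: String.hashCode() modulo the bucket count.
    // This is NOT the hash used by NNCTPH; it only shows the data flow.
    static int bucketOf(Hashable item, int buckets) {
        return Math.floorMod(item.hashInput().hashCode(), buckets);
    }

    // Jaccard index on the precomputed profiles; no re-shingling here.
    static double similarity(ProfiledString a, ProfiledString b) {
        if (a.profile.isEmpty() && b.profile.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a.profile);
        intersection.retainAll(b.profile);
        int union = a.profile.size() + b.profile.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        ProfiledString a = new ProfiledString("abcd", 2);
        ProfiledString b = new ProfiledString("bcde", 2);
        System.out.println("bucket of a: " + bucketOf(a, 4));
        System.out.println("bucket of b: " + bucketOf(b, 4));
        System.out.println("similarity:  " + similarity(a, b));
    }
}
```

In this arrangement the partitioning code depends only on `Hashable`, so the similarity function is free to work on whatever precomputed representation the concrete type carries.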