Make NNCTPH take in StringProfile or SparseIntegerVector?

tdebatty / spark-knn-graphs

Spark algorithms for building k-nn graphs

MIT License


Hi. I was able to deploy LSHSuperBitNNDescentTextExample successfully on our Spark cluster. I really like the idea of pre-calculating the StringProfiles via ks.getProfile, and performance is good.

I am now testing NNCTPHExample and trying to feed NNCTPH the pre-calculated StringProfiles. Unfortunately, it seems that the NNCTPH constructor and .setSimilarity only accept String. Can we make NNCTPH take a StringProfile or SparseIntegerVector instead? NNCTPHExample is a lot slower than LSHSuperBitNNDescentTextExample, and I suspect it has to recalculate the profiles at every comparison. I also replaced Jaro-Winkler with the more cost-efficient Jaccard index, which improved performance slightly.
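To illustrate the cost I suspect, here is a minimal, library-free sketch. Everything in it is hypothetical: `ShingleProfiles`, `profile` and both `jaccard` overloads are names made up for this example, and a plain set of k-character shingles stands in for a real StringProfile. I have not confirmed that NNCTPH behaves like the string-based overload; this only shows why a String-only similarity would repeat the profile work on every comparison.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of spark-knn-graphs.
final class ShingleProfiles {

    // Build the set of k-character shingles of s (a stand-in for a profile).
    static Set<String> profile(String s, int k) {
        Set<String> shingles = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            shingles.add(s.substring(i, i + k));
        }
        return shingles;
    }

    // Jaccard index on two precomputed profiles: |A and B| / |A or B|.
    // Two empty profiles are treated as identical (an assumption of this sketch).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        // Iterate over the smaller set to keep the intersection count cheap.
        Set<String> smaller = a.size() <= b.size() ? a : b;
        Set<String> larger = (smaller == a) ? b : a;
        int intersection = 0;
        for (String shingle : smaller) {
            if (larger.contains(shingle)) {
                intersection++;
            }
        }
        return (double) intersection / (a.size() + b.size() - intersection);
    }

    // String-based variant: rebuilds both profiles on every single call.
    static double jaccard(String a, String b, int k) {
        return jaccard(profile(a, k), profile(b, k));
    }
}
```

With the set-based overload, each string is shingled once and the profile is reused across all comparisons it takes part in. With the string-based overload, the same string is shingled again each time it is compared.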

Make NNCTPH take in StringProfile or SparseIntegerVector? #4