tdebatty / spark-knn-graphs

Spark algorithms for building k-nn graphs
MIT License

A couple of questions #1

Closed · MLnick closed this issue 8 years ago

MLnick commented 9 years ago
  1. How scalable is the implementation? What size of input data have you tested it against?
  2. Is there a way (or potentially a way) to compute the NN based on a filter (e.g. only compute from a subset of input vectors, filtered by some other node attribute, ID, etc.)?

For 2, the use case would be, say, to compute user and item vectors in a CF model, then compute the NN vector set for item-item similarity, as well as user-item scores.

tdebatty commented 9 years ago

Hello,

The NNDescent algorithm, which is currently the most robust, was tested with a dataset of 4 million items (spam titles).

For filtering, the cleanest way is to first filter your RDD:
http://spark.apache.org/docs/latest/programming-guide.html#transformations
http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaRDD.html
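For example, a minimal sketch (the "item_" id prefix convention is made up for illustration; only the filtered RDD is then handed to the graph builder):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import info.debatty.java.graphs.Node;

public class FilterExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("filter-example").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Toy data: the node id encodes the type ("item_..." or "user_...").
        JavaRDD<Node<double[]>> nodes = sc.parallelize(Arrays.asList(
                new Node<double[]>("item_1", new double[]{1.0, 0.0}),
                new Node<double[]>("user_1", new double[]{0.5, 0.5})));

        // Keep only the item vectors before building the k-nn graph.
        JavaRDD<Node<double[]>> items = nodes.filter(n -> n.id.startsWith("item_"));

        System.out.println(items.count());
        sc.close();
    }
}
```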

In any case, don't hesitate to keep me informed...

Regards,

MLnick commented 9 years ago

Ok thanks - from a quick look through the code, it appears that this could potentially scale out with the number of items if the number of buckets is made large enough. Will be interesting to confirm.

What about LSHSuperBit for DoubleArray compared to NNDescent?

For #2, what I meant was rather this:

Say I have 20 million user vectors and 5 million item vectors. I'd like to compute the NN for item-item cosine similarity. This is easy: I just pass in an RDD[Node] where each Node is an item vector.

Now say I would like to compute the user-item "similarity" (in reality I'd take the dot product, but cosine sim could suffice, and I guess one could implement a simple dot product "similarity" too). I could pass in an RDD[Node] containing all the vectors (25 million, say). But if I don't care about the user-user similarities, it is wasteful to compute them.
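(As an aside, a minimal sketch of such a dot-product "similarity", assuming the single-method SimilarityInterface from the underlying java-graphs library:)

```java
import info.debatty.java.graphs.SimilarityInterface;

// Unbounded dot-product "similarity": still usable for ranking neighbors,
// but unlike cosine similarity it is not normalized to [-1, 1].
class DotProductSimilarity implements SimilarityInterface<double[]> {
    public double similarity(double[] a, double[] b) {
        double dot = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
        }
        return dot;
    }
}
```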

So what I'd want is to apply some step so that, for each user vector, I compute the NN from the set of item vectors.

Hope this is clearer.

tdebatty commented 9 years ago

Hello,

The drawback of LSHSuperBit (and other partitioning algorithms) is that if your data is skewed (as it often is), a lot of items (or even most items) might fall into a few buckets, which would largely decrease the efficiency of the algorithm. So you might give LSHSuperBit a try, but NNDescent is definitely more robust!

For your second question, this use case is currently not possible, but I can certainly add a filter hook. The interface would probably look something like the sketch below.
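A hypothetical sketch (the FilterInterface name and signature are illustrative only, not the project's actual API):

```java
import info.debatty.java.graphs.Node;
import java.io.Serializable;

// Hypothetical hook: decide, for a pair of nodes, whether their
// similarity (and hence a potential edge) should be computed at all.
interface FilterInterface<T> extends Serializable {
    boolean filter(Node<T> node1, Node<T> node2);
}
```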

What do you think about this?

Regards,


MLnick commented 9 years ago

Ok, that makes sense

The interface for prefilter looks good - though it looks like Node can only use the id String to do the filter? In which case I'd need to hack the id a bit to be something like "54_user" and "71_item" etc.

Or, can I define a value for the Node that is a (type, vector) pair, and define the similarity interface such that it operates on the (type, vector) but computes the similarity using only the vector part of the tuple?

tdebatty commented 9 years ago

Hi,

Just realized that you can achieve the same effect without this new interface.

You need to:

- define a custom value class for the nodes, holding both the type (user or item) and the vector;
- implement a similarity measure that computes the score from the vector part only (for example, returning 0 for same-type pairs).

I posted an example on GitHub: https://github.com/tdebatty/spark-knn-graphs/blob/master/src/main/java/info/debatty/spark/knngraphs/example/NNDescentCustomValue.java
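A condensed, illustrative sketch of the idea (the class names here are made up; the linked file shows the actual example):

```java
import info.debatty.java.graphs.SimilarityInterface;
import java.io.Serializable;

// Illustrative value class: tags each vector with its type ("user" or "item").
class ValueWithType implements Serializable {
    final String type;
    final double[] vector;

    ValueWithType(String type, double[] vector) {
        this.type = type;
        this.vector = vector;
    }
}

// Similarity computed from the vector part only; same-type pairs get 0,
// so user-user and item-item neighbors are effectively suppressed.
class UserItemSimilarity implements SimilarityInterface<ValueWithType> {
    public double similarity(ValueWithType a, ValueWithType b) {
        if (a.type.equals(b.type)) {
            return 0;
        }
        double dot = 0;
        for (int i = 0; i < a.vector.length; i++) {
            dot += a.vector[i] * b.vector[i];
        }
        return dot;
    }
}
```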

Don't hesitate to let me know how it works for you...

Regards,
