mikeheddes / dothash

Estimating Set Similarity Metrics for Link Prediction and Document Deduplication
https://arxiv.org/abs/2305.17310
GNU General Public License v3.0

Limitation on number of vectors that can be searched for dot product similarity #1

Open ksrinivs64 opened 10 months ago

ksrinivs64 commented 10 months ago

Hi, thanks for a very nice paper and the code. Does the DotHash solution scale to millions of vectors (say, millions of documents for vector search), or is it currently limited by what can be computed in memory? Thanks.

mikeheddes commented 10 months ago

Hi, thank you for your interest in DotHash. DotHash simply requires storing a vector for each document or node. You could use something like a vector database to scale this to millions of vectors.
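As a minimal sketch of the in-memory case, assuming each document is already represented by a fixed-length vector (how DotHash produces those vectors is not shown here), exact dot product search can be written as below. The names `dot` and `top_k_by_dot_product` are hypothetical and are not part of the dothash package; the toy vectors are made up for illustration.

```python
import heapq
from typing import Iterable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(
    query: Sequence[float],
    vectors: Iterable[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[float, str]]:
    """Exact (brute-force) search: return the k highest (score, id) pairs.

    `vectors` may be any iterable, e.g. a generator reading from disk,
    so the full collection never has to be held in memory at once.
    """
    # heapq.nlargest keeps at most k candidates while consuming the iterable.
    scored = ((dot(query, vec), doc_id) for doc_id, vec in vectors)
    return heapq.nlargest(k, scored)


# Toy data: three hypothetical documents with 4-dimensional vectors.
stored = [
    ("doc_a", [1.0, 0.0, 2.0, 0.0]),
    ("doc_b", [0.0, 1.0, 0.0, 1.0]),
    ("doc_c", [1.0, 1.0, 1.0, 1.0]),
]
results = top_k_by_dot_product([1.0, 0.0, 1.0, 0.0], stored, k=2)
```

Exact search like this takes time linear in the number of stored vectors for each query, so whether it is acceptable for millions of documents depends on the vector dimension and the latency budget.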

ksrinivs64 commented 10 months ago

Yes, but I was looking at ANNs and most claim to do very poorly with high-dimensional vectors. Have you tried any particular one? If so, would you recommend it? Thanks again, very cool work.

mikeheddes commented 10 months ago

I have not used vector databases; our experiments were small enough that everything fit in memory. Could you elaborate on the following:

I was looking at ANNs and most claim to do very poorly with high dimensional vectors

It is not clear to me what performs poorly: the ANN algorithm or the vector database?

ksrinivs64 commented 10 months ago

As far as I know, all vector databases scale by using space-partitioning algorithms, and the ones I looked at, such as FAISS, said they become really inaccurate with high-dimensional vectors. Kavitha
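One way to check how much accuracy an approximate index loses on a given dataset is to compare its results with exact search and compute recall@k. The helper below is a generic sketch: `recall_at_k` is a hypothetical name, and the ID lists are made-up placeholders rather than output from FAISS or any other library.

```python
from typing import Hashable, Sequence


def recall_at_k(
    exact_ids: Sequence[Hashable],
    approx_ids: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of the true top-k neighbours that the approximate search also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    true_top = set(exact_ids[:k])
    if not true_top:
        return 0.0
    found = true_top & set(approx_ids[:k])
    return len(found) / len(true_top)


# Placeholder rankings: exact search versus a hypothetical approximate index.
exact = ["doc_a", "doc_c", "doc_b", "doc_d"]
approx = ["doc_a", "doc_b", "doc_e", "doc_c"]
score = recall_at_k(exact, approx, k=3)
```

Averaging this value over a sample of queries gives a rough picture of how far an approximate method drifts from exact results at a given vector dimension.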
