pgvector / pgvector

Open-source vector similarity search for Postgres

Please support vector sizes of 20,000 like Pinecone #112

Closed AceHack closed 1 year ago

AceHack commented 1 year ago

I would much rather use pgvector with PostgreSQL than Pinecone. Any suggestions? Thanks.

ankane commented 1 year ago

Hi @AceHack, can you share more about the use case?

flexchar commented 1 year ago

If I may, I'd love to hear where you get 20k-dimensional embeddings, if that is of course your use case. @AceHack

ankane commented 1 year ago

Can revisit when there are use cases to discuss.

AceHack commented 1 year ago

https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models#embeddings-models

You can see here that the Davinci embedding model has 12,288 dimensions. I currently have to use Pinecone for this. @ankane can you re-open, please? Davinci is basically GPT-3. I would think you would want to support GPT-3 embeddings.

ankane commented 1 year ago

You can currently store vectors with up to 16,000 dimensions.

AceHack commented 1 year ago

Sorry, the documentation said 2,000. How do I get 16,000? https://github.com/pgvector/pgvector#indexing says: 'Vectors with up to 2,000 dimensions can be indexed.' With Pinecone I can index up to 20,000.

The embeddings are pretty useless to me if they are just stored and not indexed. I have tens of millions of rows.

ankane commented 1 year ago

For indexing, will hopefully be able to support more than 2,000 dimensions at some point, but you'll need to use dimensionality reduction or another project for now.

fwiw, from the link above, they "strongly recommend using text-embedding-ada-002" over the other models, so I think the Davinci use case will be less common.
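
For readers wondering what the dimensionality reduction mentioned above could look like, here is a minimal sketch using Gaussian random projection. This is one generic technique, swapped in purely for illustration; it is not something pgvector provides, and nobody in this thread recommends it specifically. The names `make_projection`, `project`, and `to_pgvector_literal` are hypothetical, and `INDEX_MAX_DIMS` is just the 2,000 indexing figure quoted above. The code is pure Python so it runs anywhere; at 12,288 input dimensions you would almost certainly want NumPy or a similar library instead.

```python
import math
import random

# Indexing limit quoted in this thread (assumption: still current for your version).
INDEX_MAX_DIMS = 2000


def make_projection(in_dims, out_dims, seed=0):
    """Build a random Gaussian projection matrix: out_dims rows of in_dims weights."""
    if out_dims > INDEX_MAX_DIMS:
        raise ValueError(f"out_dims must be <= {INDEX_MAX_DIMS} to be indexable")
    rng = random.Random(seed)  # fixed seed so every row is projected the same way
    scale = 1.0 / math.sqrt(out_dims)  # keeps expected squared norms unchanged
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dims)]
        for _ in range(out_dims)
    ]


def project(matrix, vector):
    """Map a high-dimensional vector to len(matrix) dimensions."""
    if len(vector) != len(matrix[0]):
        raise ValueError("vector length does not match projection input size")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def to_pgvector_literal(vector):
    """Format a vector in the '[x1,x2,...]' text form accepted for vector columns."""
    return "[" + ",".join(f"{x:.6g}" for x in vector) + "]"


if __name__ == "__main__":
    # Small sizes so the pure-Python demo runs quickly.
    rng = random.Random(42)
    embedding = [rng.gauss(0.0, 1.0) for _ in range(256)]
    matrix = make_projection(in_dims=256, out_dims=32)
    reduced = project(matrix, embedding)
    print(len(reduced))
    print(to_pgvector_literal(reduced[:3]))
```

The same projection matrix has to be applied to both stored rows and query vectors, which is why the seed is fixed. How much search quality this costs depends on the data and the target size, so it is worth measuring before committing to it.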

nreith commented 1 year ago

@ankane Do you have any updates on when supporting more than 2k indexed dimensions might be on pgvector's roadmap, or Postgres's? Also, do you have any info on how dimensionality reduction impacts accuracy? I'm basically evaluating between pgvector and Redis, and there are a lot of pros with pgvector, but the 2k-dimension limit for indexed vectors seems like a limiting factor.
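
One way to get a rough answer to the accuracy question on your own data is to compare exact nearest neighbors before and after reduction and report recall@k. The sketch below is only an illustration under assumptions: `top_k`, `recall_at_k`, and `mean_recall` are hypothetical helper names, Euclidean distance is assumed, and the truncation reducer in the demo is a deliberately crude placeholder for whatever reduction you actually use.

```python
import math
import random


def top_k(query, vectors, k):
    """Indices of the k vectors nearest to query by Euclidean distance (exact scan)."""
    dists = sorted((math.dist(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in dists[:k]]


def recall_at_k(true_ids, approx_ids):
    """Fraction of the true neighbors that also appear in the approximate result."""
    return len(set(true_ids) & set(approx_ids)) / len(true_ids)


def mean_recall(queries, vectors, reduce_fn, k=10):
    """Average recall@k when searching in the space produced by reduce_fn."""
    reduced_vectors = [reduce_fn(v) for v in vectors]
    total = 0.0
    for q in queries:
        truth = top_k(q, vectors, k)                       # neighbors in the original space
        approx = top_k(reduce_fn(q), reduced_vectors, k)   # neighbors after reduction
        total += recall_at_k(truth, approx)
    return total / len(queries)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(200)]
    queries = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(10)]
    # Placeholder reducer: keep only the first 16 coordinates.
    print(mean_recall(queries, data, lambda v: v[:16], k=10))
```

A recall close to 1.0 means the reduced vectors return nearly the same neighbors as the originals; lower values show how much the reduction costs on that dataset.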

jkatz commented 1 year ago

@nreith Do you have a specific dataset with > 2K dimensions that you are looking to index?

nreith commented 1 year ago

@jkatz Yes. Pretty much all the big new ones on Hugging Face that come with embeddings.

saurik commented 1 year ago

@jkatz Another use case for >2k is that the common facial recognition embedding libraries seem to generate 2,622-dimensional vectors.