Closed — AceHack closed this issue 1 year ago
Hi @AceHack, can you share more about the use case?
If I may, I'd love to hear where you get 20k embeddings, if that of course is your use case. @AceHack
Can revisit when there are use cases to discuss
https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models#embeddings-models You can see here that the Davinci embedding model has 12,288 dimensions. I currently have to use Pinecone for this. @ankane, can you re-open, please? Davinci is basically GPT-3, so I would think you'd want to support GPT-3 embeddings.
You can currently store vectors with up to 16,000 dimensions.
Sorry, the documentation says 2,000. How do I get 16,000? https://github.com/pgvector/pgvector#indexing quote: "Vectors with up to 2,000 dimensions can be indexed." With Pinecone I can index up to 20,000.
The embeddings are pretty useless to me if they are just stored and not indexed. I have tens of millions of rows.
For indexing, we'll hopefully be able to support more than 2,000 dimensions at some point, but for now you'll need to use dimensionality reduction or another project.
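For anyone evaluating the dimensionality-reduction route, here is a minimal sketch of PCA via numpy's SVD that projects embeddings under the 2,000-dimension index limit. This is not pgvector functionality, just a hypothetical preprocessing step, and the array sizes below are illustrative:

```python
import numpy as np

def pca_reduce(embeddings, n_components):
    """Reduce embedding dimensionality with PCA (via thin SVD).

    embeddings: (n_samples, n_dims) array, e.g. 12,288-dim Davinci vectors.
    Returns an (n_samples, n_components) array suitable for indexing.
    """
    # Center the data so the SVD components are the principal axes
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal directions, ordered by variance explained
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T

# Toy sizes for illustration; real high-dim embeddings would be
# e.g. (n_rows, 12288) reduced to something under 2,000
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 512))
X_red = pca_reduce(X, 128)
```

Note that the projection (`mean` and `components`) must be fit once and then reused for query vectors, so queries land in the same reduced space as the stored rows.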
fwiw, from the link above, they "strongly recommend using text-embedding-ada-002" over the other models, so I think the Davinci use case will be less common.
@ankane Do you have any updates on when support for more than 2,000 indexed dimensions might land on pgvector's (or Postgres's) roadmap? Also, do you have any info on how dimensionality reduction impacts accuracy? I'm evaluating pgvector against Redis, and pgvector has a lot of pros, but the 2,000-dimension index limit seems like a limiting factor.
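On the accuracy question, one way to estimate the impact on your own data is to measure how many true nearest neighbors survive the reduction. A hedged sketch (the random-projection matrix `P` below is just a stand-in for whatever reduction you choose):

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors (cosine similarity) per row."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)  # exclude each point itself
    return np.argsort(-sims, axis=1)[:, :k]

def recall_at_k(X_full, X_reduced, k=10):
    """Fraction of true top-k neighbors retained after reduction."""
    truth = knn_indices(X_full, k)
    approx = knn_indices(X_reduced, k)
    hits = [len(set(t) & set(a)) for t, a in zip(truth, approx)]
    return sum(hits) / (k * len(X_full))

# Illustrative data and a simple random projection from 512 to 128 dims
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 512))
P = rng.normal(size=(512, 128)) / np.sqrt(128)
score = recall_at_k(X, X @ P, k=10)
```

A recall close to 1.0 on a sample of your rows suggests the reduction is acceptable for your retrieval use case; real embeddings usually fare much better than the random data used here.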
@nreith Do you have a specific dataset with > 2K dimensions that you are looking to index?
@jkatz Yes. Pretty much all the big new models on Hugging Face that ship with embeddings.
@jkatz Another use case for >2,000 is facial recognition: the common embedding libraries generate 2,622-dimension vectors.
I would much rather use pgvector with PostgreSQL than Pinecone. Any suggestions? Thanks.