Closed vcahlik closed 1 month ago
This issue happens because the vector size is hardcoded in the constant ADA_TOKEN_COUNT = 1536 in pgvector.py, so the column in the table is created with a vector of that size, which is not compatible with the Instructor embedding size (768).
It would be ideal if we could pass the vector size as a parameter or have it automagically obtained from some property in the embedding class.
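Until such a parameter exists, the size can be obtained at runtime from the embedding object itself: embed_query is part of the LangChain Embeddings interface, so the dimensionality is just the length of one embedded probe string. A minimal sketch (DummyEmbeddings is a stand-in for illustration, not a real class; with a real setup you would pass e.g. a HuggingFaceEmbeddings instance):

```python
def infer_vector_size(embeddings) -> int:
    """Return the dimensionality of an embedding model by embedding a probe string."""
    return len(embeddings.embed_query("dimension probe"))

# Stand-in for a real Embeddings implementation, used here only for illustration.
class DummyEmbeddings:
    def embed_query(self, text: str) -> list[float]:
        return [0.0] * 768  # pretend the model produces 768-dimensional vectors

print(infer_vector_size(DummyEmbeddings()))  # → 768
```

This costs one embedding call at startup, but avoids hardcoding the size anywhere.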
Same here.
I got it working after some modifications:
ADA_TOKEN_COUNT = 768
in local (site-packages): https://github.com/hwchase17/langchain/blob/199cb855eaf9cc7a2c3e671e96c59a8ea4d80dc8/langchain/vectorstores/pgvector.py#L22
- Enter PostgreSQL:
psql postgres
- List databases:
postgres=# \l
- Change to database:
postgres=# \c postgres
alter table langchain_pg_embedding alter column embedding type vector(768);
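For anyone scripting this workaround, the statement can be built programmatically; a minimal sketch (the function name is my own, and the default table name is the one langchain's pgvector store creates):

```python
def alter_embedding_column_sql(dims: int, table: str = "langchain_pg_embedding") -> str:
    """Build the ALTER TABLE statement that resizes the pgvector embedding column."""
    if dims <= 0:
        raise ValueError("vector dimensionality must be positive")
    return f"alter table {table} alter column embedding type vector({dims});"

print(alter_embedding_column_sql(768))
# → alter table langchain_pg_embedding alter column embedding type vector(768);
```

Execute the resulting statement with your PostgreSQL driver of choice (e.g. psycopg2's cursor.execute). Note that the cast fails if the table already contains vectors of a different dimensionality, so this is only safe on an empty table.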
thank you for pointing that out @bukosabino . Definitely need a flexible solution here
Could an environment variable be a solution for this? I could create a PR for this if it would help.
pro:
con:
@woodworker that would be better than the current implementation, but passing it as an argument somewhere would definitely be better
I tried to find a parameter-based solution, but the EmbeddingStore
in pgvector.py is pretty hardwired into all the sqlalchemy models:
https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/pgvector.py#L70-L87
I'll create a PR for the env-based solution, and maybe someone (maybe even me at a later time) will find a better parameterized solution
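A sketch of what the env-based solution might look like (PGVECTOR_VECTOR_SIZE is the proposed variable name, not an existing langchain setting; the 1536 fallback mirrors the current hardcoded ADA_TOKEN_COUNT):

```python
import os

# Read the vector size from the environment, defaulting to the current
# hardcoded value so existing OpenAI-based setups keep working unchanged.
ADA_TOKEN_COUNT = int(os.environ.get("PGVECTOR_VECTOR_SIZE", "1536"))

# The sqlalchemy column definition in pgvector.py would then become, e.g.:
#   embedding = sqlalchemy.Column(Vector(ADA_TOKEN_COUNT))
```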
waiting for #4203 to make it through
It seems @hwchase17 committed a fix some time ago (#7355) to make it None. My idea would be to create the BaseModel classes within a function, call it in the constructor, and apply the result to self. For example:
def make_models(dims: Optional[int] = None):
    class EmbeddingStore(BaseModel):
        ...
        embedding: Vector = sqlalchemy.Column(Vector(dims))

    return {
        'EmbeddingStore': EmbeddingStore
    }

class PGVector(VectorStore):
    def __init__(self, dims: Optional[int] = None):
        models = make_models(dims)
        self.EmbeddingStore = models['EmbeddingStore']
Or something like this.
I ran into the same issue. When will this be fixed? @bukosabino's approach works but we need a permanent fix
Is this issue fixed? It looks like the code has changed and the pgvector.py file no longer has the ADA_TOKEN_COUNT variable. It's now difficult to explicitly set this value to some other dimension, and the current one only works for OpenAI embeddings. What if we want to use sentence-transformer models? Some of these models have dimensions of only 768. Can we have a fix that explicitly lets us set the embedding dimension based on the chosen model, or can this be set dynamically?
This problem still exists. Is there any chance of a permanent fix?
I was able to use Hugging Face embeddings with the latest langchain. I did initially encounter this issue: creating a new collection did not create a new table, and my prior collection used an embedding model with a different vector dimension, so I got a similar error. I simply created a new database so that the new langchain_pg_embedding table would be created with an embedding column of the correct dimensions.
Same issue here, with latest langchain version 0.0.333. The index was created in ChromaDB by LLM model "google/flan-t5-xxl". When reusing the index from the persisted ChromaDB, I have tried embeddings with:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
None of them works. All throw a chromadb.errors.InvalidDimensionException: Embedding dimension 768 does not match collection dimensionality 1536 error.
Change to
embeddings = OpenAIEmbeddings()
or
embeddings = HuggingFaceEmbeddings(model_name="sangmini/msmarco-cotmae-MiniLM-L12_en-ko-ja")
then works as expected.
I believe those two models are 1536-dimensional. Links: the Hugging Face one: https://huggingface.co/sangmini/msmarco-cotmae-MiniLM-L12_en-ko-ja The OpenAI one: https://openai.com/blog/new-and-improved-embedding-model
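For reference, a small lookup of the dimensionalities of the models discussed in this thread, plus an illustrative helper (both the dict and the function are my own sketch, not chromadb API) that fails fast instead of raising InvalidDimensionException at insert time:

```python
# Dimensionalities of the embedding models discussed above.
KNOWN_EMBEDDING_DIMS = {
    "text-embedding-ada-002": 1536,                   # OpenAIEmbeddings default
    "sentence-transformers/all-mpnet-base-v2": 768,
    "hkunlp/instructor-xl": 768,
}

def check_collection_dims(model_name: str, collection_dim: int) -> None:
    """Raise early if the chosen model cannot write into an existing collection."""
    model_dim = KNOWN_EMBEDDING_DIMS.get(model_name)
    if model_dim is not None and model_dim != collection_dim:
        raise ValueError(
            f"Embedding dimension {model_dim} does not match "
            f"collection dimensionality {collection_dim}"
        )
```

For example, check_collection_dims("hkunlp/instructor-xl", 1536) raises, which matches the 768-vs-1536 error reported above.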
Not sure how langchain's integration with chromadb works, but for pgvector the issue arose from all collections utilizing the same table + column to store embeddings, which means the vectors can only be one size.
I have been setting the env var PGVECTOR_VECTOR_SIZE, which seems to work OK. Look up the vector_size for the model you are using first. Make sure to create the table with the right vector size column; otherwise you need to delete it and recreate it.
os.environ["PGVECTOR_VECTOR_SIZE"] = str(vector_size)

vectorstore = PGVector(
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
    collection_name=vector_name,
    # pre_delete_collection=True  # for testing purposes
)
@MarkEdmondson1234 how are you creating the table?
While trying to adapt this example https://docs.llamaindex.ai/en/latest/examples/vector_stores/postgres.html# to use a local embedding model, I was able to solve this issue by:
vector_size = 768

vector_store = PGVectorStore.from_params(
    database=db_name,
    host="localhost",
    password="mock",
    port=5432,
    user="mock",
    table_name="test",
    embed_dim=vector_size,  # ensure this matches your model's output dimensions
)
@MarkEdmondson1234
Where do I need to define the PGVECTOR_VECTOR_SIZE variable after setting it up in the .env file? os.environ["PGVECTOR_VECTOR_SIZE"] = str(vector_size)
PGVector works fine for me when coupled with OpenAIEmbeddings. However, when I try to use HuggingFaceEmbeddings, I get the following error:
StatementError: (builtins.ValueError) expected 1536 dimensions, not 768
Example code:
Output: