Error while indexing - Githubissues

Sachin-Bhat commented 2 months ago

Hello,

I was getting the following error while indexing my embeddings onto PGVector. It did not throw such errors when I was using the old langchain-community implementation. Wondering if the database schema that created with the new implementation had some changes. Any clarification on how to go about this would be very helpful.

[SQL: INSERT INTO langchain_pg_embedding (id, collection_id, embedding, document, cmetadata) VALUES (%(id_m0)s::VARCHAR, %(collection_id_m0)s::UUID, %(embedding_m0)s, %(document_m0)s::VARCHAR, %(cmetadata_m0)s::JSONB) ON CONFLICT (id) DO UPDATE SET embedding = excluded.embedding, document = excluded.document, cmetadata = excluded.cmetadata][parameters: {'id_m0': 'eb11c383-782a-491d-925d-fce35340e169', 'collection_id_m0': UUID('bc218a7a-3b99-4685-aabf-4800bd6fe6f0'), 'embedding_m0': '[0.0012261948,-0.0032268283,-0.021794429,0.028127616,0.025112469,0.00040851647,-0.010263896,-0.00914182,0.035575997,0.07880689,0.00234741,0.033373147 ... (12471 characters truncated) ... 018,-0.03224419,0.011916031,0.005445165,0.004877243,-0.06553473,-0.051298827,0.0001616641,0.07010564,0.040146906,0.021629214,-0.01095917,0.010195057]', 'document_m0': "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur vitae ex in ante fermentum tincidunt. Donec et velit molestie, faucibus sapien vel, viverra nunc. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Praesent rhoncus lacinia semper. Vestibulum non magna ac est egestas tempor. Suspendisse tempor mi nec vestibulum egestas. Vivamus a efficitur nisi, eget molestie nibh. Praesent at lorem lacus. Vivamus id nisl efficitur, faucibus justo nec, varius arcu. Praesent elementum sem euismod finibus commodo. In imperdiet tincidunt metus eget suscipit. Curabitur consectetur mi vitae sapien placerat mattis. Phasellus pellentesque ante id felis congue, tempus. \n", 'cmetadata_m0': Jsonb({'chunk_index': 5, 'id': 'eb11c383- ... (4083 chars))}](Background on this error at: https://sqlalche.me/e/20/f405)

Cheers, Sachin

eyurtsev commented 2 months ago

I changed the schema due to issues with the community implementation.

Easiest solution is to re-create the schema if you don't mind re-indexing the data.

If you don't want to re-index the data, connect to the postgres instance and you'll need to write a bit of SQL to make the migration.

If I'm not mistaken the main change in the schema:

1) Rename custom_id to id in langchain_pg_embedding 2) JSON column was changed to JSONB

Sachin-Bhat commented 2 months ago

Hey @eyurtsev,

Tried writing the necessary SQL to update the schema as specified by you. However, when I try to index documents, I get the following error:

Error indexing document with id 000218462 and chunk_id e49fa716-422a-430a-b633-7d0901ed45d3: (psycopg.errors.InvalidColumnReference) there is no unique or exclusion constraint matching the ON CONFLICT specification

Cheers, Sachin

eyurtsev commented 2 months ago

You need something like this to add a uniqueness constraint. This fixes a bug that is present in the community implementation and will make sure that you won't have duplicated content if you're indexing by id

ALTER TABLE [your_table_name]
ADD CONSTRAINT id_unique UNIQUE (id);

Double check the commands / that it matches what you have in the database. Make a backup if you need before running it etc.

Sachin-Bhat commented 2 months ago

Hey @eyurtsev,

I have tried to replicate the langchain-community indexed documents table schema to match that of langchain-postgres as closely as possible. I do still get some errors:

Error indexing document with id 000134955 and chunk_id 67cf6d54-7885-47ab-a42f-91367c070fd7: (psycopg2.errors.NotNullViolation) null value in column "uuid" of relation "langchain_pg_embedding" violates not-null constraint
DETAIL:  Failing row contains (058cabb5-a81a-4fee-b77c-3a3acdc04454, [0.019579852,-0.03575338,-0.01300123,0.0077682347,0.020762963,-0..., lorem ipsum ..., {"id": "67cf6d54-7885-47ab-a42f-91367c070fd7", "Ack": "", "CveId..., 67cf6d54-7885-47ab-a42f-91367c070fd7, null).

[SQL: INSERT INTO langchain_pg_embedding (id, collection_id, embedding, document, cmetadata) VALUES (%(id_m0)s, %(collection_id_m0)s::UUID, %(embedding_m0)s, %(document_m0)s, %(cmetadata_m0)s) ON CONFLICT (id) DO UPDATE SET embedding = excluded.embedding, document = excluded.document, cmetadata = excluded.cmetadata]
[parameters: {'id_m0': '67cf6d54-7885-47ab-a42f-91367c070fd7', 'collection_id_m0': UUID('058cabb5-a81a-4fee-b77c-3a3acdc04454'), 'embedding_m0': '[0.019579852,-0.03575338,-0.01300123,0.0077682347,0.020762963,-0.018955793,0.016758585,0.015887503,0.014431365,0.043164082,-0.05083481,0.036741473,0. ... (12432 characters truncated) ... .040121794,-0.0029317772,0.023233198,-0.013586285,-0.043216087,-0.03837963,0.033075128,0.036013406,0.034115225,-0.023662237,-0.03458327,0.0066761314]', 'document_m0': 'lorem ipsum sit amet.\n', 'cmetadata_m0': 'metadata items'}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

Any help here would be much appreciated.

Cheers, Sachin

langchain-ai / langchain-postgres

Error while indexing #30