Closed SebG-js closed 1 month ago
In postgreSQL, the way to avoid inserting duplicate rows is add a UNIQUE
constraint at create table. This is very trivial for postgreSQL native types like meta(JSONB) and text(TEXT), but impossiable for embedding(VECTOR). UNIQUE
contraint is only available for btree
index, which is not implemented by PGVecto.rs type VECTOR.
However, a vector
is usually generated from a text
by an embedding model like openai embedding
. A same text will always generate a same vector. It's almost impossible for different text to generate a totally same vector. Thus we can say, text and vector are one-to-one mapping to a certain extent.
Therefore add a UNIQUE
constraint to text is the same to a UNIQUE
constraint at vector.
In next release, we will provide a way to add UNIQUE
constraint to:
And here is the usage of them:
# Constraint on text
client1 = PGVectoRs(
db_url=URL,
collection_name="client1",
dimension=3,
recreate=True,
constraints=[Unique(columns=[Column.TEXT])],
)
# Constraint on meta
client2 = PGVectoRs(
db_url=URL,
collection_name="client2",
dimension=3,
recreate=True,
constraints=[Unique(columns=[Column.META])],
)
# Constraint on text & meta
client3 = PGVectoRs(
db_url=URL,
collection_name="client3",
dimension=3,
recreate=True,
constraints=[Unique(columns=[Column.TEXT, Column.META])],
)
# Constraint on text | meta
client3 = PGVectoRs(
db_url=URL,
collection_name="client3",
dimension=3,
recreate=True,
constraints=[Unique(columns=[Column.TEXT]), Unique(columns=[Column.META])],
)
The feature PR is ready and will be posted for one week. Please let me know if you have any questions or comments.
Hello Thank you for your explanations. I understand very well the usecase of the AND constraint like text + meta. I am not sure to understand the OR usecase (Constraint on text | meta). It could be tricky.
With the method you propose, it will prevent inserting duplicates. Nice!
To modify a previous record, we have to delete it and recreate it ?
I am not sure to understand the OR usecase (Constraint on text | meta).
Let me explain it in a simpler way. An OR constraint can be explained as:
Clause 1 can be set as constraint on text
, and clause 2 can be set as constraint on meta
, while clause 3 can be covered by any of them.
So a OR constraint can be simply translated as two constraints~
How to avoid inserting duplicate records in the database ?
Duplicates could concern every field : text, embedding, metadata dict or only some fields like text and/or embedding
alternatives : 1) insert with bool option ignore exact duplicate
2) insert with bool option ignore exact duplicate + duplicate keys list
3) to be defined update method ?