jkomoros / card-web

The web app behind thecompendium.cards
Apache License 2.0
46 stars, 8 forks

There appear to be duplicate embeddings in the production embedding store #691

Closed jkomoros closed 3 months ago

jkomoros commented 3 months ago

Running reindexCardEmbeddings appears to add embeddings, even ones that should already be in the store?

If you look at any given point ID and search for similar points, you'll find a huge number of duplicates with the same card ID and version.
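
For anyone who wants to confirm the duplication offline, here is a minimal sketch of grouping points by card ID and version. The `StoredPoint` shape and the `findDuplicateGroups` helper are hypothetical names made up for illustration, not the repo's actual schema or API, and the sample data is invented.

```typescript
// Hypothetical shape of a stored point. These field names are assumptions,
// not the actual schema used by card-web.
interface StoredPoint {
  pointID: string;
  cardID: string;
  version: number;
}

// Groups points by (cardID, version) and returns only the groups that
// contain more than one point, keyed by "cardID@version".
function findDuplicateGroups(points: StoredPoint[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const point of points) {
    const key = `${point.cardID}@${point.version}`;
    const ids = groups.get(key) ?? [];
    ids.push(point.pointID);
    groups.set(key, ids);
  }
  const duplicates = new Map<string, string[]>();
  groups.forEach((ids, key) => {
    if (ids.length > 1) duplicates.set(key, ids);
  });
  return duplicates;
}

// Invented sample data: card-a version 0 is stored twice.
const samplePoints: StoredPoint[] = [
  { pointID: 'p1', cardID: 'card-a', version: 0 },
  { pointID: 'p2', cardID: 'card-a', version: 0 },
  { pointID: 'p3', cardID: 'card-b', version: 0 },
];

const duplicateGroups = findDuplicateGroups(samplePoints);
duplicateGroups.forEach((ids, key) => {
  console.log(`${key}: ${ids.length} points (${ids.join(', ')})`);
});
```

In a healthy store every (card ID, version) pair should map to exactly one point, so any non-empty result points at the problem described above.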

jkomoros commented 3 months ago

Yeah, the number of vectors in production is about 100k instead of the expected ~15k, roughly 7x larger than expected, likely due to these extra re-embeddings.

jkomoros commented 3 months ago

My guess is that the problem is in reindexCardEmbeddings: it bulk-fetches all of the items, but the cardsInfo is coming back incorrectly. It looks like it gets the content field but not the card_id field?
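
To make the suspected failure mode concrete, here is a rough sketch of how a missing card_id in the fetched payloads would cause every card to look un-embedded. This is not the actual reindexCardEmbeddings code; `FetchedPayload`, `Card`, and `cardsNeedingEmbedding` are hypothetical names, and the matching logic is an assumption about how such a check might work.

```typescript
// Hypothetical payload returned by a bulk fetch. card_id is optional here
// to model the suspected bug where only the content field comes back.
interface FetchedPayload {
  card_id?: string;
  content: string;
}

// Hypothetical minimal card shape.
interface Card {
  id: string;
  content: string;
}

// Returns the cards whose current content has no matching stored payload.
function cardsNeedingEmbedding(cards: Card[], fetched: FetchedPayload[]): Card[] {
  const existing = new Map<string, string>();
  for (const payload of fetched) {
    // A payload without card_id cannot be matched to any card, so it is skipped.
    if (payload.card_id === undefined) continue;
    existing.set(payload.card_id, payload.content);
  }
  return cards.filter(card => existing.get(card.id) !== card.content);
}

const cards: Card[] = [
  { id: 'card-a', content: 'hello' },
  { id: 'card-b', content: 'world' },
];

// Payloads as they should come back: nothing needs re-embedding.
const completePayloads: FetchedPayload[] = [
  { card_id: 'card-a', content: 'hello' },
  { card_id: 'card-b', content: 'world' },
];

// Payloads as suspected in this issue: content present, card_id missing.
const brokenPayloads: FetchedPayload[] = [
  { content: 'hello' },
  { content: 'world' },
];

const withIDs = cardsNeedingEmbedding(cards, completePayloads);
const withoutIDs = cardsNeedingEmbedding(cards, brokenPayloads);
console.log(withIDs.length, withoutIDs.length);
```

Under these assumptions, every run with the broken payloads would treat all cards as new and store a fresh embedding for each one, which would match the growth seen in production.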

jkomoros commented 3 months ago

BTW this "lots of duplicates of the same card content and embedding" is likely why the semanticSort in #688 was finding so many non-existent embeddings? Maybe? Because it was fetching a random embedding for that cardID?

jkomoros commented 3 months ago

The bug, which has now been fixed, was causing duplicate embeddings to be stored every time reindexCardEmbeddings ran, which was on every deploy.

jkomoros commented 3 months ago

This is now fixed and deployed to production.