vanna-ai / vanna

🤖 Chat with your SQL database 📊. Accurate Text-to-SQL Generation via LLMs using RAG 🔄.
https://vanna.ai/docs/
MIT License
11.04k stars, 862 forks

fix: avoid duplicate DDL and question-sql in chromadb #336

Closed. everdark closed this 6 months ago

everdark commented 6 months ago

A potential patch for https://github.com/vanna-ai/vanna/issues/330

andreped commented 6 months ago

@everdark Thank you for contributing! :]

Hmm, this solution would not be backward compatible, though. If a user had an existing vector store, duplicates would still occur after updating vanna. Perhaps adding hash IDs for each item is the best solution, but is there something obvious we are missing?

Thoughts, @zainhoda?
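
For reference, the hash-ID idea could be sketched roughly as below. This is only an illustration under assumptions, not the code in this PR: `content_id` and `InMemoryStore` are hypothetical names, and the dictionary-backed store merely stands in for a collection that is keyed by ID. The point is that an ID derived from the content makes repeated training on the same item overwrite one entry instead of adding a new one.

```python
import hashlib
import uuid


def content_id(text: str, suffix: str) -> str:
    """Derive a stable ID from the content itself (hypothetical helper)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # uuid5 is deterministic: the same namespace and name always give the same UUID.
    return f"{uuid.uuid5(uuid.NAMESPACE_OID, digest)}-{suffix}"


class InMemoryStore:
    """Stand-in for a collection keyed by ID; this is not a chromadb API."""

    def __init__(self) -> None:
        self.items: dict[str, str] = {}

    def upsert(self, item_id: str, document: str) -> None:
        # Writing to an existing ID replaces the entry rather than duplicating it.
        self.items[item_id] = document


store = InMemoryStore()
ddl = "CREATE TABLE users (id INT PRIMARY KEY)"

# Training on the same DDL three times leaves a single entry in the store.
for _ in range(3):
    store.upsert(content_id(ddl, "ddl"), ddl)
```

Items that were stored earlier under random IDs would not be matched by this scheme, which is the backward-compatibility concern raised above.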

zainhoda commented 6 months ago

This looks like a pretty reasonable fix -- it will only work for Chroma, but there's very little downside in adding this.

andreped commented 6 months ago

@zainhoda Just realized that we did not address documentation duplicates in this PR. This means that documentation training currently behaves differently from SQL and DDL training on the main branch: https://github.com/vanna-ai/vanna/pull/336/files#diff-c1935d21483004aa0c17dc5943337f3448e868df04ee2e31318e745e46138857L87

zainhoda commented 6 months ago

@andreped Yes -- that's a good point. I'll change that.