Avoid importing the same files multiple times into a vector store

patterns-ai-core / langchainrb

Build LLM-powered applications in Ruby

https://rubydoc.info/gems/langchainrb

MIT License


Avoid importing the same files multiple times into a vector store #667

Closed: ahx closed this issue 1 week ago

ahx commented 2 weeks ago

Is your feature request related to a problem? Please describe.

When using Langchain::Vectorsearch::Pgvector#add_data with file paths, I noticed that it imports the same file again if it was already imported in a previous run, which I did not expect to happen.

Describe the solution you'd like

I was thinking of adding a checksum column or the like to the default schema and skipping files with the same checksum.

Describe alternatives you've considered

I tried adding the logic described above myself, but found it too hacky, and I think it makes sense to have this behaviour by default.
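For illustration, the checksum idea could be sketched at the application level roughly as follows. This is a minimal sketch, not part of Langchain.rb: the `ImportManifest` class, its `unseen` and `record` methods, and the JSON manifest file are all hypothetical names chosen for this example. It keeps the checksums in a local file rather than in a schema column.

```ruby
require "digest"
require "json"
require "set"

# Hypothetical helper (not part of Langchain.rb): remembers the SHA-256
# checksums of files that were already imported, stored in a JSON file.
class ImportManifest
  def initialize(manifest_path)
    @manifest_path = manifest_path
    @checksums =
      if File.exist?(manifest_path)
        Set.new(JSON.parse(File.read(manifest_path)))
      else
        Set.new
      end
  end

  # Returns only the paths whose content has not been recorded yet.
  # Files with identical content inside the same batch are returned once.
  def unseen(paths)
    seen_in_batch = Set.new
    paths.select do |path|
      sum = checksum(path)
      !@checksums.include?(sum) && seen_in_batch.add?(sum)
    end
  end

  # Records the given paths as imported and persists the manifest.
  def record(paths)
    paths.each { |path| @checksums << checksum(path) }
    File.write(@manifest_path, JSON.generate(@checksums.to_a))
  end

  private

  def checksum(path)
    Digest::SHA256.file(path).hexdigest
  end
end

# Possible usage, assuming `vectorstore` is an already configured
# Langchain::Vectorsearch::Pgvector instance:
#
#   manifest = ImportManifest.new("imported_files.json")
#   new_paths = manifest.unseen(Dir["docs/*"])
#   unless new_paths.empty?
#     vectorstore.add_data(paths: new_paths)
#     manifest.record(new_paths)
#   end
```

Because the checksum is computed from file content, a renamed file is skipped while an edited file is imported again. Removing the stale chunks of an edited file is outside the scope of this sketch.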

Additional context

Thanks for the work on this project.

ahx commented 2 weeks ago

If this is something that makes sense to you I can try to compile a PR.

andreibondarev commented 1 week ago

@ahx I'm not convinced that this belongs in Langchain.rb. Shouldn't the developer do this in their own application instead?

ahx commented 1 week ago

@andreibondarev I think I understand what you mean. The main reason I created this ticket was that I stumbled across this while building a demo to show Langchain.rb to my colleagues, and without this it does not demo that nicely. But I totally agree. In a production setup you probably have that step implemented before the indexing part.

I'll close this issue. Thanks for the feedback. 🚀

andreibondarev commented 1 week ago

@ahx I'm open to considering other approaches. Maybe there's a way to get rid of redundant chunks AFTER all the files have been indexed into a vector DB. Or perhaps you can enforce uniqueness at the vector DB level by passing in metadata. Which vector DB are you using?
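One way to picture the "remove redundant chunks after indexing" idea is the sketch below. It is only an assumption about how this might look: `redundant_chunk_ids` is a hypothetical helper, and the `{ id:, content: }` row shape is assumed rather than taken from any vector DB schema. Fetching the rows and deleting the returned ids is left to the caller.

```ruby
require "digest"

# Hypothetical helper: given rows shaped like { id:, content: }, returns the
# ids of rows whose whitespace-normalized content duplicates an earlier row.
# The first occurrence of each distinct content is kept.
def redundant_chunk_ids(rows)
  seen = {}
  rows.each_with_object([]) do |row, redundant|
    normalized = row[:content].strip.gsub(/\s+/, " ")
    key = Digest::SHA256.hexdigest(normalized)
    if seen.key?(key)
      redundant << row[:id]
    else
      seen[key] = row[:id]
    end
  end
end

rows = [
  { id: 1, content: "foo bar" },
  { id: 2, content: "baz" },
  { id: 3, content: " foo  bar\n" }
]
redundant_chunk_ids(rows) # => [3]
```

This only catches exact duplicates after whitespace normalization. Chunks that overlap or differ slightly would need a similarity-based comparison instead.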