Closed ahx closed 1 week ago
If this is something that makes sense to you, I can try to put together a PR.
@ahx I'm not convinced that this belongs in Langchain.rb. Shouldn't the developer do this in their own application instead?
@andreibondarev I think I understand what you mean. The main reason I created this ticket was that I stumbled across this while building a demo to show Langchain.rb to my colleagues, and without it the demo does not come across that nicely. But I totally agree. In a production setup you would probably have that step implemented before the indexing part.
I'll close this issue. Thanks for the feedback. 🚀
@ahx I'm open to considering other approaches. Maybe there's a way to get rid of redundant chunks AFTER all the files have been indexed into a vector DB. Or perhaps you can enforce uniqueness at the vector DB level by passing in metadata. Which vector DB are you using?
Is your feature request related to a problem? Please describe. When using Langchain::Vectorsearch::Pgvector#add_data with file paths, I noticed that it imports the same file again if it was already imported in a previous run, which I did not expect to happen.
Describe the solution you'd like I was thinking of adding a checksum column or the like to the default schema and skipping files with the same checksum.
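To make the checksum idea concrete, here is a minimal sketch of the filtering step, done on the application side. It is only an illustration under assumptions: `unseen_paths` is a hypothetical helper (not part of Langchain.rb), and the set of known checksums is kept in memory here, whereas a real setup would persist it somewhere (for example in a checksum column, as suggested above).

```ruby
require "digest"
require "set"

# Hypothetical helper, not part of Langchain.rb.
# Returns the paths whose SHA-256 checksum is not yet in `known_checksums`
# and records each new checksum in that set as a side effect.
def unseen_paths(paths, known_checksums)
  paths.select do |path|
    checksum = Digest::SHA256.file(path).hexdigest
    # Set#add? returns nil if the checksum was already present,
    # so files with identical content are filtered out.
    known_checksums.add?(checksum)
  end
end

# Possible usage (assumes `client` is an already configured
# Langchain::Vectorsearch::Pgvector instance and that the checksums
# are loaded from wherever the application persists them):
#
#   known = Set.new
#   new_paths = unseen_paths(Dir.glob("docs/*.txt"), known)
#   client.add_data(paths: new_paths) unless new_paths.empty?
```

Because the checksum is computed over file content, a renamed copy of an already imported file would be skipped as well, while an edited file would be imported again. Whether that is the desired behaviour depends on the application.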
Describe alternatives you've considered I tried adding the logic described above myself, but it felt too hacky, and I think this behaviour makes sense as a default.
Additional context Thanks for the work on this project.