snexus / llm-search

Querying local documents, powered by LLM
MIT License

Progress while creating index from documents #51

Closed: ImVexed closed this 1 year ago

ImVexed commented 1 year ago

I have a couple of PDFs: llmsearch.parsers.splitter:split:74 - Got 279643 chunks for type: pdf
I would really love to see how far along llmsearch.chroma:create_index_from_documents:38 - Generating and persisting the embeddings.. is, as it's been a few hours now and I'm not sure if it's stuck or if this is a hopeless amount of data to index and I'm only at 1%.

snexus commented 1 year ago

Good idea, will implement. 279k is a lot. What is the total size of the documents, and what chunk size are you using? I would start with a single, largish chunk size. For me, on a Ryzen 5800, it takes around 40 minutes to process 60k chunks (~600 MB of documents with chunk size 1024).
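
In the meantime, a rough way to gauge progress is to poll the persisted collection's count from a second process and compare it against the chunk total from the splitter log. This is a minimal sketch, not llm-search code: the persist path and collection name are assumptions, and whether a concurrent read works depends on the Chroma backend in use.

```python
# Hypothetical progress check: poll the persisted Chroma collection from a
# second process and compare its count against the chunk total from the log.
import time

import chromadb

TOTAL_CHUNKS = 279_643               # from the splitter log line
PERSIST_PATH = "output/embeddings"   # assumed persist directory
COLLECTION = "llm-search"            # assumed collection name

client = chromadb.PersistentClient(path=PERSIST_PATH)
collection = client.get_collection(COLLECTION)

while True:
    done = collection.count()  # embeddings written so far
    print(f"{done}/{TOTAL_CHUNKS} chunks ({done / TOTAL_CHUNKS:.1%})")
    if done >= TOTAL_CHUNKS:
        break
    time.sleep(60)
```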

ImVexed commented 1 year ago

It's about 2 GB of PDFs. I left the chunk size at the default of 1024, but I have a 4090 and a 7950X with 128 GB of RAM; should I raise it and restart indexing?

snexus commented 1 year ago

1024 should be good from a retrieval perspective. I would start with a smaller number of documents to check that everything works, then reindex all of them. I am using Chroma as the vector DB, so the actual implementation of embeddings generation is hidden. The preprocessing/parsing itself, as you probably saw, is relatively fast, but Chroma internally takes a long time to generate the embeddings. Will dig into it; it shouldn't take such a long time on a powerful system like yours.
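
For reference, the usual way to make the embedding step observable is to add chunks to the collection in batches and wrap the batch loop in a progress bar, so each add call advances the bar as Chroma computes embeddings. A minimal sketch of that pattern, not the project's actual code; the path, collection name, batch size, and the `chunks` stand-in data are all illustrative:

```python
# Sketch: batched inserts into a Chroma collection with a tqdm progress bar.
import chromadb
from tqdm import tqdm

# Stand-in for the (id, text) pairs the splitter would produce.
chunks = [(f"doc-{i}", f"chunk text {i}") for i in range(2000)]

client = chromadb.PersistentClient(path="output/embeddings")  # assumed path
collection = client.get_or_create_collection("llm-search")    # assumed name

BATCH = 512
for start in tqdm(range(0, len(chunks), BATCH), desc="Embedding chunks"):
    batch = chunks[start:start + BATCH]
    collection.add(
        ids=[cid for cid, _ in batch],
        documents=[text for _, text in batch],
    )  # embeddings are computed per batch by the collection's embedding function
```

Batching also keeps memory bounded, since each batch is embedded and persisted as it arrives rather than all chunks being held at once.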

ImVexed commented 1 year ago

The sunk cost is kicking in; I think I'll let it go for another few hours and then kill it if it still hasn't finished. Thanks!

snexus commented 1 year ago

Done in v0.3.1