Closed by ImVexed 1 year ago
Good idea, will implement. 279k is a lot. What is the total size of the documents, and what chunk size are you using? I'd start with a single, largish chunk size. For me, on a Ryzen 5800 it takes around 40 minutes to process 60k chunks (~600 MB of documents with chunk size 1024).
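For what it's worth, those numbers give a rough back-of-the-envelope ETA for a 279k-chunk job. This is just a sketch assuming throughput scales linearly with chunk count (a GPU-backed embedder would likely be much faster than the CPU figure quoted here):

```python
# Rough ETA, assuming linear scaling.
# Throughput comes from the comment above: ~60k chunks in ~40 min on a Ryzen 5800.
chunks_per_min = 60_000 / 40              # ~1500 chunks/min
total_chunks = 279_643                    # from the splitter log in the issue
eta_minutes = total_chunks / chunks_per_min
print(f"~{eta_minutes:.0f} min (~{eta_minutes / 60:.1f} h)")  # ~186 min (~3.1 h)
```

So on comparable CPU throughput, a few hours is plausible rather than stuck.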
It's about 2 GB of PDFs. I left the chunk size at the default of 1024, but I have a 4090 and a 7950X with 128 GB of RAM. Should I raise it and restart indexing?
1024 should be good from a retrieval perspective. I'd start with a smaller number of documents to check that everything works, and then reindex all of them. I'm using Chroma as the vector DB, so the actual implementation of embedding generation is hidden. The preprocessing/parsing itself, as you've probably seen, is relatively fast, but Chroma internally takes a long time to generate the embeddings. I'll dig into it; it shouldn't take such a long time on a powerful system like yours.
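One common workaround for the "no progress visibility" problem is to embed and persist in batches yourself instead of handing the whole corpus to the vector store in one call. This is a hypothetical sketch, not llmsearch's actual code: `embed_fn` and `add_fn` stand in for whatever embedding model and store you use (e.g. a Chroma collection's `add` method):

```python
from typing import Callable, List, Sequence

def add_in_batches(
    chunks: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],   # placeholder embedder
    add_fn: Callable[[List[str], List[str], List[List[float]]], None],  # placeholder store
    batch_size: int = 512,
) -> None:
    """Embed and persist chunks batch by batch, printing progress as we go."""
    total = len(chunks)
    for start in range(0, total, batch_size):
        batch = list(chunks[start:start + batch_size])
        ids = [str(i) for i in range(start, start + len(batch))]
        add_fn(ids, batch, embed_fn(batch))
        done = start + len(batch)
        print(f"indexed {done}/{total} chunks ({100 * done // total}%)")
```

Batching also keeps memory bounded, which matters at the 279k-chunk scale discussed here.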
The sunk cost is kicking in. I think I'll let it run for another few hours and then kill it if it still hasn't finished. Thanks!
Done in v0.3.1
I have a couple of PDFs:

```
llmsearch.parsers.splitter:split:74 - Got 279643 chunks for type: pdf
```

and I'd really love to see how far along

```
llmsearch.chroma:create_index_from_documents:38 - Generating and persisting the embeddings..
```

is, as it's been a few hours now and I'm not sure if it's stuck, or if this is a hopeless amount of data to index and I'm only at 1%.