nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

when trying to load multiple documents with joblib, get error cannot pickle #27

Closed player1024 closed 8 months ago

player1024 commented 8 months ago

I am trying to parallelize the ingestion of multiple locally stored PDFs into my vector store.

When I try to load multiple documents with joblib, I get a pickling error:

PicklingError: Could not pickle the task to send it to the workers.

Is this because llmsherpa makes an API call to an external server for every PDF I load? What would be a workaround? Should I make this async, and if so, how?

I think this is important for production.

Thank you.

ansukla commented 8 months ago

Hi @player1024 - Please share your code. Parallelizing this should be similar to parallelizing any IO task. I think it will be better to create a separate LayoutPDFReader instance for each thread rather than reuse the same one.
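A minimal sketch of that suggestion, assuming the work is I/O-bound (each PDF is one HTTP request to the parser), so threads are enough and nothing needs to be pickled. The names `PARSER_API_URL`, `make_llmsherpa_reader`, `parse_one`, and `parse_many` are hypothetical, and the URL is a placeholder, not a real endpoint. Only `LayoutPDFReader(parser_api_url)` and `read_pdf(path)` come from llmsherpa itself.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder (assumption): replace with the parser endpoint you actually use.
PARSER_API_URL = "https://example.invalid/api/parseDocument"


def make_llmsherpa_reader():
    """Build a fresh reader. llmsherpa is imported lazily so this
    module can be loaded even where the package is not installed."""
    from llmsherpa.readers import LayoutPDFReader
    return LayoutPDFReader(PARSER_API_URL)


def parse_one(pdf_path, make_reader):
    """Parse a single PDF with its own reader, so no reader object
    is ever shared between worker threads."""
    reader = make_reader()
    return pdf_path, reader.read_pdf(pdf_path)


def parse_many(pdf_paths, make_reader=make_llmsherpa_reader, max_workers=4):
    """Parse PDFs concurrently and return a dict of path -> parsed document.

    Threads share memory, so tasks are not pickled. If any single
    parse raises, the exception propagates when results are collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda path: parse_one(path, make_reader), pdf_paths)
        return dict(results)
```

Calling `parse_many(["a.pdf", "b.pdf"])` would then return one parsed document per path. If you prefer to stay with joblib, `Parallel(n_jobs=4, prefer="threads")` asks for a thread-based backend, which may avoid the pickling step as well; whether that resolves this particular error depends on what the original task captured.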

ansukla commented 8 months ago

Closing the issue as it has been resolved.