Open main-2983 opened 3 days ago
The issue you're experiencing with the `IngestionPipeline` cache not working when `num_workers` is greater than 1 might be related to how the cache is accessed and managed across multiple processes. When `num_workers` is set to more than 1, the pipeline uses a `ProcessPoolExecutor` to run transformations in parallel, and each process can access the cache independently if it is not disabled [1].
However, there might be specific conditions or configurations in your setup that could lead to an empty cache file when using multiple workers, but I couldn't find specific details on this in the repository. It might be worth checking if the cache is being correctly initialized and accessed in each process, and ensuring that the cache is not being inadvertently disabled or cleared during the process execution.
I need to remove this parameter; it doesn't work half the time and, tbh, is usually broken.
Set it to 1. Or use async with `pipeline.arun()` to achieve concurrency. Or multi-thread it outside of the class in your code.
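For the "multi-thread it outside of the class" option, here is a rough sketch of what that could look like. It assumes only that the pipeline object exposes `run(documents=...)` and returns a list of nodes, with `num_workers` left unset. `make_batches`, `run_in_threads`, and `StandInPipeline` are hypothetical names made up for illustration; the stand-in class exists only so the snippet runs without llama_index installed. Whether the pipeline and its cache are safe to share across threads is not confirmed anywhere in this thread, so treat that as an assumption to verify.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List, Sequence


def make_batches(items: Sequence[Any], n_batches: int) -> List[List[Any]]:
    """Split items into contiguous batches, at most n_batches of them."""
    if not items:
        return []
    size = -(-len(items) // max(1, n_batches))  # ceiling division
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def run_in_threads(pipeline: Any, documents: Sequence[Any], max_threads: int = 4) -> List[Any]:
    """Call pipeline.run() once per batch, each call on its own thread.

    Assumption: `pipeline` has a run(documents=...) method that returns a list.
    num_workers is deliberately not passed, so no process pool is involved.
    """
    batches = make_batches(documents, max_threads)
    if not batches:
        return []
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        # pool.map yields results in the same order as the input batches.
        results = pool.map(lambda batch: pipeline.run(documents=batch), batches)
        return [node for nodes in results for node in nodes]


class StandInPipeline:
    """Hypothetical stand-in for a real pipeline; it just upper-cases strings."""

    def run(self, documents: Sequence[str]) -> List[str]:
        return [doc.upper() for doc in documents]


if __name__ == "__main__":
    print(run_in_threads(StandInPipeline(), ["a", "b", "c", "d", "e"], max_threads=2))
```

With a real pipeline you would pass your `IngestionPipeline` instance in place of `StandInPipeline()`. Because threads share one process, they also share the in-memory pipeline object, unlike the process pool used when `num_workers` > 1.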
Bug Description
When `num_workers` > 1, the llama_cache file is empty. When `num_workers=1`, the IngestionPipeline can cache normally.
Version
0.11.17
Steps to Reproduce
Relevant Logs/Tracebacks
No response