run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: IngestionPipeline Cache does not work with num_workers > 1 #16605

Open main-2983 opened 3 days ago

main-2983 commented 3 days ago

Bug Description

When num_workers > 1, the persisted llama_cache file is empty. When num_workers = 1, the IngestionPipeline caches as expected.

Version

0.11.17

Steps to Reproduce

from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.openai import OpenAIEmbedding

if __name__ == '__main__':
    # FooNodeParser, vector_store, and path are placeholders for my own
    # node parser, vector store, and data directory.
    pipeline = IngestionPipeline(
        transformations=[
            FooNodeParser(),
            OpenAIEmbedding(model="text-embedding-3-small"),
        ],
        vector_store=vector_store,
    )
    documents = SimpleDirectoryReader(path).load_data(
        show_progress=True, num_workers=4
    )
    # With num_workers=4 the persisted cache ends up empty.
    nodes = pipeline.run(documents=documents, num_workers=4)
    pipeline.persist("cache")

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 3 days ago

The issue with the IngestionPipeline cache not working when num_workers is greater than 1 is likely related to how the cache is accessed and managed across processes. When num_workers is greater than 1, the pipeline runs transformations in parallel with a ProcessPoolExecutor, and each worker process accesses the cache independently if caching is not disabled [1].

However, I couldn't find anything in the repository that specifically explains an empty cache file when multiple workers are used. It may be worth checking that the cache is correctly initialized and accessed in each process, and that it is not being inadvertently disabled or cleared during execution.
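
As a rough sketch of that check (assuming the FooNodeParser, vector_store, and documents placeholders from the report above; exact attribute names may differ between versions), you can pass an explicit IngestionCache and confirm caching is enabled before running:

from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.embeddings.openai import OpenAIEmbedding

# Pass an explicit cache so it can be inspected after the run.
cache = IngestionCache()
pipeline = IngestionPipeline(
    transformations=[
        FooNodeParser(),
        OpenAIEmbedding(model="text-embedding-3-small"),
    ],
    vector_store=vector_store,
    cache=cache,
)
print("cache disabled?", pipeline.disable_cache)  # should print False
nodes = pipeline.run(documents=documents, num_workers=1)  # caching works at 1 worker
pipeline.persist("cache")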

To continue talking to Dosu, mention @dosu.

logan-markewich commented 3 days ago

I need to remove this parameter; it doesn't work half the time and, tbh, is usually broken.

Set num_workers to 1. Or use async with pipeline.arun() to achieve concurrency. Or multi-thread it outside of the class in your own code.
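
A minimal sketch of the arun() route, assuming the same pipeline and documents objects from the report; arun() gets its concurrency from asyncio rather than from worker processes:

import asyncio

# Run the pipeline with async concurrency instead of num_workers processes.
async def ingest():
    return await pipeline.arun(documents=documents)

nodes = asyncio.run(ingest())
pipeline.persist("cache")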