Closed americanthinker closed 10 months ago
🤖
It's great that you've identified this issue with the num_workers parameter causing the IngestionPipeline.run method to hang. Your detailed description and the relevant logs/tracebacks will be very helpful for the maintainers to understand the problem.
If you're able to propose a fix or even volunteer to create a pull request to address this bug, it would be a valuable contribution to the LlamaIndex project. Your input and contributions are highly appreciated!
@americanthinker What OS is this on?
Interesting. Will take a look shortly!
@logan-markewich & @nerdai running on an Azure DSVM image Ubuntu 20.04. I write multiprocessing code all day long on this machine.
@americanthinker did you try reducing the number of workers?
I ask because, at least in my testing, ingestion pipelines using HuggingFace embedding models were quite computationally intensive, so I don't set a high number of workers when using these models.
I just tried your snippet of code with 4 workers on my 12-core MacBook Pro and it worked.
With that being said, progress was not being shown despite having show_progress=True. I'll look into what's going on here too.
@nerdai just tried setting the num_workers param to the lowest possible value (2) and got the same result. The code just hangs, and I get the same message when executing the keyboard interrupt. I haven't dug into the code base, so I'm wondering how the code is parallelizing the embedding step. In my experience you can't serialize a PyTorch model, so multiprocessing is a no-go for that step, and you'd want to take advantage of the batch processing built into the SentenceTransformer models anyway.
Hey @americanthinker,
I spun up an Azure DSVM Ubuntu 20.04 (8 vCPU, 32 GB) VM and was able to replicate the hanging bug that you were experiencing. After a bit of investigating, it turns out that we're encountering a deadlock scenario here with the forks. You can read more about it here (skip to the section "The real solution: stop plain fork()ing" for the fix).
You should just need to add the following two lines of code in your script or Jupyter notebook before invoking the run method:
from multiprocessing import set_start_method
set_start_method("spawn", force=True)
There's also a change we can make in the library to set this start method as the default. I'll submit a PR for that fix soon.
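For reference, one way a library could make this the default without touching the caller's global start method is to hand its worker pool an explicit spawn context. The snippet below is only a sketch of that idea, not the actual change in the PR.
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# With an explicit mp_context, callers no longer need to call
# set_start_method("spawn", force=True) themselves.
pool = ProcessPoolExecutor(
    max_workers=4,
    mp_context=multiprocessing.get_context("spawn"),
)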
(Also, to answer your question about how we have multiprocessing set up for ingestion_pipeline.run: you can actually see it in the traceback you shared, but here's the link to the code for convenience. Ultimately, we've got a global function run_transformations that takes in the list of transformations, amongst other things. It's my understanding that the arguments and the function are pickled/serialized before being shipped off to the worker processes, which then reconstruct the transformation objects.)
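To illustrate the general pattern, here is a minimal sketch, not the actual LlamaIndex code: a module-level function plus picklable arguments get serialized, shipped to each worker process, and executed there.
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def run_transformations_on_batch(texts):
    # Stand-in for a module-level, picklable worker function; everything
    # passed to it must also be picklable.
    return [t.upper() for t in texts]

if __name__ == "__main__":
    # "spawn" avoids the fork-related deadlock discussed above.
    multiprocessing.set_start_method("spawn", force=True)
    batches = [["alpha", "beta"], ["gamma", "delta"]]
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(run_transformations_on_batch, batches))
    print(results)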
Also, here is the script that I used:
from llama_index import SimpleDirectoryReader
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.text_splitter import SentenceSplitter
from llama_index.ingestion import IngestionPipeline
from multiprocessing import set_start_method
# import logging
# import sys
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
if __name__ == "__main__":
    print("loading documents\n")
    documents = SimpleDirectoryReader(
        input_dir="./data/source_files"
    ).load_data(num_workers=2)

    print("creating ingestion pipeline\n")
    splitter = SentenceSplitter(chunk_overlap=0, chunk_size=128)
    model_name = 'sentence-transformers/all-miniLM-L6-v2'
    embed_model = HuggingFaceEmbedding(model_name=model_name, pooling='mean', embed_batch_size=64)
    pipeline = IngestionPipeline(transformations=[splitter, embed_model])

    print("running pipeline")
    set_start_method("spawn", force=True)  # it hangs without this line
    nodes = pipeline.run(documents=documents[:4], num_workers=4, show_progress=True)

    print("all done!")
    print(len(nodes))
As for the data, I downloaded the PaulGrahamEssayDataset from our LlamaHub using the llamaindex-cli tool prior to running the script:
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
@nerdai thanks for looking into that issue. The problem goes away after setting the start_method to spawn, but there is no appreciable speed-up compared to sequential processing; in fact, performance actually degrades. It's OK though, I don't need to use the LlamaIndex pipeline; I can get the preprocessing completed much quicker if I push the initial steps through a multiprocessing loop and then separately create the embeddings using the built-in batching already provided by the SentenceTransformer encode method.
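A rough sketch of that kind of workflow, with a placeholder chunking function and model/parameters as assumptions rather than anything prescribed in this thread: parallelize the CPU-bound preprocessing across processes, then embed in the main process and let SentenceTransformer handle the batching.
import multiprocessing
from sentence_transformers import SentenceTransformer

def split_into_chunks(text, chunk_size=128):
    # Placeholder chunker; in practice you'd use a real sentence splitter.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

if __name__ == "__main__":
    documents = ["first document text ...", "second document text ..."]

    # CPU-bound preprocessing in worker processes ("spawn" avoids fork issues).
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        chunk_lists = pool.map(split_into_chunks, documents)
    chunks = [c for chunk_list in chunk_lists for c in chunk_list]

    # Embedding in the main process, relying on the model's batched encode.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = model.encode(chunks, batch_size=64, show_progress_bar=True)
    print(embeddings.shape)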
Thanks for the confirmation that the hanging problem at least went away. Too bad it didn't lead to a speed-up compared to sequential processing; that can happen, of course, when the time spent serializing and acquiring locks outweighs the work being distributed. Perhaps there is also room for improvement in our parallel processing setup :)
By all means, take the strategy that works for you and your workload. Cheers!
Setting the num_workers to anything above 1 freezes everything up for me.
@AlbertoMQ num_workers doesn't really work for local models (which I saw you were using).
Tbh we should probably just remove this option
Bug Description
Setting num_workers to anything other than None causes the IngestionPipeline.run method to simply hang.
Version
0.9.31
Steps to Reproduce
import os
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.text_splitter import SentenceSplitter

docs = [...]
splitter = SentenceSplitter(chunk_overlap=0, chunk_size=128)
model_name = 'sentence-transformers/all-miniLM-L6-v2'
embed_model = HuggingFaceEmbedding(model_name=model_name, pooling='mean', embed_batch_size=64)
pipeline = IngestionPipeline(transformations=[splitter, embed_model])
nodes = pipeline.run(documents=docs, num_workers=os.cpu_count(), show_progress=True)
Relevant Logs/Tracebacks