run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Is `llama_index` thread-safe? #13375

Closed: ashishb closed this issue 2 months ago

ashishb commented 5 months ago

Question Validation

Question

Is it safe to use llama_index for multiple LLM requests in parallel? Is it thread-safe or does one need locking?

dosubot[bot] commented 5 months ago

Hey @ashishb, nice to see you around here again! I'm on it, and I'll get back to you with a detailed answer shortly.

logan-markewich commented 5 months ago

In general, I think it depends on what features you are using. Most LLM clients have some internal locking (like the OpenAI client does). Some LLMs will only process things sequentially no matter what (e.g. Hugging Face, Ollama).

Things that write to in-memory objects (like the default vector store) are likely not thread-safe. Reading is typically fine, though.

In general, using async/concurrency will be much safer than trying to manage threads.
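
As a minimal sketch of the async approach (assuming a `query_engine` built from a `VectorStoreIndex` over a local `./data` directory; the file paths and questions here are just placeholders):

```python
import asyncio

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Hypothetical setup: one index/query engine shared by all coroutines for reads.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

async def main() -> None:
    questions = [
        "What is the summary of document A?",
        "What does document B say about pricing?",
    ]
    # aquery() runs the LLM calls concurrently on a single event loop,
    # so there are no threads (and no shared-state locking) to manage.
    responses = await asyncio.gather(*(query_engine.aquery(q) for q in questions))
    for question, response in zip(questions, responses):
        print(question, "->", str(response))

asyncio.run(main())
```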

dosubot[bot] commented 5 months ago

The LlamaIndex framework's approach to handling multiple LLM requests in parallel, particularly regarding thread safety, primarily depends on the specific features you're using. Most LLM clients, including the OpenAI client, incorporate some form of internal locking, which can provide a level of thread safety. However, certain LLMs process requests sequentially by design, such as Hugging Face and Ollama, which means they inherently avoid concurrent processing issues.

For operations that involve writing to in-memory objects, like the default vector store in LlamaIndex, thread safety is not guaranteed. These operations are likely not thread-safe, and special care should be taken when multiple threads might write to the same in-memory objects. Reading from these objects, on the other hand, is generally safe and does not pose the same level of risk for concurrent operations.

The framework supports asynchronous operations throughout, which suggests that leveraging async/concurrency, rather than managing threads directly, is the safer and more efficient approach for handling parallel LLM requests. Async programming with asyncio is well suited to I/O-bound tasks such as LLM calls and avoids the complexities and risks of direct thread management.

In summary, while LlamaIndex supports asynchronous operations which can facilitate handling multiple LLM requests in parallel, caution should be exercised with write operations to shared in-memory objects due to potential thread safety issues. Leveraging the framework's async features and being mindful of the operations that are not thread-safe are key considerations for safely using LlamaIndex in a concurrent environment.
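
To illustrate the write-safety point, here is a sketch of one way to serialize writes to a shared in-memory index while leaving reads lock-free (the explicit lock is an assumption for illustration, not something LlamaIndex requires or documents):

```python
import threading

from llama_index.core import Document, VectorStoreIndex

# Hypothetical shared, in-memory index used by several worker threads.
documents = [Document(text="initial content")]
index = VectorStoreIndex.from_documents(documents)
write_lock = threading.Lock()

def ingest(text: str) -> None:
    # insert() mutates the default in-memory vector store, so serialize writes.
    with write_lock:
        index.insert(Document(text=text))

def ask(question: str) -> str:
    # Queries only read from the index, which is generally safe to do concurrently.
    return str(index.as_query_engine().query(question))
```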

#### Details

**Sources**

- [Is Llama Index thread safe? Particularly the Index stores](https://github.com/run-llama/llama_index/issues/830)
- [[Feature Request]: Parallelize sync APIs with multi-threading](https://github.com/run-llama/llama_index/issues/6708)
- [llama-index-legacy/llama_index/legacy/llms/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/base.py)
- [llama-index-legacy/llama_index/legacy/ingestion/pipeline.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/ingestion/pipeline.py)
- [llama-index-core/llama_index/core/storage/index_store/keyval_index_store.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/storage/index_store/keyval_index_store.py)
- [llama-index-core/llama_index/core/indices/common_tree/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/common_tree/base.py)
**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

ashishb commented 5 months ago

> In general, using async/concurrency will be much safer than trying to manage threads.

@logan-markewich thanks. Can you elaborate on why using async/concurrency will be much safer than trying to manage threads? Any examples/docs would be helpful.

Context: I am using llama-index with a Python web server.

logan-markewich commented 5 months ago

I just mean using async calls in something like FastAPI (e.g. query_engine.aquery(), chat_engine.achat(), etc.).
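
Roughly something like this (a hypothetical FastAPI app with illustrative paths and endpoint names, not an official recipe):

```python
from fastapi import FastAPI
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

app = FastAPI()

# Hypothetical: build the index once at startup and reuse it for read-only queries.
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
query_engine = index.as_query_engine()

@app.get("/query")
async def query(q: str) -> dict:
    # The async variant lets FastAPI serve many requests on one event loop
    # instead of juggling threads around a shared engine.
    response = await query_engine.aquery(q)
    return {"answer": str(response)}
```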

Otherwise, for threading, it's safer to create objects from scratch for each request (and use remote models and vector stores).
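
For example, a per-request sketch backed by a remote vector store (Qdrant is just one example integration; the client URL and collection name are placeholders):

```python
from qdrant_client import QdrantClient
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

def handle_request(question: str) -> str:
    # Build everything fresh for this request; the persistent state lives in the
    # remote store, so threads never share mutable in-memory objects.
    client = QdrantClient(url="http://localhost:6333")
    vector_store = QdrantVectorStore(client=client, collection_name="docs")
    index = VectorStoreIndex.from_vector_store(vector_store)
    return str(index.as_query_engine().query(question))
```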