AlbertoMQ opened this issue 1 month ago
The low GPU usage you're experiencing with the model "HuggingFaceH4/zephyr-7b-beta" could be due to the quantization settings in your configuration. The model is likely being loaded in 4-bit precision, which reduces the model size and computational requirements, leading to lower GPU usage. This is controlled by a `BitsAndBytesConfig` with `load_in_4bit=True`. If the model fails to load with quantization due to missing CUDA, it falls back to a full-precision model, which might not fully utilize the GPU if the system's resources are not optimal for such a configuration [1].
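For reference, here is a minimal sketch of how such a 4-bit configuration is typically passed to a LlamaIndex `HuggingFaceLLM`; the import path, model name, and parameters below are assumptions based on recent llama-index and transformers versions, not something taken from your setup:

```python
import torch
from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM  # assumed import path

# Illustrative 4-bit quantization config; if something like this is set
# (explicitly or by a default in your stack), the model loads in 4-bit
# precision and GPU memory usage drops accordingly.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    # model_kwargs are forwarded to AutoModelForCausalLM.from_pretrained
    model_kwargs={"quantization_config": quantization_config},
)
```

Checking memory with `nvidia-smi` before and after loading is a quick way to confirm whether the model actually came up quantized.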
@dosu I hadn't set the BitsAndBytesConfig anywhere, but perhaps it is quantized. Is there a suggested way to run multiple queries at once to fully utilize the GPU?
To run multiple queries at once and optimize GPU usage in the LlamaIndex framework, you can use the `batched` function from the `NVIDIARerank` class. This function allows you to process multiple queries in batches, which can help in efficiently utilizing the GPU. Here's a brief overview of how it works:
```python
from typing import Generator, List
from llama_index.core.schema import NodeWithScore

def batched(ls: list, size: int) -> Generator[List[NodeWithScore], None, None]:
    # Yield successive slices of `ls`, each containing up to `size` items
    for i in range(0, len(ls), size):
        yield ls[i : i + size]
```
This function divides a list into smaller batches of a specified size. You can use it to batch your queries and process them together, which can lead to better GPU utilization by handling multiple queries simultaneously [1].
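As a rough usage sketch (the batch size of 8 and the plain string queries here are placeholders, not values from your setup), batching a list of items looks like this:

```python
queries = ["query1", "query2", "query3", "query4", "query5"]  # placeholder queries

# Iterate over the queries in fixed-size groups instead of one at a time
for batch in batched(queries, size=8):
    # Process the whole group together, e.g. score or embed all items in one call
    print(batch)
```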
@dosu I don't want to rerank. I want to run batched RetrieverQueryEngine queries.
To run batched queries with the `RetrieverQueryEngine` in the LlamaIndex framework, you can utilize the asynchronous capabilities of the engine. The `RetrieverQueryEngine` supports parallel processing of queries through its asynchronous methods, such as `aretrieve`. This allows you to handle multiple queries in parallel, which can help in optimizing GPU usage.
Here's a basic example of how you might set up and use the `RetrieverQueryEngine` with asynchronous queries:
```python
import asyncio

from llama_index.core import QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine

# Assume retriever is already defined
query_engine = RetrieverQueryEngine(retriever)

async def process_queries(queries):
    # aretrieve expects a QueryBundle; launch all retrievals concurrently
    tasks = [query_engine.aretrieve(QueryBundle(q)) for q in queries]
    results = await asyncio.gather(*tasks)
    return results

queries = ["query1", "query2", "query3"]  # Example queries
results = asyncio.run(process_queries(queries))
```
This example demonstrates how to run multiple queries asynchronously, which can help in utilizing the GPU more effectively by processing several queries at once.
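If you want full responses (retrieval plus LLM synthesis) rather than just the retrieved nodes, the same pattern should work with the engine's `aquery` method, which accepts plain query strings; this is a minimal sketch along those lines, reusing the `query_engine` defined above:

```python
async def process_full_queries(queries):
    # aquery runs retrieval and response synthesis for each query string
    tasks = [query_engine.aquery(q) for q in queries]
    return await asyncio.gather(*tasks)

responses = asyncio.run(process_full_queries(["query1", "query2", "query3"]))
for response in responses:
    print(response)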
Question
GPU usage is quite low. Wondering if there's something I'm missing.
Usage is around 1602MiB, while there is plenty of space available (1741MiB / 12288MiB).