AlbertoMQ opened this issue 1 month ago
The low GPU usage you're experiencing with the model "HuggingFaceH4/zephyr-7b-beta" could be due to the quantization settings in your configuration. The model is likely being loaded in 4-bit precision, which reduces the model size and computational requirements, leading to lower GPU usage. This is controlled by a `BitsAndBytesConfig` with `load_in_4bit=True`. If the model fails to load with quantization due to missing CUDA, it falls back to a full-precision model, which might not fully utilize the GPU if the system's resources are not optimal for such a configuration [1].
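For reference, here is a minimal sketch of how such a 4-bit configuration is typically passed to a LlamaIndex `HuggingFaceLLM`; the import path, model name, and parameters below are assumptions based on recent llama-index and transformers versions, not something taken from your setup:

```python
import torch
from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM  # assumed import path

# Illustrative 4-bit quantization config; if something like this is set
# (explicitly or by a default in your stack), the model loads in 4-bit
# precision and GPU memory usage drops accordingly.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    # model_kwargs are forwarded to AutoModelForCausalLM.from_pretrained
    model_kwargs={"quantization_config": quantization_config},
)
```

Checking memory with `nvidia-smi` before and after loading is a quick way to confirm whether the model actually came up quantized.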
@dosu I hadn't set the BitsAndBytesConfig anywhere, but perhaps it is quantized. Is there a suggested way to run multiple queries at once to fully utilize the GPU?
To run multiple queries at once and optimize GPU usage in the LlamaIndex framework, you can use the `batched` function from the `NVIDIARerank` class. This function allows you to process multiple queries in batches, which can help in efficiently utilizing the GPU. Here's a brief overview of how it works:
```python
from typing import Generator, List
from llama_index.core.schema import NodeWithScore

def batched(ls: list, size: int) -> Generator[List[NodeWithScore], None, None]:
    # Yield successive slices of `ls`, each containing up to `size` items
    for i in range(0, len(ls), size):
        yield ls[i : i + size]
```
This function divides a list into smaller batches of a specified size. You can use it to batch your queries and process them together, which can lead to better GPU utilization by handling multiple queries simultaneously [1].
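As a rough usage sketch (the batch size of 8 and the plain string queries here are placeholders, not values from your setup), batching a list of items looks like this:

```python
queries = ["query1", "query2", "query3", "query4", "query5"]  # placeholder queries

# Iterate over the queries in fixed-size groups instead of one at a time
for batch in batched(queries, size=8):
    # Process the whole group together, e.g. score or embed all items in one call
    print(batch)
```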
@dosu I don't want to rerank. I want to run batched RetrieverQueryEngine queries.
To run batched queries with the `RetrieverQueryEngine` in the LlamaIndex framework, you can utilize the asynchronous capabilities of the engine. The `RetrieverQueryEngine` supports parallel processing of queries through its asynchronous methods, such as `aretrieve`. This allows you to handle multiple queries in parallel, which can help in optimizing GPU usage.
Here's a basic example of how you might set up and use the `RetrieverQueryEngine` with asynchronous queries:
```python
import asyncio

from llama_index.core import QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine

# Assume retriever is already defined
query_engine = RetrieverQueryEngine(retriever)

async def process_queries(queries):
    # aretrieve expects a QueryBundle; launch all retrievals concurrently
    tasks = [query_engine.aretrieve(QueryBundle(q)) for q in queries]
    results = await asyncio.gather(*tasks)
    return results

queries = ["query1", "query2", "query3"]  # Example queries
results = asyncio.run(process_queries(queries))
```
This example demonstrates how to run multiple queries asynchronously, which can help in utilizing the GPU more effectively by processing several queries at once.
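If you want full responses (retrieval plus LLM synthesis) rather than just the retrieved nodes, the same pattern should work with the engine's `aquery` method, which accepts plain query strings; this is a minimal sketch along those lines, reusing the `query_engine` defined above:

```python
async def process_full_queries(queries):
    # aquery runs retrieval and response synthesis for each query string
    tasks = [query_engine.aquery(q) for q in queries]
    return await asyncio.gather(*tasks)

responses = asyncio.run(process_full_queries(["query1", "query2", "query3"]))
for response in responses:
    print(response)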
Question
GPU usage is quite low. Wondering if there's something I'm missing.
Usage is around 1602MiB, while there is plenty of space available (1741MiB / 12288MiB).