run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: GPU low usage #16573

Open AlbertoMQ opened 3 days ago

AlbertoMQ commented 3 days ago

Question

GPU usage is quite low. Wondering if there's something I'm missing.

GPU memory usage is around 1602 MiB, while there is plenty of space still available (1741 MiB / 12288 MiB used in total).

from llama_index.llms.huggingface import HuggingFaceLLM

# messages_to_prompt and completion_to_prompt are defined elsewhere
default_llm = llm = HuggingFaceLLM(
    # model_name="TheBloke/zephyr-7b-beta",
    # tokenizer_name="TheBloke/zephyr-7b-beta",
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    context_window=1028,
    max_new_tokens=256,
    generate_kwargs={"top_k": 10, "do_sample": True},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="auto",
)
dosubot[bot] commented 3 days ago

The low GPU usage you're experiencing with the model "HuggingFaceH4/zephyr-7b-beta" could be due to the quantization settings in your configuration. The model is likely being loaded in 4-bit precision, which reduces the model size and computational requirements, leading to lower GPU usage. This is controlled by the BitsAndBytesConfig with load_in_4bit=True. If the model fails to load with quantization due to missing CUDA, it defaults to a full precision model, which might not fully utilize the GPU if the system's resources are not optimal for such a configuration [1].
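
If you want to control quantization explicitly rather than rely on any defaults, HuggingFaceLLM accepts a model_kwargs dict that is forwarded to the underlying transformers from_pretrained call. A minimal sketch, assuming transformers and bitsandbytes are installed (the quantization settings below are illustrative and not taken from this issue):

import torch
from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

# Illustrative 4-bit quantization config; adjust to your hardware
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    device_map="auto",
    # Passed through to AutoModelForCausalLM.from_pretrained
    model_kwargs={"quantization_config": quant_config},
)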

To continue talking to Dosu, mention @dosu.

AlbertoMQ commented 3 days ago

@dosu I hadn't set the BitsAndBytesConfig anywhere, but perhaps it is quantized. Is there a suggested way to run multiple queries at once to fully utilize the GPU?

dosubot[bot] commented 3 days ago

To run multiple queries at once and optimize GPU usage in the LlamaIndex framework, you can use the batched function from the NVIDIARerank class. This function allows you to process multiple queries in batches, which can help in efficiently utilizing the GPU. Here's a brief overview of how it works:

from typing import Generator, List
from llama_index.core.schema import NodeWithScore

def batched(ls: list, size: int) -> Generator[List[NodeWithScore], None, None]:
    for i in range(0, len(ls), size):
        yield ls[i : i + size]

This function divides a list into smaller batches of a specified size. You can use it to batch your queries and process them together, which can lead to better GPU utilization by handling multiple queries simultaneously [1].
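
For illustration, a minimal usage sketch (the nodes list, batch size, and process_batch helper are hypothetical):

nodes = [...]  # list of NodeWithScore objects to split up
for batch in batched(nodes, size=32):
    process_batch(batch)  # hypothetical per-batch work, e.g. scoring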

To continue talking to Dosu, mention @dosu.

AlbertoMQ commented 2 days ago

@dosu I don't want to rerank. I want to run batched RetrieverQueryEngine

dosubot[bot] commented 2 days ago

To run batched queries with the RetrieverQueryEngine in the LlamaIndex framework, you can use the engine's asynchronous methods. RetrieverQueryEngine exposes aquery, which runs the full retrieve-and-synthesize pipeline asynchronously (aretrieve performs retrieval only, without the LLM call), so you can launch several queries concurrently and keep the GPU busier.

Here's a basic example of how you might set up and use the RetrieverQueryEngine with asynchronous queries:

import asyncio
from llama_index.core.query_engine import RetrieverQueryEngine

# Assume retriever is already defined
query_engine = RetrieverQueryEngine(retriever)

async def process_queries(queries):
    # Launch all queries concurrently; each aquery call runs retrieval
    # and LLM response synthesis asynchronously.
    tasks = [query_engine.aquery(query) for query in queries]
    results = await asyncio.gather(*tasks)
    return results

queries = ["query1", "query2", "query3"]  # Example queries
results = asyncio.run(process_queries(queries))

This example demonstrates how to run multiple queries asynchronously, which can help in utilizing the GPU more effectively by processing several queries at once.

To continue talking to Dosu, mention @dosu.