run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

Query execution happens on the CPU even though the model is loaded on the GPU #4554

Closed suraj-gade closed 1 year ago

suraj-gade commented 1 year ago

Hi,

I am building a chatbot using an LLM like fastchat-t5-3b-v1.0 and want to reduce my inference time.

I am loading the entire model onto the GPU using the device_map parameter, and I am querying the LLM through the Hugging Face pipeline (wrapped with LangChain's HuggingFacePipeline), also specifying device=0 (the first GPU) for the pipeline. I am monitoring GPU and CPU usage throughout the entire execution, and I can see that even though the model is on the GPU, the CPU is used at the time of querying the model. The spike in CPU usage suggests that query execution is happening on the CPU.
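For reference, this is roughly how I am checking where the loaded weights actually live (a sketch, not part of the application code; model refers to the T5ForConditionalGeneration instance created in the code below):

import torch

print(next(model.parameters()).device)        # expect cuda:0 if the weights are on the first GPU
print(set(model.hf_device_map.values()))      # per-module placement recorded by accelerate, e.g. {0}
print(torch.cuda.memory_allocated(0) / 1e9)   # GB allocated on GPU 0; roughly 6 GB for a 3B fp16 model

In my case these all point at the GPU, which is why I believe the weights themselves are on cuda:0.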

Below is the code that I am using to do inference on Fastchat LLM.

import torch

from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex, PromptHelper, LLMPredictor
from llama_index import LangchainEmbedding, ServiceContext
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration
from accelerate import init_empty_weights, infer_auto_device_map

model_name = 'lmsys/fastchat-t5-3b-v1.0'

config = T5Config.from_pretrained(model_name)
with init_empty_weights():
    model_layer = T5ForConditionalGeneration(config=config)

device_map = infer_auto_device_map(model_layer, max_memory={0: "12GiB",1: "12GiB", "cpu": "0GiB"}, no_split_module_classes=["T5Block"])

# infer_auto_device_map resolves to device_map = {'': 0} here, i.e. the whole model is placed on the first GPU

model = T5ForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.float16, device_map=device_map, offload_folder="offload", offload_state_dict=True)

tokenizer = T5Tokenizer.from_pretrained(model_name)

from transformers import pipeline

pipe = pipeline(
    "text2text-generation", model=model, tokenizer=tokenizer, device=0,
    max_length=1536, temperature=0, top_p=1, num_beams=1, early_stopping=False
)

from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=pipe)

embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

# set maximum input size
max_input_size = 2048
# set number of output tokens
num_outputs = 512
# set maximum chunk overlap
max_chunk_overlap = 20
# set chunk size limit
chunk_size_limit = 300
prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap)

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm_predictor=LLMPredictor(llm), prompt_helper=prompt_helper, chunk_size_limit=chunk_size_limit)

# build index
documents = SimpleDirectoryReader('data').load_data()

new_index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = new_index.as_query_engine(
    verbose=True,
    similarity_top_k=2
)

response = query_engine.query("sample query question?")

Here the “data” folder contains my full input text as PDF files. I am using GPTVectorStoreIndex and the Hugging Face pipeline to build the index over that data, fetch the relevant chunks to build a prompt with context, and query the FastChat model, as shown in the code.

Please have a look and let me know if this is the expected behaviour. How can I make use of the GPU for query execution as well, to reduce the inference response time?
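One thing I have not ruled out on my side is the embedding step: HuggingFaceEmbeddings() is constructed with defaults, so I am not certain which device the underlying sentence-transformers model ends up on. A rough sketch of pinning it to the GPU explicitly (the model_kwargs option is forwarded to sentence-transformers; I am treating the exact behaviour as an assumption):

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding

# sketch: force the embedding model onto GPU 0 so that embedding the query text
# at retrieval time does not silently run on the CPU
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_kwargs={"device": "cuda:0"})
)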

Disiok commented 1 year ago

If you do

llm('Test string')

does it run on the CPU or the GPU?
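For example, something like this (a rough sketch, assuming llm and pipe are the objects from your snippet and GPU 0 is the target device):

import torch

print(pipe.model.device)                               # which device the pipeline's model reports
baseline = torch.cuda.memory_allocated(0)
torch.cuda.reset_peak_memory_stats(0)
print(llm('Test string'))
print(torch.cuda.max_memory_allocated(0) - baseline)   # stays near zero if generation ran on the CPU

If the peak GPU allocation barely moves while the call runs, generation is not happening on cuda:0.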

Disiok commented 1 year ago

Just trying to understand whether it's an issue inside LlamaIndex or with the configuration of the HuggingFacePipeline.

suraj-gade commented 1 year ago

Hi @Disiok, thanks for the response.

Executing llm('Test string') also utilizes only the CPU.

logan-markewich commented 1 year ago

Sounds like a configuration problem with huggingface tbh. Going to close this for now. Feel free to re-open if you can confirm the model is actually running on GPU before giving it to llama-index.
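Something like this is what I mean by confirming it, with llama-index completely out of the picture (a sketch that reuses the loading code from the issue; the prompt string is just an example):

import time
import torch
from transformers import pipeline, T5Tokenizer, T5ForConditionalGeneration

model_name = "lmsys/fastchat-t5-3b-v1.0"

# same placement as in the issue: put the whole model on GPU 0 via accelerate
model = T5ForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map={"": 0}
)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# since accelerate has already placed the model, no device= argument is needed here
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_length=256)

print(pipe.model.device)                      # expect cuda:0
start = time.time()
print(pipe("What is the capital of France?"))
print("latency:", time.time() - start)        # watch nvidia-smi while this runs

If that bare pipeline already pegs the CPU, the problem is in the huggingface/accelerate setup rather than in llama-index.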