run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: how to use the llamaindex+vllm correctly? #13925

Open lambda7xx opened 1 month ago

lambda7xx commented 1 month ago

Question Validation

Question

I installed LlamaIndex with `pip install llama-index` and vLLM with `pip install vllm`. The vllm version is 0.4.2, transformers is 4.40.0, and llama-index is 0.10.43.

I ran the following code from the documentation:

from llama_index.llms.vllm import Vllm

llm = Vllm(
    model="microsoft/Orca-2-7b",
    # tensor_parallel_size=4,
    max_new_tokens=100,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
)

llm.complete(
    ["[INST]You are a helpful assistant[/INST] What is a black hole ?"]
)

The error log is:

/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO 06-04 03:04:17 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='microsoft/Orca-2-7b', speculative_config=None, tokenizer='microsoft/Orca-2-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=microsoft/Orca-2-7b)
INFO 06-04 03:04:18 utils.py:660] Found nccl from library /home/llll/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 06-04 03:04:20 selector.py:81] Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
INFO 06-04 03:04:20 selector.py:32] Using XFormers backend.
INFO 06-04 03:04:23 weight_utils.py:199] Using model weights format ['*.bin']
INFO 06-04 03:04:35 model_runner.py:175] Loading model weights took 12.5532 GB
INFO 06-04 03:04:35 gpu_executor.py:114] # GPU blocks: 3331, # CPU blocks: 128
INFO 06-04 03:04:36 model_runner.py:937] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-04 03:04:36 model_runner.py:941] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-04 03:04:43 model_runner.py:1017] Graph capturing finished in 7 secs.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/llll/uw_llama_index/0603_explore/test_vllm.py", line 13, in <module>
[rank0]:     llm.complete(
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 223, in wrapper
[rank0]:     result = func(*args, **kwargs)
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/llama_index/core/llms/callbacks.py", line 389, in wrapped_llm_predict
[rank0]:     f_return_val = f(_self, *args, **kwargs)
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/llama_index/llms/vllm/base.py", line 252, in complete
[rank0]:     outputs = self._client.generate([prompt], sampling_params)
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 207, in generate
[rank0]:     self._add_request(
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 230, in _add_request
[rank0]:     self.llm_engine.add_request(request_id,
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 416, in add_request
[rank0]:     prompt_token_ids = self.encode_request(
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 348, in encode_request
[rank0]:     prompt_token_ids = self.tokenizer.encode(request_id=request_id,
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/tokenizer_group.py", line 42, in encode
[rank0]:     return tokenizer.encode(prompt)
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2629, in encode
[rank0]:     encoded_inputs = self.encode_plus(
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3037, in encode_plus
[rank0]:     return self._encode_plus(
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 576, in _encode_plus
[rank0]:     batched_output = self._batch_encode_plus(
[rank0]:   File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
[rank0]:     encodings = self._tokenizer.encode_batch(
[rank0]: TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
Exception ignored in: <function Vllm.__del__ at 0x7fc5673879a0>
Traceback (most recent call last):
  File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/llama_index/llms/vllm/base.py", line 217, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down

It seems there is some problem when LlamaIndex uses vLLM. Maybe I should install a different version?

lambda7xx commented 1 month ago

If I use vllm==0.4.3, the error is:

/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO 06-04 03:11:04 config.py:1130] Casting torch.float32 to torch.float16.
INFO 06-04 03:11:04 config.py:1151] Downcasting torch.float32 to torch.float16.
INFO 06-04 03:11:04 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='microsoft/Orca-2-7b', speculative_config=None, tokenizer='microsoft/Orca-2-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=microsoft/Orca-2-7b)
INFO 06-04 03:11:09 weight_utils.py:207] Using model weights format ['*.bin']
INFO 06-04 03:11:20 model_runner.py:146] Loading model weights took 12.5532 GB
INFO 06-04 03:11:20 gpu_executor.py:83] # GPU blocks: 3355, # CPU blocks: 128
INFO 06-04 03:11:21 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-04 03:11:21 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-04 03:11:27 model_runner.py:924] Graph capturing finished in 6 secs.
type(inpits):<class 'dict'>

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, Generation Speed: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  5.95it/s, Generation Speed: 136.91 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  5.95it/s, Generation Speed: 136.91 toks/s]
Exception ignored in: <function Vllm.__del__ at 0x7f05d599f9a0>
Traceback (most recent call last):
  File "/home/llll/anaconda3/envs/llama_index/lib/python3.10/site-packages/llama_index/llms/vllm/base.py", line 217, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down

logan-markewich commented 1 month ago

@lambda7xx that second log with 0.4.3 is not an error; it ran correctly. vLLM just does some wacky stuff when shutting down the process. If you actually printed the output of the llm.complete call, it would print just fine.

Seems like in newer versions, they added some different expected input type. The integration would need to be updated to handle the newer version.
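
For reference, a minimal sketch of checking the output (assuming the same llm object as above; in llama-index, complete takes the prompt as a plain string and returns a CompletionResponse whose text can be printed):

response = llm.complete(
    "[INST]You are a helpful assistant[/INST] What is a black hole ?"
)
print(response.text)  # printing the response object itself also shows the text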

lambda7xx commented 1 month ago

> @lambda7xx that second log with 0.4.3 is not an error; it ran correctly. vLLM just does some wacky stuff when shutting down the process. If you actually printed the output of the llm.complete call, it would print just fine.
>
> Seems like in newer versions, they added some different expected input type. The integration would need to be updated to handle the newer version.

Thanks. I added a print. It still errors and doesn't print. @logan-markewich


 Business Insider's official newsletter gives you a detailed explanation of the concept of a black hole.
Exception ignored in: <function Vllm.__del__ at 0x7f27334739a0>
Traceback (most recent call last):
  File "/home/lllll/anaconda3/envs/llama_index/lib/python3.10/site-packages/llama_index/llms/vllm/base.py", line 217, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down

logan-markewich commented 1 month ago

@lambda7xx it's not a real error though -- you can see the response got printed fine, and the execution of your script was not interrupted

This is an error raised during shutdown, and it is benign
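
A possible workaround sketch (an assumption, not an official fix): releasing the Vllm wrapper explicitly before the interpreter exits lets its __del__ run while imports are still available, which should avoid the noisy shutdown traceback.

response = llm.complete(
    "[INST]You are a helpful assistant[/INST] What is a black hole ?"
)
print(response.text)
del llm  # assumption: cleaning up here, rather than at interpreter exit, avoids the "sys.meta_path is None" ImportError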

lambda7xx commented 1 month ago

I see the print. It prints nothing.

INFO 06-04 17:06:04 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='microsoft/Orca-2-7b', speculative_config=None, tokenizer='microsoft/Orca-2-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=microsoft/Orca-2-7b)
INFO 06-04 17:06:09 weight_utils.py:207] Using model weights format ['*.bin']
INFO 06-04 17:06:21 model_runner.py:146] Loading model weights took 12.5532 GB
INFO 06-04 17:06:21 gpu_executor.py:83] # GPU blocks: 3355, # CPU blocks: 128
INFO 06-04 17:06:22 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-04 17:06:22 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-04 17:06:29 model_runner.py:924] Graph capturing finished in 6 secs.
type(inpits):<class 'dict'>
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.91it/s, Generation Speed: 136.11 toks/s]

 Business Insider's official newsletter gives you a detailed explanation of the concept of a black hole.
Exception ignored in: <function Vllm.__del__ at 0x7efd6d0739a0>
Traceback (most recent call last):

lambda7xx commented 1 month ago

@logan-markewich Sorry to bother you, could you help me resolve this? Thanks

logan-markewich commented 1 month ago

@lambda7xx but it did print

Business Insider's official newsletter gives you a detailed explanation of the concept of a black hole.

logan-markewich commented 1 month ago

Then, that's the end of the script. It runs fine 😅

lambda7xx commented 1 month ago

> Then, that's the end of the script. It runs fine 😅

Oh my god, sorry, I missed it.

logan-markewich commented 1 month ago

Lol no worries, lots of text there