nvidia device info:
lroberts@GPU77B9:~/llm_quantization$ nvidia-smi
Fri Jan 26 22:08:29 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 34C P0 74W / 400W| 62027MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:0A:00.0 Off | 0 |
| N/A 31C P0 65W / 400W| 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:47:00.0 Off | 0 |
| N/A 32C P0 64W / 400W| 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:4D:00.0 Off | 0 |
| N/A 35C P0 68W / 400W| 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:87:00.0 Off | 0 |
| N/A 36C P0 68W / 400W| 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:8D:00.0 Off | 0 |
| N/A 33C P0 69W / 400W| 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:C7:00.0 Off | 0 |
| N/A 32C P0 66W / 400W| 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:CA:00.0 Off | 0 |
| N/A 35C P0 67W / 400W| 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3727132 C python3.10 62014MiB |
+---------------------------------------------------------------------------------------+
I also hit `module 'pydantic._internal' has no attribute '_model_construction'` on the newest dev version.
Looks like this is fixed in Ray 2.9 https://github.com/ray-project/ray/issues/41913#issuecomment-1858319354. Try upgrading Ray? We will make sure to lower bound the Ray version as well.
I have tried ray 2.9.1 with the newest dev code:
vllm.entrypoints.openai.api_server --model ./Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2
but I hit another error:
Failed: Cuda error /home/ysq/vllm/csrc/custom_all_reduce.cuh:417 'resource already mapped' Segmentation fault (core dumped)
That seems to be a different issue, please open another ticket and I can try reproducing it.
Update: I did try to reproduce it. With the latest main branch:
INFO 01-28 23:03:50 llm_engine.py:317] # GPU blocks: 15130, # CPU blocks: 4096
INFO 01-28 23:03:52 model_runner.py:626] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-28 23:03:52 model_runner.py:630] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=5246) INFO 01-28 23:03:52 model_runner.py:626] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=5246) INFO 01-28 23:03:52 model_runner.py:630] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-28 23:03:58 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
INFO 01-28 23:03:58 model_runner.py:691] Graph capturing finished in 6 secs.
(RayWorkerVllm pid=5246) INFO 01-28 23:03:58 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
(RayWorkerVllm pid=5246) INFO 01-28 23:03:58 model_runner.py:691] Graph capturing finished in 6 secs.
INFO 01-28 23:03:58 serving_chat.py:260] Using default chat template:
INFO 01-28 23:03:58 serving_chat.py:260] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO: Started server process [3303]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
INFO 01-28 23:04:08 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:18 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:28 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:38 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:48 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:58 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:08 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:18 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:28 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:38 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:48 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:58 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:06:08 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:06:10 async_llm_engine.py:431] Received request cmpl-2b94e87fa6e5414b9b2369ec6f77e666: prompt: '<s>[INST] Say this is a test! [/INST]', prefix_pos: None,sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32753, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [1, 1, 733, 16289, 28793, 15753, 456, 349, 264, 1369, 28808, 733, 28748, 16289, 28793], lora_request: None.
INFO 01-28 23:06:11 async_llm_engine.py:110] Finished request cmpl-2b94e87fa6e5414b9b2369ec6f77e666.
INFO: 127.0.0.1:57316 - "POST /v1/chat/completions HTTP/1.1" 200 OK
> Looks like this is fixed in Ray 2.9 ray-project/ray#41913 (comment). Try upgrading Ray? We will make sure to lower bound the Ray version as well.

> I have tried ray 2.9.1 with the newest dev code (vllm.entrypoints.openai.api_server --model ./Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2) but I hit another error: Failed: Cuda error /home/ysq/vllm/csrc/custom_all_reduce.cuh:417 'resource already mapped' Segmentation fault (core dumped)
As a workaround, I think you could pass the flag `--disable-custom-all-reduce` to avoid using those custom all-reduce kernels.
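If it helps, here is a minimal sketch of the same workaround through the Python `LLM` API (assuming the `disable_custom_all_reduce` engine argument mirrors the CLI flag; the prompt and sampling settings are illustrative):

```python
# Minimal sketch: disable the custom all-reduce kernels from the Python API
# so tensor-parallel workers fall back to the regular NCCL all-reduce.
# Assumes disable_custom_all_reduce mirrors --disable-custom-all-reduce.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
    dtype="auto",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,  # workaround for 'resource already mapped'
)
out = llm.generate(["Say this is a test!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```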
I believe #2642 might fix "resource already mapped"; please try with the latest main. Sorry about the back and forth.
@simon-mo This works for me:
python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2
some stdout:
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
INFO 01-29 22:22:09 api_server.py:209] args: Namespace(host='0.0.0.0', port=8081, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 01-29 22:22:09 config.py:177] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-29 22:22:12,853 INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-29 22:22:14 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, seed=0)
(raylet) /usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
(raylet) warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
INFO 01-29 22:22:22 weight_utils.py:164] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:22 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 01-29 22:22:26 llm_engine.py:322] # GPU blocks: 66274, # CPU blocks: 4096
INFO 01-29 22:22:29 model_runner.py:632] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-29 22:22:29 model_runner.py:636] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:29 model_runner.py:632] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:29 model_runner.py:636] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-29 22:22:33 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
INFO 01-29 22:22:33 model_runner.py:698] Graph capturing finished in 4 secs.
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:33 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:33 model_runner.py:698] Graph capturing finished in 4 secs.
INFO 01-29 22:22:34 serving_chat.py:260] Using default chat template:
INFO 01-29 22:22:34 serving_chat.py:260] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO: Started server process [3928411]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
INFO 01-29 22:22:44 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-29 22:22:54 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-29 22:23:04 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-29 22:23:14 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
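For reference, a minimal client-side check against the OpenAI-compatible endpoint started above (a sketch; assumes the server is reachable on localhost:8081 and the model name matches the `--model` argument):

```python
# Minimal sketch of a chat completion request against the vLLM OpenAI server
# started above; assumes localhost:8081 and the served model name below.
import requests

resp = requests.post(
    "http://localhost:8081/v1/chat/completions",
    json={
        "model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "max_tokens": 32,
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```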
It also works with tensor-parallel size 8 (all the GPUs on this machine).
Specs on the A100 machine I'm using:
NVIDIA-SMI 530.30.02    Driver Version: 530.30.02    CUDA Version: 12.1
Python env:
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import vllm, ray, torch, pydantic; print(vllm.__version__); print(ray.__version__); print(torch.__version__); print(pydantic.__version__)"
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
0.2.7
2.9.1
2.1.2+cu121
2.6.0
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ git log -n 2
commit ea8489fce266d69f2fbe314c1385956b1a342e12 (HEAD -> main, origin/main, origin/HEAD)
Author: Rasmus Larsen <rlarsen@pm.me>
Date: Mon Jan 29 19:52:31 2024 +0100
ROCm: Allow setting compilation target (#2581)
commit 1b20639a43e811f4469e3cfa543cf280d0d76265
Author: Hanzhi Zhou <hanzhi713@163.com>
Date: Tue Jan 30 02:46:29 2024 +0800
No repeated IPC open (#2642)
I don't think the in-place error (https://github.com/vllm-project/vllm/issues/2620) is resolved, though. I still see that one.
I have a local dev build on commit, and I have some local code that is a thin wrapper around the `LLM` class. If I run this with tensor-parallel == 2, I get the following error; however, tensor-parallel == 1 works fine, with a response.
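For context, a minimal sketch of what such a thin wrapper might look like (illustrative only; the class name and defaults are mine, not the actual local code):

```python
# Hypothetical sketch of a thin wrapper around vllm.LLM, for context only;
# the real local code is not shown in this issue. The failure described above
# occurs with tensor_parallel_size=2, while tensor_parallel_size=1 works.
from vllm import LLM, SamplingParams


class LocalLLM:  # illustrative name
    def __init__(self, model_path: str, tensor_parallel_size: int = 1):
        self.llm = LLM(model=model_path, tensor_parallel_size=tensor_parallel_size)

    def complete(self, prompt: str, max_tokens: int = 64) -> str:
        params = SamplingParams(max_tokens=max_tokens)
        return self.llm.generate([prompt], params)[0].outputs[0].text
```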
The error in the Ray logs indicates a serialization issue.
Relevant details about the env:
It seems there is a known fix or workaround here -> https://github.com/ray-project/ray/issues/41913#issuecomment-1856452166
but it also seems that pydantic version 2 is required for the OpenAI testing:
https://github.com/vllm-project/vllm/blob/3a0e1fc070dc7482ab1c8fcdc961e5729a4cb0b3/requirements.txt#L11
Is there a suggested workaround, or should I manually downgrade pydantic to a version lower than 2.0.0?
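Since the Ray 2.9 upgrade suggested above turned out to resolve the pydantic AttributeError, a quick version check (a sketch; assumes the `packaging` package is installed) may be more useful than downgrading pydantic:

```python
# Minimal sketch: check the installed Ray and pydantic versions. The
# pydantic._internal AttributeError was reported fixed in Ray >= 2.9,
# so upgrading Ray avoids having to pin pydantic below 2.0.
# Assumes the 'packaging' package is available.
from importlib.metadata import version
from packaging.version import Version

ray_ver = Version(version("ray"))
print(f"ray={ray_ver} pydantic={version('pydantic')}")
if ray_ver < Version("2.9"):
    print("Ray is older than 2.9; upgrade Ray rather than downgrading pydantic.")
```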