vllm development does not work for tensor-parallel > 1 #2619

Closed lroberts7 closed 9 months ago

lroberts7 commented 9 months ago

I have a local dev build on commit

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ git log -n 1
commit 5265631d15d59735152c8b72b38d960110987f10 (HEAD -> main, origin/main, origin/HEAD)
Author: Vladimir <vladimir.ovsyannikov@gmail.com>
Date:   Fri Jan 26 08:48:17 2024 +0100

    use a correct device when creating OptionalCUDAGuard (#2583)

and I have some local code that is a thin wrapper around the LLM class.
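The wrapper itself is not included in this report. As a rough, hypothetical sketch of what such a thin wrapper might look like: the helper names `build_llm_kwargs` and `make_llm` are made up for illustration, the paths and the AWQ setting are copied from the engine config logged further down, and it assumes `vllm.LLM` accepts these keyword arguments.

```python
from typing import Any, Dict


def build_llm_kwargs(model: str, tokenizer: str,
                     tensor_parallel_size: int = 1) -> Dict[str, Any]:
    """Collect the keyword arguments for vllm.LLM (hypothetical helper)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    return {
        "model": model,
        "tokenizer": tokenizer,
        "quantization": "awq",  # the model in this report is AWQ-quantized
        "tensor_parallel_size": tensor_parallel_size,
    }


def make_llm(**kwargs: Any):
    """Create the engine. vllm is imported lazily so that the helper above
    can be used on a machine without vllm or GPUs."""
    from vllm import LLM
    return LLM(**kwargs)


kwargs = build_llm_kwargs(
    "/home/lroberts/NexusRaven-13B-AWQ/",
    "/home/lroberts/NexusRaven-13B-AWQ/presaved_tokenizer",
    tensor_parallel_size=2,
)
# llm = make_llm(**kwargs)  # needs vllm installed and at least 2 GPUs
```

With `tensor_parallel_size=1` no Ray instance is started, which is consistent with only the multi-GPU run hitting the Ray dashboard error.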

If I run this with tensor-parallel == 2, I get the following:

lroberts@GPU77B9:~/llm_quantization$ FLASK_APP=quantized_flask_app.py FLASK_ENV=debug python3.10 -m flask run 
 * Serving Flask app 'quantized_flask_app.py' (lazy loading)
 * Environment: debug
 * Debug mode: off
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
INFO 2024-01-26 22:03:13,343 abc_etal.py:195 unknown_model_name:unknown_model_version
                             Hello! logging initialized, starting up... 
INFO 2024-01-26 22:03:13,343 abc_etal.py:196 unknown_model_name:unknown_model_version
                             Git commit of model: unknown_git_commit 
INFO 2024-01-26 22:03:13,343 abc_etal.py:197 unknown_model_name:unknown_model_version
                             Git commit of cuda torch base: unknown_git_commit 
INFO 2024-01-26 22:03:14,921 abc_etal.py:200 unknown_model_name:unknown_model_version
                             Compute device available: cuda 
WARNING 01-26 22:03:16 config.py:506] Casting torch.bfloat16 to torch.float16.
WARNING 01-26 22:03:16 config.py:176] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-26 22:03:18,650 ERROR services.py:1329 -- Failed to start the dashboard , return code 1
2024-01-26 22:03:18,650 ERROR services.py:1354 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2024-01-26 22:03:18,651 ERROR services.py:1398 -- 
The last 20 lines of /tmp/ray/session_2024-01-26_22-03-16_731996_3725694/logs/dashboard.log (it contains the error message from the dashboard): 
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 16, in <module>
    from ray.job_submission import JobStatus, JobSubmissionClient
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/job_submission/__init__.py", line 2, in <module>
    from ray.dashboard.modules.job.pydantic_models import DriverInfo, JobDetails, JobType
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/pydantic_models.py", line 4, in <module>
    from ray._private.pydantic_compat import BaseModel, Field, PYDANTIC_INSTALLED
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 100, in <module>
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 58, in monkeypatch_pydantic_2_for_cloudpickle
    pydantic._internal._model_construction.SchemaSerializer = (
AttributeError: module 'pydantic._internal' has no attribute '_model_construction'
2024-01-26 22:03:18,879 INFO worker.py:1673 -- Started a local Ray instance.
[2024-01-26 22:03:19,820 E 3725694 3725694] core_worker.cc:205: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

However, tensor-parallel == 1 works fine:

lroberts@GPU77B9:~/llm_quantization$ FLASK_APP=quantized_flask_app.py FLASK_ENV=debug python3.10 -m flask run 
 * Serving Flask app 'quantized_flask_app.py' (lazy loading)
 * Environment: debug
 * Debug mode: off
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
INFO 2024-01-26 22:04:03,519 abc_etal.py:195 unknown_model_name:unknown_model_version
                             Hello! logging initialized, starting up... 
INFO 2024-01-26 22:04:03,519 abc_etal.py:196 unknown_model_name:unknown_model_version
                             Git commit of model: unknown_git_commit 
INFO 2024-01-26 22:04:03,519 abc_etal.py:197 unknown_model_name:unknown_model_version
                             Git commit of cuda torch base: unknown_git_commit 
INFO 2024-01-26 22:04:05,098 abc_etal.py:200 unknown_model_name:unknown_model_version
                             Compute device available: cuda 
WARNING 01-26 22:04:06 config.py:506] Casting torch.bfloat16 to torch.float16.
WARNING 01-26 22:04:06 config.py:176] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 01-26 22:04:06 llm_engine.py:72] Initializing an LLM engine with config: model='/home/lroberts/NexusRaven-13B-AWQ/', tokenizer='/home/lroberts/NexusRaven-13B-AWQ/presaved_tokenizer', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-26 22:04:23 llm_engine.py:316] # GPU blocks: 4145, # CPU blocks: 327
INFO 01-26 22:04:27 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-26 22:04:27 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-26 22:04:33 model_runner.py:689] Graph capturing finished in 6 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 2024-01-26 22:04:33,205 abc_etal.py:231 unknown_model_name:unknown_model_version
                             Startup completed! 
INFO 2024-01-26 22:04:33,207 _internal.py:224 unknown_model_name:unknown_model_version
                             WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on 
INFO 2024-01-26 22:04:33,207 _internal.py:224 unknown_model_name:unknown_model_version
                             Press CTRL+C to quit 
[OpenAIMessage(role='system', content='You are a helpful assistant.'), OpenAIMessage(role='user', content='Tell me a few reasons why someone might consider higher education. Do not repeat yourself. Response:  ')]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.11s/it]
INFO 2024-01-26 22:05:05,684 _internal.py:224 unknown_model_name:unknown_model_version
                    - - [26/Jan/2024 22:05:05] "POST /sequence-generation/chat/json HTTP/1.1" 200 - 

The request is a simple curl call that looks like this:
curl -v --trace-time -X POST -H "Content-Type: application/json" --data '{"max_tokens": 500, "messages": [{"content": "You are a helpful assistant.","role": "system"}, {"content": "Tell me a few reasons why someone might consider higher education. Do not repeat yourself. Response:  ","role": "user"}], "model": "gpt-3.5-turbo", "temperature": 0}' http://localhost:5000/sequence-generation/chat/json

with response:

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"  There are many reasons why someone might consider higher education. Here are a few:\n\n1. To gain knowledge and skills: Higher education provides students with the opportunity to learn new knowledge and skills that can be applied in their future careers.\n2. To prepare for a career: Many people choose to pursue higher education because it is a way to prepare for a specific career. For example, a student may choose to study business because they want to work in the field.\n3. To gain a competitive edge: Higher education can provide students with a competitive edge in the job market. Many employers require a degree from a reputable institution, and having one can make a candidate more attractive to potential employers.\n4. To develop critical thinking and problem-solving skills: Higher education provides students with the opportunity to develop their critical thinking and problem-solving skills.\n5. To gain a sense of community: Higher education provides students with the opportunity to connect with other students and faculty members, which can help to create a sense of community.\n6. To gain a sense of purpose: Higher education can provide students with a sense of purpose and direction in life.\n7. To gain a sense of accomplishment: Higher education can provide students with a sense of accomplishment and pride in their achievements.\n8. To gain a sense of personal growth: Higher education can provide students with the opportunity to grow and develop as individuals.\n9. To gain a sense of independence: Higher education can provide students with the opportunity to become independent and self-sufficient.\n10. 
To gain a sense of fulfillment: Higher education can provide students with a sense of fulfillment and satisfaction in their lives.\n\nOverall, higher education can provide students with a wide range of benefits, including the opportunity to gain knowledge and skills, prepare for a career, gain a competitive edge, develop critical thinking and problem-solving skills, gain a sense of community, gain a sense of purpose, gain a sense of accomplishment, gain a sense of personal growth, gain a sense of independence, and gain a sense of fulfillment.","role":"assistant"}}],"created":1706306706,"id":"llama-2-7b-chat-hf","object":"chat.completion","usage":{"completion_tokens":457,"prompt_tokens":49,"total_tokens":506}}
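For anyone who wants to reproduce the request from Python instead of curl, here is a minimal sketch using only the standard library. The URL and payload are copied from the curl command above; the helper names `build_request` and `send` are illustrative, and the endpoint is the one served by my local Flask wrapper, not a vLLM route.

```python
import json
import urllib.request

URL = "http://localhost:5000/sequence-generation/chat/json"

payload = {
    "max_tokens": 500,
    "messages": [
        {"content": "You are a helpful assistant.", "role": "system"},
        {
            "content": "Tell me a few reasons why someone might consider "
                       "higher education. Do not repeat yourself. Response:  ",
            "role": "user",
        },
    ],
    "model": "gpt-3.5-turbo",
    "temperature": 0,
}


def build_request(url: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_request(URL, payload)
# reply = send(request)  # requires the Flask app to be running locally
# print(reply["choices"][0]["message"]["content"])
```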

The error in the Ray dashboard log indicates a serialization-related problem (the pydantic monkeypatch for cloudpickle fails):

2024-01-26 21:35:42,363 INFO utils.py:112 -- Get all modules by type: DashboardHeadModule
2024-01-26 21:35:42,407 INFO utils.py:123 -- Module ray.dashboard.modules.actor.actor_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,429 INFO utils.py:123 -- Module ray.dashboard.modules.event.event_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'grpc'
2024-01-26 21:35:42,430 INFO utils.py:123 -- Module ray.dashboard.modules.event.event_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,431 INFO utils.py:123 -- Module ray.dashboard.modules.healthz.healthz_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,431 INFO utils.py:123 -- Module ray.dashboard.modules.healthz.healthz_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,450 ERROR dashboard.py:259 -- The dashboard on node GPU77B9 failed with the following error:
Traceback (most recent call last):
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/dashboard.py", line 248, in <module>
    loop.run_until_complete(dashboard.run())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/dashboard.py", line 75, in run
    await self.dashboard_head.run()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/head.py", line 325, in run
    modules = self._load_modules(self._modules_to_load)
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/head.py", line 219, in _load_modules
    head_cls_list = dashboard_utils.get_all_modules(DashboardHeadModule)
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/utils.py", line 121, in get_all_modules
    importlib.import_module(name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 16, in <module>
    from ray.job_submission import JobStatus, JobSubmissionClient
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/job_submission/__init__.py", line 2, in <module>
    from ray.dashboard.modules.job.pydantic_models import DriverInfo, JobDetails, JobType
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/pydantic_models.py", line 4, in <module>
    from ray._private.pydantic_compat import BaseModel, Field, PYDANTIC_INSTALLED
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 100, in <module>
    monkeypatch_pydantic_2_for_cloudpickle()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 58, in monkeypatch_pydantic_2_for_cloudpickle
    pydantic._internal._model_construction.SchemaSerializer = (
AttributeError: module 'pydantic._internal' has no attribute '_model_construction'

Relevant details about the environment:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import pydantic; print(pydantic.__version__)"
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import ray; print(ray.__version__)"
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import torch; print(torch.__version__)"

It seems there is a known fix or workaround here: https://github.com/ray-project/ray/issues/41913#issuecomment-1856452166

However, it seems that pydantic version 2 is necessary for openai testing.

Is there a suggested workaround, or should I manually downgrade pydantic to a version lower than 2.0.0?

lroberts7 commented 9 months ago

nvidia device info:

lroberts@GPU77B9:~/llm_quantization$ nvidia-smi 
Fri Jan 26 22:08:29 2024       
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB           On | 00000000:07:00.0 Off |                    0 |
| N/A   34C    P0               74W / 400W|  62027MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
|   1  NVIDIA A100-SXM4-80GB           On | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0               65W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
|   2  NVIDIA A100-SXM4-80GB           On | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0               64W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
|   3  NVIDIA A100-SXM4-80GB           On | 00000000:4D:00.0 Off |                    0 |
| N/A   35C    P0               68W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
|   4  NVIDIA A100-SXM4-80GB           On | 00000000:87:00.0 Off |                    0 |
| N/A   36C    P0               68W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
|   5  NVIDIA A100-SXM4-80GB           On | 00000000:8D:00.0 Off |                    0 |
| N/A   33C    P0               69W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
|   6  NVIDIA A100-SXM4-80GB           On | 00000000:C7:00.0 Off |                    0 |
| N/A   32C    P0               66W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
|   7  NVIDIA A100-SXM4-80GB           On | 00000000:CA:00.0 Off |                    0 |
| N/A   35C    P0               67W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |

| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|    0   N/A  N/A   3727132      C   python3.10                                62014MiB |

yippp commented 9 months ago

I also hit "module 'pydantic._internal' has no attribute '_model_construction'" on the newest dev version.

simon-mo commented 9 months ago

Looks like this is fixed in Ray 2.9 https://github.com/ray-project/ray/issues/41913#issuecomment-1858319354. Try upgrading Ray? We will make sure to lower bound the Ray version as well.
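A quick way to check whether an environment is likely affected, as a sketch based only on this thread: the 2.9 threshold comes from the comment above, and treating every pydantic 2.x release as affected is an assumption (the linked Ray issue has the exact version ranges). The function names are illustrative.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.9.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_ray_upgrade(ray_version: str, pydantic_version: str) -> bool:
    """True when pydantic 2.x is paired with a Ray release older than 2.9.

    Assumption: any pydantic 2.x may trigger the dashboard failure.
    """
    uses_pydantic_2 = major_minor(pydantic_version)[0] >= 2
    return uses_pydantic_2 and major_minor(ray_version) < (2, 9)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


ray_v = installed_version("ray")
pydantic_v = installed_version("pydantic")
if ray_v is None or pydantic_v is None:
    print("ray and/or pydantic is not installed in this environment")
elif needs_ray_upgrade(ray_v, pydantic_v):
    print(f"ray {ray_v} with pydantic {pydantic_v}: consider upgrading ray to >= 2.9")
else:
    print(f"ray {ray_v} with pydantic {pydantic_v}: combination looks fine")
```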

yippp commented 9 months ago

Looks like this is fixed in Ray 2.9 ray-project/ray#41913 (comment). Try upgrading Ray? We will make sure to lower bound the Ray version as well.

I have tried ray 2.9.1 with the newest dev code:

vllm.entrypoints.openai.api_server --model ./Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host --port 8081 --tensor-parallel-size 2

but I hit another error:

Failed: Cuda error /home/ysq/vllm/csrc/custom_all_reduce.cuh:417 'resource already mapped'
Segmentation fault (core dumped)

simon-mo commented 9 months ago

That seems to be a different issue; please open another ticket and I can try reproducing it.

Update: I did try to reproduce it. With the latest main branch, the server starts and handles a request:

INFO 01-28 23:03:50 llm_engine.py:317] # GPU blocks: 15130, # CPU blocks: 4096
INFO 01-28 23:03:52 model_runner.py:626] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-28 23:03:52 model_runner.py:630] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=5246) INFO 01-28 23:03:52 model_runner.py:626] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=5246) INFO 01-28 23:03:52 model_runner.py:630] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-28 23:03:58 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
INFO 01-28 23:03:58 model_runner.py:691] Graph capturing finished in 6 secs.
(RayWorkerVllm pid=5246) INFO 01-28 23:03:58 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
(RayWorkerVllm pid=5246) INFO 01-28 23:03:58 model_runner.py:691] Graph capturing finished in 6 secs.
INFO 01-28 23:03:58 serving_chat.py:260] Using default chat template:
INFO 01-28 23:03:58 serving_chat.py:260] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO:     Started server process [3303]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on (Press CTRL+C to quit)
INFO 01-28 23:04:08 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:18 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:28 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:38 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:48 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:58 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:08 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:18 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:28 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:38 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:48 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:58 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:06:08 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:06:10 async_llm_engine.py:431] Received request cmpl-2b94e87fa6e5414b9b2369ec6f77e666: prompt: '<s>[INST] Say this is a test! [/INST]', prefix_pos: None,sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32753, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [1, 1, 733, 16289, 28793, 15753, 456, 349, 264, 1369, 28808, 733, 28748, 16289, 28793], lora_request: None.
INFO 01-28 23:06:11 async_llm_engine.py:110] Finished request cmpl-2b94e87fa6e5414b9b2369ec6f77e666.
INFO: - "POST /v1/chat/completions HTTP/1.1" 200 OK

lroberts7 commented 8 months ago

Looks like this is fixed in Ray 2.9 ray-project/ray#41913 (comment). Try upgrading Ray? We will make sure to lower bound the Ray version as well.

I have tried ray 2.9.1 with the newest dev code:

vllm.entrypoints.openai.api_server --model ./Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host --port 8081 --tensor-parallel-size 2

but I hit another error:

Failed: Cuda error /home/ysq/vllm/csrc/custom_all_reduce.cuh:417 'resource already mapped'
Segmentation fault (core dumped)

As a workaround, I think you could pass the --disable-custom-all-reduce flag so that the custom all-reduce kernels are not used.
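A small sketch of launching the server with that flag from Python. The flags are the ones already used in this thread; `--host` is omitted because its value is not shown above, and the helper name `build_server_command` is illustrative.

```python
import shlex
import sys
from typing import List


def build_server_command(model: str, tensor_parallel_size: int,
                         port: int = 8081,
                         disable_custom_all_reduce: bool = True) -> List[str]:
    """Assemble the argv for the OpenAI-compatible vLLM server."""
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--quantization", "awq",
        "--dtype", "auto",
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if disable_custom_all_reduce:
        # Skip the custom all-reduce kernels that raised
        # 'resource already mapped' in the report above.
        cmd.append("--disable-custom-all-reduce")
    return cmd


command = build_server_command("./Mistral-7B-Instruct-v0.2-AWQ", 2)
print(shlex.join(command))
# To actually start the server (needs vllm and 2 GPUs):
# import subprocess; subprocess.run(command, check=True)
```

I have not measured the performance impact of turning the custom kernels off.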

simon-mo commented 8 months ago

I believe #2642 might fix "resource already mapped". Please try with the latest main. Sorry about the back and forth.

lroberts7 commented 8 months ago

@simon-mo This works for me: python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host --port 8081 --tensor-parallel-size 2

some stdout:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host --port 8081 --tensor-parallel-size 2 
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
INFO 01-29 22:22:09 api_server.py:209] args: Namespace(host='', port=8081, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 01-29 22:22:09 config.py:177] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-29 22:22:12,853 INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-29 22:22:14 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, seed=0)
(raylet) /usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
(raylet)   warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
INFO 01-29 22:22:22 weight_utils.py:164] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:22 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 01-29 22:22:26 llm_engine.py:322] # GPU blocks: 66274, # CPU blocks: 4096
INFO 01-29 22:22:29 model_runner.py:632] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-29 22:22:29 model_runner.py:636] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:29 model_runner.py:632] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:29 model_runner.py:636] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-29 22:22:33 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
INFO 01-29 22:22:33 model_runner.py:698] Graph capturing finished in 4 secs.
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:33 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:33 model_runner.py:698] Graph capturing finished in 4 secs.
INFO 01-29 22:22:34 serving_chat.py:260] Using default chat template:
INFO 01-29 22:22:34 serving_chat.py:260] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO:     Started server process [3928411]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on (Press CTRL+C to quit)
INFO 01-29 22:22:44 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-29 22:22:54 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-29 22:23:04 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-29 22:23:14 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

It also works for tensor-parallel 8 (all the GPUs on this machine).

specs on A100 machine I'm using:

 NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |

python env:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import vllm, ray, torch, pydantic; print(vllm.__version__); print(ray.__version__); print(torch.__version__); print(pydantic.__version__)"
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ git log -n 2
commit ea8489fce266d69f2fbe314c1385956b1a342e12 (HEAD -> main, origin/main, origin/HEAD)
Author: Rasmus Larsen <rlarsen@pm.me>
Date:   Mon Jan 29 19:52:31 2024 +0100

    ROCm: Allow setting compilation target (#2581)

commit 1b20639a43e811f4469e3cfa543cf280d0d76265
Author: Hanzhi Zhou <hanzhi713@163.com>
Date:   Tue Jan 30 02:46:29 2024 +0800

    No repeated IPC open (#2642)

I don't think the in-place error (https://github.com/vllm-project/vllm/issues/2620) is resolved, though. I still see that one.