(openchat) user@vsrv-chatgpt:~$ python -m ochat.serving.openai_api_server --model openchat/openchat_3.5 --engine-use-ray --worker-use-ray --tensor-parallel-size 2
FlashAttention not found. Install it if you need to train models.
FlashAttention not found. Install it if you need to train models.
2023-11-13 22:40:36,947 INFO worker.py:1673 -- Started a local Ray instance.
(pid=1681) FlashAttention not found. Install it if you need to train models.
(pid=1681) FlashAttention not found. Install it if you need to train models.
(AsyncTokenizer pid=1681) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2023-11-13 22:40:39,875 INFO worker.py:1507 -- Calling ray.init() again after it has already been called.
(_AsyncLLMEngine pid=1710) INFO 11-13 22:40:42 llm_engine.py:72] Initializing an LLM engine with config: model='openchat/openchat_3.5', tokenizer='openchat/openchat_3.5', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)
(_AsyncLLMEngine pid=1710) WARNING 11-13 22:40:42 config.py:226] Possibly too large swap space. 8.00 GiB out of the 11.68 GiB total CPU memory is allocated for the swap space.
(_AsyncLLMEngine pid=1710) Using blocking ray.get inside async actor. This blocks the event loop. Please use await on object ref with asyncio.gather if you want to yield execution to the event loop instead.
It freezes after startup. I'm using 2 RTX 3070 GPUs passed through via ESXi; OS: Ubuntu 22.04 Server.
This seems to be a vLLM warning, not an error message. Since model loading typically takes several minutes (depending on disk read speed), could you wait a bit longer?
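One way to tell a slow load apart from a real hang is to watch GPU memory grow in `nvidia-smi` while polling the API port until it accepts connections. A minimal sketch (the host, port, and timeout below are placeholders, not values taken from the logs above — substitute whatever your server actually listens on):

```python
import socket
import time

def wait_for_port(host: str, port: int,
                  timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll until a TCP port accepts connections, i.e. the server
    has finished loading the model and started listening.
    Returns False if the deadline passes first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(interval_s)
    return False

# Example: give the server up to 10 minutes to come up on a placeholder port.
# if wait_for_port("127.0.0.1", 18888):
#     print("server is up")
```

If the port never opens and GPU memory usage in `nvidia-smi` stays flat for many minutes, it is likely a genuine hang rather than slow loading.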