vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: "500 Internal Server Error" after upgrade to v0.5.4 #7290

Open tonyaw opened 2 months ago

tonyaw commented 2 months ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

After I upgraded to v0.5.4, I get a "500 Internal Server Error". Here is the manifest snippet I use to start vLLM:

      containers:
      - name: 8x7b-open
        image: vllm/vllm-openai:v0.5.4
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]                                                                                                                        
        args: ["--model", "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4", "--host", "0.0.0.0", "--port", "8080", "--tensor-parallel-size", "2", "--seed", "42", "--trust-remote-code"]                      
        securityContext:                                                                                                                                                                        
          privileged: true                                                                                                                                                                      
        ports:                                                                                                                                                                                  
        - containerPort: 8080                                                                                                                                                                   
        env:                                                                                                                                                                                    
        - name: OMP_NUM_THREADS                                                                                                                                                               
          value: "2"                                                                                                                                                                          
        volumeMounts:                                                                                                                                                                           
          - mountPath: "/root/.cache"                                                                                                                                                           
            name: ceph-volume                                                                                                                                                                   
        resources:                                                                                                                                                                              
          limits:                                                                                                                                                                               
            cpu: '12'                                                                                                                                                                           
            memory: 200Gi                                                                                                                                                                       
            nvidia.com/gpu: '2'                                                                                                                                                                 
          requests:                                                                                                                                                                             
            cpu: '12'                                                                                                                                                                           
            memory: 200Gi                                                                                                                                                                       
            nvidia.com/gpu: '2'                                                                                                                                                                 
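
For reference, the 500 below comes back on an ordinary chat completion request. A minimal client call against this deployment looks roughly like the following sketch (the pod address and prompt are placeholders; the model name is copied from the args above):

import requests  # plain HTTP client; the openai SDK hits the same 500

# "<pod-ip>" is a placeholder for the pod/service address; the container listens on 8080.
resp = requests.post(
    "http://<pod-ip>:8080/v1/chat/completions",
    json={
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
print(resp.status_code)  # 500 once the server starts failing
print(resp.text)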

Backtrace log:

INFO:     10.254.17.246:59936 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 196, in generate
    with self.socket() as socket:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 59, in socket
    socket = self.context.socket(zmq.constants.DEALER)
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/context.py", line 354, in socket
    socket_class(  # set PYTHONTRACEMALLOC=2 to get the calling frame
  File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 218, in __init__
    super().__init__(context, socket_type, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 156, in __init__
    super().__init__(
  File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
zmq.error.ZMQError: Too many open files
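
The last frames point at vllm/entrypoints/openai/rpc/client.py, which opens a fresh ZMQ DEALER socket for every generate call. Below is a standalone pyzmq sketch (not vLLM code) of why that pattern ends in this error: libzmq reports EMFILE ("Too many open files") both when the process is genuinely out of file descriptors and when the per-context socket cap (zmq.MAX_SOCKETS, 1023 by default) is exceeded, so with ulimit -n at 1048576 the context cap is a plausible suspect.

import zmq

# One DEALER socket per "request", never reused or closed in this sketch.
# socket() eventually raises zmq.error.ZMQError: Too many open files.
ctx = zmq.Context()
sockets = []
try:
    while True:
        sockets.append(ctx.socket(zmq.DEALER))
except zmq.ZMQError as exc:
    print(f"socket() failed after {len(sockets)} sockets: {exc}")
finally:
    for s in sockets:
        s.close(linger=0)
    ctx.term()
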
tonyaw commented 2 months ago

Also, here is the ulimit and lsof info:

root@8x7b-open-deployment-9fb777c9d-mwq8b:/vllm-workspace# lsof | grep pt_main_t | wc -l
26295
root@8x7b-open-deployment-9fb777c9d-mwq8b:/vllm-workspace# ulimit -n
1048576
root@8x7b-open-deployment-9fb777c9d-mwq8b:/vllm-workspace#
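
The same numbers can also be read from inside a Python process, which is sometimes easier than lsof in a container. This is a generic helper, nothing vLLM-specific:

import os
import resource

# Soft/hard limit on open file descriptors for this process (what ulimit -n shows).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Number of descriptors currently open by this process (Linux only).
open_fds = len(os.listdir(f"/proc/{os.getpid()}/fd"))
print(f"open fds: {open_fds}, soft limit: {soft}, hard limit: {hard}")
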
youkaichao commented 2 months ago

cc @robertgshaw2-neuralmagic

youkaichao commented 2 months ago

@tonyaw if you want a quick workaround, you can try adding --disable-frontend-multiprocessing

tonyaw commented 2 months ago

What's the side effect of adding the "--disable-frontend-multiprocessing" parameter? Also, this isn't caused by OMP_NUM_THREADS=2, is it? I have two A100s, so OMP_NUM_THREADS should be 2, right?

Thanks in advance!

youkaichao commented 2 months ago

--disable-frontend-multiprocessing will be slower

usually people don't need to set OMP_NUM_THREADS for vLLM

robertgshaw2-neuralmagic commented 2 months ago

Thanks, I will do an analysis of how many Unix sockets are opened and see if there is anything we can do to reduce the number, since we currently open a new socket for each generate request.
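
One generic pyzmq knob worth noting while that analysis happens: the per-context socket cap can be raised explicitly, which at least moves the ceiling away from libzmq's default of 1023. This is a sketch of the option, not a statement about what the vLLM frontend actually does, and the process fd limit (ulimit -n) still has to be large enough to back it.

import zmq

ctx = zmq.Context()
# Raise the per-context socket cap to the library maximum (SOCKET_LIMIT is read-only).
# Must be set before any sockets are created on this context.
ctx.set(zmq.MAX_SOCKETS, ctx.get(zmq.SOCKET_LIMIT))
print("max sockets per context:", ctx.get(zmq.MAX_SOCKETS))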

TangJiakai commented 1 month ago

> --disable-frontend-multiprocessing will be slower
>
> usually people don't need to set OMP_NUM_THREADS for vLLM

@youkaichao @robertgshaw2-neuralmagic I have set the --disable-frontend-multiprocessing param, but I still get the following error:

File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1026, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1026, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1026, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/openai/_base_client.py", line 1041, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500 - {'detail': ''}

My vLLM version is the latest, 0.5.5, and the command is:

python -m vllm.entrypoints.openai.api_server \
        --model /data/pretrain_dir/Meta-Llama-3-8B-Instruct \
        --trust-remote-code \
        --port $port \
        --dtype auto \
        --pipeline-parallel-size 1 \
        --enforce-eager \
        --enable-prefix-caching \
        --enable-lora \
        --disable-frontend-multiprocessing
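
One note on the client trace above: it is mostly the openai SDK's automatic retry loop re-raising the same 500. Disabling retries (and, if useful, raising the timeout) makes the first failure easier to inspect. A sketch assuming the v1 openai Python client; the base_url/port is a placeholder for the setup above:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder; match the --port used above
    api_key="EMPTY",
    max_retries=0,  # surface the first 500 instead of retrying it
    timeout=120,
)
resp = client.chat.completions.create(
    model="/data/pretrain_dir/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
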
TangJiakai commented 1 month ago

The interesting thing is that, even when I send only one prompt at a time (to make sure the LLM isn't overloaded) while testing the model, generation sometimes succeeds and sometimes fails. The error when it fails is still "Error code: 500 - {'detail': ''}".

youkaichao commented 1 month ago

@TangJiakai this looks like a client side error. do you have the server side error trace?

TangJiakai commented 1 month ago

> @TangJiakai this looks like a client side error. do you have the server side error trace?

Yes, you are right! It happened on the client side.