vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Trying to run vllm inference behind a FastAPI server, but it gets stuck #3747

Closed sigridjineth closed 1 month ago

sigridjineth commented 3 months ago

Your current environment

A100 x 8, Ubuntu

🐛 Describe the bug

Hello, I am trying to run vllm inference behind a FastAPI server, but it gets stuck at Using model weights format ['*.safetensors']. Is anyone else experiencing this?

2024-03-31 02:05:20,110 INFO sqlalchemy.engine.Engine COMMIT
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8009 (Press CTRL+C to quit)
2024-03-31 02:05:21,902 INFO worker.py:1752 -- Started a local Ray instance.
/home/sionic/sigrid/logickor-pipeline/logickor_uv_pipeline/services/generator.py:21: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  SINGLE_TURN_TEMPLATE, DOUBLE_TURN_TEMPLATE = df_config[0], df_config[1]
INFO 03-31 02:05:23 config.py:433] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-03-31 02:05:23,253 INFO worker.py:1585 -- Calling ray.init() again after it has already been called.
INFO 03-31 02:05:23 llm_engine.py:87] Initializing an LLM engine with config: model='maywell/Synatra-kiqu-7B', tokenizer='maywell/Synatra-kiqu-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 03-31 02:05:57 weight_utils.py:163] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=872183) INFO 03-31 02:05:57 weight_utils.py:163] Using model weights format ['*.safetensors']

The code I am using is shown below.

import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

# app-specific helpers (create_db_and_tables, close_db, AsyncSessionLocal,
# process_evaluation_requests) are imported from the project's own modules

@asynccontextmanager
async def lifespan(app: FastAPI):
    background_task = asyncio.create_task(start_background_process())
    await create_db_and_tables()
    yield
    background_task.cancel()
    try:
        await background_task
    except asyncio.CancelledError:
        pass
    await close_db()

async def start_background_process():
    while True:
        async with AsyncSessionLocal() as session:
            try:
                await process_evaluation_requests(session)
            except Exception as e:
                print(f"Error processing request: {e}")
            finally:
                await asyncio.sleep(1)

-----------------------------------------
import asyncio

from sqlmodel import select
from sqlmodel.ext.asyncio.session import AsyncSession

from logickor_uv_pipeline.models.evaluation.request import Evaluation
from logickor_uv_pipeline.services.generator import generate

async def process_evaluation_requests(session: AsyncSession):
    output_path = "./output/generate"
    while True:
        async with session.begin():
            statement = select(Evaluation).where(Evaluation.status == "pending")
            results = await session.execute(statement)
            requests = results.scalars().all()

            for request in requests:
                try:
                    output_file_name = await generate(request, output_path)
                    request.status = "success" if output_file_name else "failed"
                except Exception as e:
                    print(str(e))
                    request.status = "failed"

        await session.commit()
        await asyncio.sleep(10)

-----------------------------------------------------------------------
import os

import pandas as pd
from vllm import LLM, SamplingParams
import ray

async def generate(request, output_path):
    try:
        # Check if Ray is initialized; if not, initialize Ray
        if not ray.is_initialized():
            ray.init()

        os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
        gpu_counts = len("4,5,6,7".split(","))

        df_config = pd.read_json(
            "./logickor_uv_pipeline/services/LogicKor/templates/template-EEVE.json",
            typ="series",
        )
        SINGLE_TURN_TEMPLATE, DOUBLE_TURN_TEMPLATE = df_config[0], df_config[1]

        llm = LLM(
            model=request.model_name,
            tensor_parallel_size=gpu_counts,
            max_model_len=4096,
            gpu_memory_utilization=0.8,
        )
        sampling_params = SamplingParams(
            temperature=0,
            top_p=1,
            top_k=-1,
            early_stopping=True,
            best_of=4,
            use_beam_search=True,
            skip_special_tokens=False,
            max_tokens=4096,
            stop=["", "</s>", "", "[INST]", "[/INST]"],
        )

        df_questions = pd.read_json(
            "./logickor_uv_pipeline/services/LogicKor/questions.jsonl", lines=True
        )

        def format_single_turn_question(question):
            return SINGLE_TURN_TEMPLATE.format(question=question)

        single_turn_questions = df_questions["question"].map(format_single_turn_question)
        single_turn_outputs = [
            output.outputs[0].text.strip()
            for output in await llm.generate(
                single_turn_questions.tolist(), sampling_params
            )
        ]

        def format_double_turn_question(question, single_turn_output):
            return DOUBLE_TURN_TEMPLATE.format(
                question=question, single_turn_output=single_turn_output
            )

        multi_turn_questions = [
            format_double_turn_question(question, single_turn_outputs[idx])
            for idx, question in enumerate(df_questions["question"])
        ]

        multi_turn_outputs = [
            output.outputs[0].text.strip()
            for output in await llm.generate(multi_turn_questions, sampling_params)
        ]

        df_output = pd.DataFrame(
            {
                "id": df_questions["id"],
                "category": df_questions["category"],
                "question": df_questions["question"],
                "single_turn_output": single_turn_outputs,
                "multi_turn_output": multi_turn_outputs,
                "reference": df_questions["reference"],
            }
        )

        output_file_name = f"{request.model_name.replace('/', '_')}.jsonl"

        df_output.to_json(
            os.path.join(output_path, output_file_name),
            orient="records",
            lines=True,
            force_ascii=False,
        )
        return output_file_name

    except Exception as e:
        # Handle any errors here
        print(f"An error occurred: {e}")
        raise e
    finally:
        # Ensure Ray is shut down to prevent issues with reinitialization
        if ray.is_initialized():
            ray.shutdown()

youkaichao commented 3 months ago

Hi, please paste your environment using https://github.com/vllm-project/vllm/blob/main/collect_env.py, so that we can help you better.

sigridjineth commented 3 months ago

@youkaichao I tried to run it but got this error. Can you help me out?

(sionic) sionic@iZmj7ir0ircgij46j89st9Z:~/sigrid/vllm$ python ./collect_env.py
Collecting environment information...
Traceback (most recent call last):
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 719, in <module>
    main()
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 698, in main
    output = get_pretty_env_info()
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 693, in get_pretty_env_info
    return pretty_str(get_env_info())
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 499, in get_env_info
    pip_version, pip_list_output = get_pip_packages(run_lambda)
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 469, in get_pip_packages
    out = run_with_pip([sys.executable, '-mpip'])
  File "/home/sionic/sigrid/vllm/./collect_env.py", line 465, in run_with_pip
    return "\n".join(line for line in out.splitlines()
AttributeError: 'NoneType' object has no attribute 'splitlines'

youkaichao commented 3 months ago

This is strange. Your environment might be broken. What happens when you manually execute python -mpip list --format=freeze?

sigridjineth commented 3 months ago

@youkaichao I am using uv, which is a Rust-based Python package manager.

Here is the uv pip freeze output:

(logickor-pipeline) sionic@iZmj7ir0ircgij46j89st9Z:~/sigrid/logickor-pipeline$ uv pip freeze
aiosignal==1.3.1
aiosqlite==0.20.0
annotated-types==0.6.0
anyio==4.3.0
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cupy-cuda12x==12.1.0
diskcache==5.6.3
distro==1.9.0
dnspython==2.6.1
email-validator==2.1.1
exceptiongroup==1.2.0
fastapi==0.110.0
fastrlock==0.8.2
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.3.1
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.22.2
idna==3.6
interegular==0.3.3
isort==5.13.2
jinja2==3.1.3
joblib==1.3.2
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
lark==1.1.9
llvmlite==0.42.0
markupsafe==2.1.5
mpmath==1.3.0
msgpack==1.0.8
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
numba==0.59.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
openai==1.14.3
outlines==0.0.37
packaging==24.0
pandas==2.2.1
prometheus-client==0.20.0
protobuf==5.26.1
psutil==5.9.8
pydantic==2.6.4
pydantic-core==2.16.3
pydantic-settings==2.2.1
pynvml==11.5.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
pyyaml==6.0.1
ray==2.10.0
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
rpds-py==0.18.0
ruff==0.3.4
safetensors==0.4.2
scipy==1.12.0
sentencepiece==0.2.0
six==1.16.0
sniffio==1.3.1
sqlalchemy==2.0.29
sqlmodel==0.0.16
starlette==0.36.3
sympy==1.12
tokenizers==0.15.2
torch==2.1.2
tqdm==4.66.2
transformers==4.39.2
triton==2.1.0
typing-extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
uvicorn==0.29.0
uvloop==0.19.0
vllm==0.3.3
watchfiles==0.21.0
websockets==12.0
xformers==0.0.23.post1

sigridjineth commented 3 months ago

@youkaichao this issue also happens in a Docker container.

youkaichao commented 3 months ago

I don't know if uv is supported by vllm (most likely not). I would recommend using conda instead.

sigridjineth commented 3 months ago

@youkaichao uv uses virtualenv under the hood, so do you mean only conda is supported for the vllm library?

youkaichao commented 3 months ago

I would say conda is the most tested, and I wouldn't be surprised if virtualenv or uv does not work for vllm.

sigridjineth commented 3 months ago

Okay, is anyone else running vllm in a Docker setup?

Most of the time I run into An error occurred: NCCLBackend is not available. Please install cupy. when initializing the LLM instance in a Docker container.

youkaichao commented 3 months ago

First, I suggest you switch to conda; the problem might be improper package management leaving a dependency like cupy corrupted.

Second, which version of vllm are you using? We recently removed the cupy dependency and also released v0.4.0. You can try the new version.

sigridjineth commented 3 months ago

@youkaichao Okay, I will try the new version.

robertgshaw2-neuralmagic commented 3 months ago

@sigridjineth Just curious: why not run with the vllm API server instead of rebuilding your own?

The API server code you have written is not the right way to use the LLM class. In your /generate method, you are creating a whole new instance of an LLM, which [loads the model weights from disk, runs the profiler steps to see how much memory there is, allocates the full KV cache, etc.]. Since each request is passed to generate, every request will take a long time :)

The way our API server works is that we [load the model weights from disk, run the profiler steps to see how much memory there is, allocate the full KV cache] once, and then at inference time we reuse this state. If you really do need to build an API server yourself rather than using the interfaces we provide, I would suggest looking at vllm/entrypoints/api_server.py for inspiration on how to do things properly.

But you should have a very good reason for rebuilding this yourself.
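
Roughly, that pattern looks like the sketch below (a minimal example modeled loosely on vllm/entrypoints/api_server.py, not the full server; the model name, tensor-parallel size, and sampling settings are placeholders):

import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

engine = None  # built once at startup, shared by every request


@asynccontextmanager
async def lifespan(app: FastAPI):
    global engine
    # Load the weights, run the memory profiler, and allocate the KV cache exactly once.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="maywell/Synatra-kiqu-7B", tensor_parallel_size=4)
    )
    yield


app = FastAPI(lifespan=lifespan)


@app.post("/generate")
async def generate(prompt: str):
    sampling_params = SamplingParams(temperature=0, max_tokens=512)
    # AsyncLLMEngine.generate yields incremental RequestOutput objects; keep the last one.
    final_output = None
    async for request_output in engine.generate(prompt, sampling_params, random_uuid()):
        final_output = request_output
    return {"text": [output.text for output in final_output.outputs]}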

tsvisab commented 2 months ago

Hey @sigridjineth, regarding your "stuck init" issue, how are you starting your container? Are you by any chance running the container using SageMaker or Vertex AI? In any case, I would guess that you are probably lacking shared memory for GPU inter-communication, so if you start Docker directly, run it with --shm-size="SOME_SIZEgb". Also, make sure the container has enough storage for downloading the model shards; with vLLM you can do:

model = LLM(..., download_dir="/dev/shm/cache/some_sub_dir_name_if_you_wish")

And if it still fails, add this before you load the model:

        ray_tmp_dir = "/dev/shm/tmp/ray"
        os.makedirs(ray_tmp_dir, exist_ok=True)
        ray.init(_temp_dir=ray_tmp_dir, num_gpus=model_config.tensor_parallel_size)
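
Putting those pieces together, a minimal sketch (the paths, GPU count, and model name below are only illustrative, not required values):

import os

import ray
from vllm import LLM

# Keep Ray's temp dir and the downloaded shards on /dev/shm, which is the
# filesystem backed by the container's --shm-size setting.
ray_tmp_dir = "/dev/shm/tmp/ray"
os.makedirs(ray_tmp_dir, exist_ok=True)
ray.init(_temp_dir=ray_tmp_dir, num_gpus=4)

llm = LLM(
    model="maywell/Synatra-kiqu-7B",
    tensor_parallel_size=4,
    download_dir="/dev/shm/cache/model_shards",  # hypothetical cache path
)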

DarkLight1337 commented 1 month ago

We have added documentation for this situation in #5430. Please take a look.