vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Error when using nsys profile #3247

Open sleepwalker2017 opened 6 months ago

sleepwalker2017 commented 6 months ago

I want to use Nsight Systems to profile vLLM.

Here is my code. It seems that Ray is not used by default, is that right? But my nsys profile still fails. Why is that?

Thank you.

from vllm import EngineArgs, LLMEngine

def initialize_engine() -> LLMEngine:
    """Initialize the LLMEngine."""
    # max_loras: controls the number of LoRAs that can be used in the same
    #   batch. Larger numbers will cause higher memory usage, as each LoRA
    #   slot requires its own preallocated tensor.
    # max_lora_rank: controls the maximum supported rank of all LoRAs. Larger
    #   numbers will cause higher memory usage. If you know that all LoRAs will
    #   use the same rank, it is recommended to set this as low as possible.
    # max_cpu_loras: controls the size of the CPU LoRA cache.
    engine_args = EngineArgs(model="/data/models/vicuna-7b-v1.5/",
                             enable_lora=True,
                             max_loras=1,
                             max_lora_rank=8,
                             max_cpu_loras=2,
                             max_num_seqs=256,
                             enforce_eager=True
                             )
    return LLMEngine.from_engine_args(engine_args)
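
For completeness, the engine is then driven with a plain add_request/step loop; the snippet below is only a minimal sketch assuming the vLLM 0.3.x LLMEngine API, and the prompt and sampling parameters in it are illustrative rather than the exact ones from my run.

from vllm import EngineArgs, LLMEngine, SamplingParams

def run_once(engine: LLMEngine) -> None:
    # Submit a single prompt and step the engine until it finishes.
    params = SamplingParams(temperature=0.0, max_tokens=1,
                            logprobs=1, prompt_logprobs=1)
    engine.add_request("0", "A robot may not injure a human being", params)
    while engine.has_unfinished_requests():
        for output in engine.step():
            if output.finished:
                print(output)

if __name__ == "__main__":
    # initialize_engine() is the function defined above.
    run_once(initialize_engine())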
root@bms-airtrunk-d-g18v3-app-10-192-82-3:/data/vllm# nsys nvprof python s-lora.py
WARNING: python and any of its children processes will be profiled.

INFO 03-07 03:26:52 llm_engine.py:87] Initializing an LLM engine with config: model='/data/models/vicuna-7b-v1.5/', tokenizer='/data/models/vicuna-7b-v1.5/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-07 03:27:11 llm_engine.py:357] # GPU blocks: 1016, # CPU blocks: 512
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 139810.13it/s]
RequestOutput(request_id=0, prompt='A robot may not injure a human being', prompt_token_ids=[1, 319, 19964, 1122, 451, 10899, 545, 263, 5199, 1641], prompt_logprobs=[None, {319: -12.332597732543945, 23196: -4.162187099456787}, {19964: -9.50893783569336, 29889: -3.8224148750305176}, {1122: -4.00117826461792, 293: -1.6261781454086304}, {451: -1.7928853034973145, 367: -1.1210103034973145}, {10899: -0.7501230239868164}, {545: -0.00014494798961095512}, {263: -0.0598083920776844}, {5199: -0.056958943605422974}, {1641: -0.07977178692817688}], outputs=[CompletionOutput(index=0, text=' or', token_ids=[470], cumulative_logprob=-0.11687355488538742, logprobs=[{470: -0.11687355488538742}], finish_reason=length)], finished=True, metrics=RequestMetrics(arrival_time=9761217.903600257, last_token_time=9761218.13178936, first_scheduled_time=1709782039.5163615, first_token_time=1709782039.7174356, time_in_queue=1700020821.6127613, finished_time=1709782039.7178597), lora_request=None)
Generating '/tmp/nsys-report-4d3a.qdstrm'
[1/7] [========================100%] report2.nsys-rep
Importer error status: Importation succeeded with non-fatal errors.
**** Analysis failed with:
Status: TargetProfilingFailed
Props {
  Items {
    Type: DeviceId
    Value: "Local (CLI)"
  }
}
Error {
  Type: RuntimeError
  SubError {
    Type: ProcessEventsError
    Props {
      Items {
        Type: ErrorText
        Value: "/build/agent/work/323cb361ab84164c/QuadD/Host/Analysis/Modules/TraceProcessEvent.cpp(45): Throw in function const string& {anonymous}::GetCudaCallbackName(bool, uint32_t, const QuadDAnalysis::More
Injection&)\nDynamic exception type: boost::wrapexcept<QuadDCommon::InvalidArgumentException>\nstd::exception::what: InvalidArgumentException\n[QuadDCommon::tag_message*] = Unknown runtime API function index: 440
\n"
      }
    }
  }
}

**** Errors occurred while processing the raw events. ****
**** Please see the Diagnostics Summary page after opening the report file in GUI. ****
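
One workaround I may also try (a sketch only, not something I have verified): gating the nsys capture window with the CUDA profiler API so that only the generation loop is traced instead of the whole process, including model loading. The cudaProfilerStart/cudaProfilerStop wrappers below are standard PyTorch APIs; whether a narrower capture actually avoids this importer error is an assumption on my part.

# Launch with a gated capture range, e.g.:
#   nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop python s-lora.py
import torch

def profiled_run(engine):
    # Only the region between these two calls is captured by nsys.
    torch.cuda.cudart().cudaProfilerStart()
    while engine.has_unfinished_requests():
        engine.step()
    torch.cuda.cudart().cudaProfilerStop()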
mwbyeon commented 6 months ago

What's the version of nsys? I had a similar error with 2024.1.1, but it worked fine with 2023.4.1.

sleepwalker2017 commented 5 months ago

What's the version of nsys? I had a similar error with 2024.1.1, but it worked fine with 2023.4.1.

NVIDIA Nsight Systems version 2023.3.1.92-233133147223v0

Did you just switch to a different nsys version, with no other modifications?

mwbyeon commented 5 months ago

In my case, nsys threw an error when TP>1. It was solved by just changing the nsys version without any other modifications.

sleepwalker2017 commented 5 months ago

In my case, nsys threw an error when TP>1. It was solved by just changing the nsys version without any other modifications.

Hi, does it throw an error when you use the latest version with TP=1?

mwbyeon commented 5 months ago

Tested on v0.3.2, not the latest version of the main branch.

$ nsys --version
NVIDIA Nsight Systems version 2023.4.1.97-234133557503v0

$ python -c 'import vllm; print(vllm.__version__)'
0.3.2

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

from vllm import LLMEngine, EngineArgs

def initialize_engine() -> LLMEngine:
    engine_args = EngineArgs(model="lmsys/vicuna-7b-v1.5",
                             max_num_seqs=256,
                             enforce_eager=True
                             )
    return LLMEngine.from_engine_args(engine_args)

initialize_engine()
$ nsys nvprof python test.py
WARNING: python and any of its children processes will be profiled.

INFO 03-11 20:36:23 llm_engine.py:79] Initializing an LLM engine with config: model='lmsys/vicuna-7b-v1.5', tokenizer='lmsys/vicuna-7b-v1.5', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-11 20:36:30 weight_utils.py:163] Using model weights format ['*.bin']
INFO 03-11 20:36:39 llm_engine.py:337] # GPU blocks: 7381, # CPU blocks: 512
Generating '/tmp/nsys-report-9d38.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)   Style              Range
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -------  ---------------------------
     63.5        986343987          1  986343987.0  986343987.0  986343987  986343987          0.0  PushPop  NCCL:ncclCommInitRankConfig
     32.8        509790568          1  509790568.0  509790568.0  509790568  509790568          0.0  PushPop  NCCL:ncclCommAbort
      3.7         57223271          2   28611635.5   28611635.5        902   57222369   40461687.3  PushPop  NCCL:ncclGroupEnd
      0.0             6502          2       3251.0       3251.0       2074       4428       1664.5  PushPop  NCCL:ncclGroupStart
      0.0             5621          1       5621.0       5621.0       5621       5621          0.0  PushPop  NCCL:ncclAllReduce

... <truncated>
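
If it is useful, the nvtx_sum table above can also be regenerated offline from the saved report; this assumes the report files produced by the run above:

$ nsys stats --report nvtx_sum report1.nsys-rep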
sleepwalker2017 commented 5 months ago

Tested on v0.3.2, not the latest version of the main branch.

Thank you! I tried with TP=1, and nsys profile ran well in that case.