wangpeilin opened this issue 2 months ago
Same bug observed. Exact same behavior on 8xH100 with Llama models.
Let me provide more detail on my side. I am benchmarking vLLM against TensorRT-LLM and encountered the same issue when running the benchmark on 8xH100 (the same benchmark runs normally on 8xA100 on my side).
Docker image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
Hardware: 8xH100
Reproducing commands (directly runnable inside the docker container):
export HF_TOKEN=<your HF token>
apt update
apt install -y wget unzip
# download benchmarking code
wget -O benchmarking_code.zip https://buildkite.com/organizations/vllm/pipelines/performance-benchmark/builds/8510/jobs/0191b4d9-7ae6-406f-ba11-e7d31b08cd44/artifacts/0191b5f6-2ce6-40d4-8344-beb6fc94f405
unzip benchmarking_code.zip
# remove previous results
rm -r ./benchmarks/results
VLLM_SOURCE_CODE_LOC=$(pwd) bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
This code is from the vLLM performance benchmark. It typically crashes when running the test llama8B_tp1_sonnet_512_256_qps_2.
System Info
Who can help?
@kaiyux
Information
Tasks
Reproduction
Expected behavior
All requests are processed successfully with no errors.
actual behavior
When the server has performed many inferences, for example around 5000 requests, it raises the error malloc(): unaligned tcache chunk detected, followed by Signal (6) received.
This error occurs both with continuous inference and with intermittent inference spread over a longer period (such as one day).
When I issue 8000 inferences in one test, it raises the error pinned_memory_manager.cc:170] "failed to allocate pinned system memory, falling back to non-pinned system memory". I eventually set the parameters cuda-memory-pool-byte-size and pinned-memory-pool-byte-size to 512M, which solved this problem, but these two parameters are not exposed in the script scripts/launch_triton_server.py, so I want to ask why this problem occurs and whether there is any other way to solve it.
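For context, a minimal sketch of the workaround, assuming tritonserver is started directly (or that the command built by scripts/launch_triton_server.py is edited by hand); the model repository path is a placeholder, and 536870912 bytes corresponds to the 512M value mentioned above:
# append the two pool-size flags to the tritonserver command;
# --cuda-memory-pool-byte-size takes the form <gpu-id>:<bytes>
tritonserver \
    --model-repository=/path/to/triton_model_repo \
    --pinned-memory-pool-byte-size=536870912 \
    --cuda-memory-pool-byte-size=0:536870912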
When I call the server with high concurrency, it raises the error malloc_consolidate(): unaligned fastbin chunk detected, followed by Signal (6) received.
Hope you can help me solve these problems, thanks very much!
additional notes
I suspect this is because the server does not completely release memory after each inference completes.
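A rough way to check that hypothesis (illustrative only, not part of the reproduction above) would be to log the resident memory of the tritonserver process while the benchmark runs and see whether it keeps growing between idle periods:
# sample the RSS of every tritonserver process every 30 seconds;
# steadily increasing values across idle periods would support the
# "memory is not fully released" hypothesis
while true; do
    date
    ps -C tritonserver -o pid=,rss= | awk '{printf "pid %s: %.1f MB\n", $1, $2/1024}'
    sleep 30
done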