vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
Apache License 2.0

[Bug]: Issue when benchmarking the dynamically served LoRA adapter #8564

Open ducanh-ho2296 opened 1 week ago

ducanh-ho2296 commented 1 week ago

My current environment

[pip3] numpy==2.1.1
[pip3] nvidia-cublas-cu12==
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==
[pip3] nvidia-cufft-cu12==
[pip3] nvidia-curand-cu12==
[pip3] nvidia-cusolver-cu12==
[pip3] nvidia-cusparse-cu12==
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==25.1.2
[pip3] torch==2.4.1
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] numpy                     2.1.1                    pypi_0    pypi
[conda] nvidia-cublas-cu12                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12                 pypi_0    pypi
[conda] nvidia-cufft-cu12                pypi_0    pypi
[conda] nvidia-curand-cu12               pypi_0    pypi
[conda] nvidia-cusolver-cu12               pypi_0    pypi
[conda] nvidia-cusparse-cu12               pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.68                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pyzmq                     25.1.2          py311h6a678d5_0  
[conda] torch                     2.4.1                    pypi_0    pypi
[conda] transformers              4.44.2                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi

Model Input Dumps

No response

🐛 Describe the bug

I am serving a LoRA adapter dynamically with:


!curl -X POST http://address_to_model/v1/load_lora_adapter \
-H "Content-Type: application/json" \
-d '{"lora_name": "meta-llama/Meta-Llama-3.1-8B-Instruct", "lora_path": "path/to/epoch_9"}'
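For anyone reproducing this from a script rather than a notebook cell, the same request can be sketched in Python with only the standard library. The helper name `build_load_lora_request` is hypothetical; the endpoint path and JSON fields are copied from the curl command above, and `address_to_model` remains a placeholder.

```python
import json
import urllib.request


def build_load_lora_request(base_url: str, lora_name: str, lora_path: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for /v1/load_lora_adapter."""
    body = json.dumps({"lora_name": lora_name, "lora_path": lora_path}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/load_lora_adapter",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_load_lora_request(
    "http://address_to_model",                 # placeholder host from the issue
    "meta-llama/Meta-Llama-3.1-8B-Instruct",   # adapter name used in the issue
    "path/to/epoch_9",                         # placeholder adapter path from the issue
)

# To actually send it (requires a reachable server):
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.status, resp.read().decode())
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching the server.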

The model named meta-llama/Meta-Llama-3.1-8B-Instruct is now running in a Kubernetes Pod with a single A100 GPU. After that, I used the lm-evaluation-harness framework (https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#model-apis-and-inference-servers) to benchmark the model:

!lm_eval --model local-completions \
    --tasks mmlu \
    --apply_chat_template \
    --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://address_to_model/v1/completions,num_concurrent=10,max_retries=3,tokenizer_backend=huggingface \
    --use_cache \
    --output_path path/to/output
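To reproduce a single benchmark request without the full harness, a payload like the following might be a starting point. This is a sketch, not what lm_eval is confirmed to send: the values `max_tokens=1`, `temperature=0.0`, `logprobs=1`, and `seed=1234` are taken from the `SamplingParams` in the server log below, while `echo=True` is an assumption about how the `prompt_logprobs=1` seen in the log is requested through the completions API. The function name is hypothetical.

```python
import json


def build_loglikelihood_payload(model: str, prompt: str, seed: int = 1234) -> dict:
    """Build a /v1/completions body resembling the requests seen in the server log."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 1,      # from the logged SamplingParams
        "temperature": 0.0,   # from the logged SamplingParams
        "logprobs": 1,        # from the logged SamplingParams
        "echo": True,         # ASSUMPTION: asks the server to score the prompt tokens too
        "seed": seed,         # from the logged SamplingParams
    }


payload = build_loglikelihood_payload(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "Question text goes here.\nAnswer: B",  # placeholder prompt
)
body = json.dumps(payload)
```

Sending requests like this one at a time, and then with increasing concurrency, may help narrow down whether the crash depends on how many prompts are in flight.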


n a car\'s radiator, cooling the body to prevent rapid increases in core body temperature and promoting heat tolerance… Repeated sauna use acclimates the body to heat and optimizes the body\'s response to future exposures, likely due to a biological phenomenon known as hormesis, a compensatory defense response following exposure to a mild stressor that is disproportionate to the magnitude of the stressor. Hormesis triggers a vast array of protective mechanisms that not only repair cell damage but also provide protection from subsequent exposures to more devastating stressors… The physiological responses to sauna use are remarkably similar to those experienced during moderate- to vigorous-intensity exercise. In fact, sauna use has been proposed as an alternative to exercise for people who are unable to engage in physical activity due to chronic disease or physical limitations.[13]\n\nBased on the article, what would be an important thing for a person to do after sauna use?\nA. Shower in cold water.\nB. Exercise.\nC. Eat a meal.\nD. 
Replenish fluids with filtered water.\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n D', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=1234, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=1, prompt_logprobs=1, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 791, 2768, 527, 5361, 5873, 4860, 320, 4291, 11503, 8, 922, 7926, 16088, 13, 128009, 128006, 882, 128007, 271, 53379, 8733, 1005, 11, 7170, 14183, 311, 439, 330, 9258, 8733, 73509, 1359, 374, 32971, 555, 2875, 9860, 28979, 14675, 311, 14560, 8798, 13, 1115, 14675, 658, 51650, 23900, 17508, 700, 91299, 1389, 459, 5376, 304, 279, 2547, 596, 6332, 9499, 1389, 430, 90974, 264, 30945, 461, 70, 38220, 2077, 16239, 18247, 408, 78738, 11, 41713, 11, 323, 9693, 3565, 4744, 96978, 24717, 430, 990, 3871, 311, 15301, 2162, 537, 10949, 323, 3044, 279, 2547, 369, 3938, 8798, 8631, 1105, 1981, 763, 3293, 11026, 11, 47958, 73509, 706, 22763, 439, 264, 3445, 311, 5376, 61961, 323, 7417, 8244, 2890, 11, 3196, 389, 29722, 828, 505, 90380, 11, 958, 44322, 11, 323, 7852, 4633, 7978, 13, 5046, 4040, 2802, 527, 279, 14955, 505, 7978, 315, 13324, 304, 279, 33479, 454, 822, 2209, 2464, 292, 18449, 31974, 32388, 38829, 320, 82071, 19694, 8, 19723, 11, 459, 14529, 33547, 7187, 6108, 41944, 4007, 315, 2890, 20124, 304, 810, 1109, 220, 17, 11, 3101, 6278, 57859, 3026, 505, 24024, 37355, 11, 902, 11054, 3831, 7902, 1990, 47958, 1005, 323, 11293, 4648, 323, 8624, 1981, 578, 735, 40, 19694, 14955, 8710, 430, 3026, 889, 1511, 279, 47958, 1403, 311, 2380, 3115, 824, 2046, 
1051, 220, 1544, 3346, 2753, 4461, 311, 2815, 505, 41713, 14228, 11384, 1109, 3026, 889, 3287, 956, 1005, 279, 47958, 8032, 17, 60, 24296, 11, 279, 7720, 814, 10534, 1051, 1766, 311, 387, 19660, 43918, 25, 11258, 889, 1511, 279, 47958, 17715, 11157, 439, 3629, 11, 922, 3116, 311, 8254, 3115, 824, 2046, 11, 10534, 17715, 11157, 279, 7720, 1389, 323, 1051, 220, 1135, 3346, 2753, 4461, 311, 2815, 505, 41713, 14228, 11384, 8032, 17, 60, 763, 5369, 11, 21420, 47958, 3932, 1051, 1766, 311, 387, 220, 1272, 3346, 2753, 4461, 311, 2815, 505, 682, 11384, 315, 42227, 4648, 13, 4314, 14955, 5762, 837, 1524, 994, 13126, 4325, 11, 5820, 5990, 11, 323, 19433, 9547, 430, 2643, 617, 28160, 279, 3026, 596, 2890, 8032, 17, 60, 1131, 578, 735, 40, 19694, 1101, 10675, 430, 21420, 47958, 1005, 11293, 279, 5326, 315, 11469, 52857, 323, 44531, 596, 8624, 304, 264, 19660, 43918, 11827, 13, 11258, 889, 1511, 279, 47958, 1403, 311, 2380, 3115, 824, 2046, 1047, 264, 220, 2287, 3346, 4827, 5326, 315, 11469, 52857, 323, 264, 220, 2397, 3346, 4827, 5326, 315, 11469, 44531, 596, 8624, 11, 7863, 311, 3026, 889, 1511, 279, 47958, 1193, 832, 892, 824, 2046, 1981, 578, 2890, 7720, 5938, 449, 47958, 1005, 11838, 311, 1023, 13878, 315, 10723, 2890, 11, 439, 1664, 13, 11258, 24435, 304, 279, 735, 40, 19694, 4007, 889, 1511, 279, 47958, 3116, 311, 8254, 3115, 824, 2046, 1051, 220, 2813, 3346, 2753, 4461, 311, 2274, 94241, 24673, 11, 15851, 315, 279, 3026, 596, 34625, 26870, 11, 80431, 2704, 11, 7106, 5820, 11, 323, 47288, 2704, 320, 300, 17303, 555, 356, 31696, 535, 13128, 8, 1981, 849, 12313, 311, 1579, 9499, 59623, 279, 2547, 11, 95360, 5977, 264, 11295, 11, 22514, 2077, 13, 578, 6930, 323, 6332, 2547, 20472, 5376, 88101, 11, 323, 81366, 4675, 1157, 13, 578, 6930, 77662, 1176, 11, 16448, 311, 220, 1272, 32037, 320, 6849, 59572, 705, 323, 1243, 4442, 304, 6332, 2547, 9499, 12446, 11, 16448, 14297, 505, 220, 1806, 32037, 320, 3264, 13, 21, 59572, 11, 477, 4725, 8, 311, 220, 1987, 32037, 320, 1041, 13, 
19, 59572, 8, 323, 1243, 19019, 7859, 311, 220, 2137, 32037, 320, 4278, 13, 17, 59572, 8, 1981, 220, 6938, 18029, 2612, 11, 264, 6767, 315, 279, 3392, 315, 990, 279, 4851, 27772, 304, 2077, 311, 279, 2547, 596, 1205, 369, 24463, 11, 12992, 555, 220, 1399, 311, 220, 2031, 3346, 11, 1418, 279, 4851, 4478, 320, 1820, 1396, 315, 34427, 824, 9568, 8, 12992, 323, 279, 12943, 8286, 320, 1820, 3392, 315, 6680, 62454, 8, 8625, 35957, 8032, 20, 60, 12220, 420, 892, 11, 13489, 220, 1135, 311, 220, 2031, 3346, 315, 279, 2547, 596, 6680, 6530, 374, 74494, 505, 279, 6332, 311, 279, 6930, 311, 28696, 81366, 13, 578, 5578, 1732, 33291, 13489, 220, 15, 13, 20, 21647, 315, 28566, 1418, 47958, 73509, 8032, 806, 60, 6515, 1088, 8798, 14675, 1101, 90974, 264, 41658, 5376, 304, 8244, 32426, 8286, 311, 50460, 279, 18979, 304, 6332, 6680, 8286, 13, 1115, 5376, 304, 32426, 8286, 539, 1193, 5825, 264, 21137, 2592, 315, 15962, 369, 81366, 11, 719, 433, 1101, 14385, 1093, 279, 3090, 304, 264, 1841, 596, 78190, 11, 28015, 279, 2547, 311, 5471, 11295, 12992, 304, 6332, 2547, 9499, 323, 22923, 8798, 25065, 1981, 1050, 43054, 47958, 1005, 1645, 566, 48571, 279, 2547, 311, 8798, 323, 7706, 4861, 279, 2547, 596, 2077, 311, 3938, 70530, 11, 4461, 4245, 311, 264, 24156, 25885, 3967, 439, 21548, 14093, 11, 264, 14573, 5382, 9232, 2077, 2768, 14675, 311, 264, 23900, 8631, 269, 430, 374, 80153, 311, 279, 26703, 315, 279, 8631, 269, 13, 92208, 14093, 31854, 264, 13057, 1358, 315, 29219, 24717, 430, 539, 1193, 13023, 2849, 5674, 719, 1101, 3493, 9313, 505, 17876, 70530, 311, 810, 33318, 8631, 1105, 1981, 578, 53194, 14847, 311, 47958, 1005, 527, 49723, 4528, 311, 1884, 10534, 2391, 24070, 12, 311, 71920, 20653, 8127, 10368, 13, 763, 2144, 11, 47958, 1005, 706, 1027, 11223, 439, 459, 10778, 311, 10368, 369, 1274, 889, 527, 12153, 311, 16988, 304, 7106, 5820, 4245, 311, 21249, 8624, 477, 7106, 9669, 8032, 1032, 2595, 29815, 389, 279, 4652, 11, 1148, 1053, 387, 459, 3062, 3245, 369, 264, 1732, 311, 656, 
1306, 47958, 1005, 5380, 32, 13, 48471, 304, 9439, 3090, 627, 33, 13, 33918, 627, 34, 13, 45614, 264, 15496, 627, 35, 13, 1050, 87635, 819, 56406, 449, 18797, 3090, 627, 16533, 25, 128009, 128006, 78191, 128007, 271, 423], lora_request: LoRARequest(lora_name='meta-llama/Meta-Llama-3.1-8B-Instruct', lora_int_id=1, lora_path='here_is_path_to_lora', lora_local_path=None, long_lora_max_len=None), prompt_adapter_request: None.
INFO 09-18 02:18:58 logger.py:36] Received request cmpl-215927f7b38d4107bf9fef896613dadb-0: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nThe following are multiple choice questions (with answers) about college medicine.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSauna use, sometimes referred to as "sauna bathing," is characterized by short-term passive exposure to extreme heat. This exposure elicits mild hyperthermia – an increase in the body\'s core temperature – that induces a thermoregulatory response involving neuroendocrine, cardiovascular, and cytoprotective mechanisms that work together to restore homeostasis and condition the body for future heat stressors… In recent decades, sauna bathing has emerged as a means to increase lifespan and improve overall health, based on compelling data from observational, interventional, and mechanistic studies. Of particular interest are the findings from studies of participants in the Kuopio Ischemic Heart Disease Risk Factor (KIHD) Study, an ongoing prospective population-based cohort study of health outcomes in more than 2,300 middle-aged men from eastern Finland, which identified strong links between sauna use and reduced death and disease… The KIHD findings showed that men who used the sauna two to three times per week were 27 percent less likely to die from cardiovascular-related causes than men who didn\'t use the sauna.[2] Furthermore, the benefits they experienced were found to be dose-dependent: Men who used the sauna roughly twice as often, about four to seven times per week, experienced roughly twice the benefits – and were 50 percent less likely to die from cardiovascular-related causes.[2] In addition, frequent sauna users were found to be 40 percent less likely to die from all causes of premature death. 
These findings held true even when considering age, activity levels, and lifestyle factors that might have influenced the men\'s health.[2]... The KIHD also revealed that frequent sauna use reduced the risk of developing dementia and Alzheimer\'s disease in a dose-dependent manner. Men who used the sauna two to three times per week had a 66 percent lower risk of developing dementia and a 65 percent lower risk of developing Alzheimer\'s disease, compared to men who used the sauna only one time per week… The health benefits associated with sauna use extended to other aspects of mental health, as well. Men participating in the KIHD study who used the sauna four to seven times per week were 77 percent less likely to develop psychotic disorders, regardless of the men\'s dietary habits, socioeconomic status, physical activity, and inflammatory status (as measured by C-reactive protein)…Exposure to high temperature stresses the body, eliciting a rapid, robust response. The skin and core body temperatures increase markedly, and sweating ensues. The skin heats first, rising to 40°C (104°F), and then changes in core body temperature occur, rising slowly from 37°C (98.6°F, or normal) to 38°C (100.4°F) and then rapidly increasing to 39°C (102.2°F)…  Cardiac output, a measure of the amount of work the heart performs in response to the body\'s need for oxygen, increases by 60 to 70 percent, while the heart rate (the number of beats per minute) increases and the stroke volume (the amount of blood pumped) remains unchanged.[5] During this time, approximately 50 to 70 percent of the body\'s blood flow is redistributed from the core to the skin to facilitate sweating. The average person loses approximately 0.5 kg of sweat while sauna bathing.[11] Acute heat exposure also induces a transient increase in overall plasma volume to mitigate the decrease in core blood volume. 
This increase in plasma volume not only provides a reserve source of fluid for sweating, but it also acts like the water in a car\'s radiator, cooling the body to prevent rapid increases in core body temperature and promoting heat tolerance… Repeated sauna use acclimates the body to heat and optimizes the body\'s response to future exposures, likely due to a biological phenomenon known as hormesis, a compensatory defense response following exposure to a mild stressor that is disproportionate to the magnitude of the stressor. Hormesis triggers a vast array of protective mechanisms that not only repair cell damage but also provide protection from subsequent exposures to more devastating stressors… The physiological responses to sauna use are remarkably similar to those experienced during moderate- to vigorous-intensity exercise. In fact, sauna use has been proposed as an alternative to exercise for people who are unable to engage in physical activity due to chronic disease or physical limitations.[13]\n\nBased on the article, what would be an important thing for a person to do after sauna use?\nA. Shower in cold water.\nB. Exercise.\nC. Eat a meal.\nD. 
Replenish fluids with filtered water.\nAnswer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n B', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=1234, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=1, prompt_logprobs=1, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 791, 2768, 527, 5361, 5873, 4860, 320, 4291, 11503, 8, 922, 7926, 16088, 13, 128009, 128006, 882, 128007, 271, 53379, 8733, 1005, 11, 7170, 14183, 311, 439, 330, 9258, 8733, 73509, 1359, 374, 32971, 555, 2875, 9860, 28979, 14675, 311, 14560, 8798, 13, 1115, 14675, 658, 51650, 23900, 17508, 700, 91299, 1389, 459, 5376, 304, 279, 2547, 596, 6332, 9499, 1389, 430, 90974, 264, 30945, 461, 70, 38220, 2077, 16239, 18247, 408, 78738, 11, 41713, 11, 323, 9693, 3565, 4744, 96978, 24717, 430, 990, 3871, 311, 15301, 2162, 537, 10949, 323, 3044, 279, 2547, 369, 3938, 8798, 8631, 1105, 1981, 763, 3293, 11026, 11, 47958, 73509, 706, 22763, 439, 264, 3445, 311, 5376, 61961, 323, 7417, 8244, 2890, 11, 3196, 389, 29722, 828, 505, 90380, 11, 958, 44322, 11, 323, 7852, 4633, 7978, 13, 5046, 4040, 2802, 527, 279, 14955, 505, 7978, 315, 13324, 304, 279, 33479, 454, 822, 2209, 2464, 292, 18449, 31974, 32388, 38829, 320, 82071, 19694, 8, 19723, 11, 459, 14529, 33547, 7187, 6108, 41944, 4007, 315, 2890, 20124, 304, 810, 1109, 220, 17, 11, 3101, 6278, 57859, 3026, 505, 24024, 37355, 11, 902, 11054, 3831, 7902, 1990, 47958, 1005, 323, 11293, 4648, 323, 8624, 1981, 578, 735, 40, 19694, 14955, 8710, 430, 3026, 889, 1511, 279, 47958, 1403, 311, 2380, 3115, 824, 2046, 
1051, 220, 1544, 3346, 2753, 4461, 311, 2815, 505, 41713, 14228, 11384, 1109, 3026, 889, 3287, 956, 1005, 279, 47958, 8032, 17, 60, 24296, 11, 279, 7720, 814, 10534, 1051, 1766, 311, 387, 19660, 43918, 25, 11258, 889, 1511, 279, 47958, 17715, 11157, 439, 3629, 11, 922, 3116, 311, 8254, 3115, 824, 2046, 11, 10534, 17715, 11157, 279, 7720, 1389, 323, 1051, 220, 1135, 3346, 2753, 4461, 311, 2815, 505, 41713, 14228, 11384, 8032, 17, 60, 763, 5369, 11, 21420, 47958, 3932, 1051, 1766, 311, 387, 220, 1272, 3346, 2753, 4461, 311, 2815, 505, 682, 11384, 315, 42227, 4648, 13, 4314, 14955, 5762, 837, 1524, 994, 13126, 4325, 11, 5820, 5990, 11, 323, 19433, 9547, 430, 2643, 617, 28160, 279, 3026, 596, 2890, 8032, 17, 60, 1131, 578, 735, 40, 19694, 1101, 10675, 430, 21420, 47958, 1005, 11293, 279, 5326, 315, 11469, 52857, 323, 44531, 596, 8624, 304, 264, 19660, 43918, 11827, 13, 11258, 889, 1511, 279, 47958, 1403, 311, 2380, 3115, 824, 2046, 1047, 264, 220, 2287, 3346, 4827, 5326, 315, 11469, 52857, 323, 264, 220, 2397, 3346, 4827, 5326, 315, 11469, 44531, 596, 8624, 11, 7863, 311, 3026, 889, 1511, 279, 47958, 1193, 832, 892, 824, 2046, 1981, 578, 2890, 7720, 5938, 449, 47958, 1005, 11838, 311, 1023, 13878, 315, 10723, 2890, 11, 439, 1664, 13, 11258, 24435, 304, 279, 735, 40, 19694, 4007, 889, 1511, 279, 47958, 3116, 311, 8254, 3115, 824, 2046, 1051, 220, 2813, 3346, 2753, 4461, 311, 2274, 94241, 24673, 11, 15851, 315, 279, 3026, 596, 34625, 26870, 11, 80431, 2704, 11, 7106, 5820, 11, 323, 47288, 2704, 320, 300, 17303, 555, 356, 31696, 535, 13128, 8, 1981, 849, 12313, 311, 1579, 9499, 59623, 279, 2547, 11, 95360, 5977, 264, 11295, 11, 22514, 2077, 13, 578, 6930, 323, 6332, 2547, 20472, 5376, 88101, 11, 323, 81366, 4675, 1157, 13, 578, 6930, 77662, 1176, 11, 16448, 311, 220, 1272, 32037, 320, 6849, 59572, 705, 323, 1243, 4442, 304, 6332, 2547, 9499, 12446, 11, 16448, 14297, 505, 220, 1806, 32037, 320, 3264, 13, 21, 59572, 11, 477, 4725, 8, 311, 220, 1987, 32037, 320, 1041, 13, 
19, 59572, 8, 323, 1243, 19019, 7859, 311, 220, 2137, 32037, 320, 4278, 13, 17, 59572, 8, 1981, 220, 6938, 18029, 2612, 11, 264, 6767, 315, 279, 3392, 315, 990, 279, 4851, 27772, 304, 2077, 311, 279, 2547, 596, 1205, 369, 24463, 11, 12992, 555, 220, 1399, 311, 220, 2031, 3346, 11, 1418, 279, 4851, 4478, 320, 1820, 1396, 315, 34427, 824, 9568, 8, 12992, 323, 279, 12943, 8286, 320, 1820, 3392, 315, 6680, 62454, 8, 8625, 35957, 8032, 20, 60, 12220, 420, 892, 11, 13489, 220, 1135, 311, 220, 2031, 3346, 315, 279, 2547, 596, 6680, 6530, 374, 74494, 505, 279, 6332, 311, 279, 6930, 311, 28696, 81366, 13, 578, 5578, 1732, 33291, 13489, 220, 15, 13, 20, 21647, 315, 28566, 1418, 47958, 73509, 8032, 806, 60, 6515, 1088, 8798, 14675, 1101, 90974, 264, 41658, 5376, 304, 8244, 32426, 8286, 311, 50460, 279, 18979, 304, 6332, 6680, 8286, 13, 1115, 5376, 304, 32426, 8286, 539, 1193, 5825, 264, 21137, 2592, 315, 15962, 369, 81366, 11, 719, 433, 1101, 14385, 1093, 279, 3090, 304, 264, 1841, 596, 78190, 11, 28015, 279, 2547, 311, 5471, 11295, 12992, 304, 6332, 2547, 9499, 323, 22923, 8798, 25065, 1981, 1050, 43054, 47958, 1005, 1645, 566, 48571, 279, 2547, 311, 8798, 323, 7706, 4861, 279, 2547, 596, 2077, 311, 3938, 70530, 11, 4461, 4245, 311, 264, 24156, 25885, 3967, 439, 21548, 14093, 11, 264, 14573, 5382, 9232, 2077, 2768, 14675, 311, 264, 23900, 8631, 269, 430, 374, 80153, 311, 279, 26703, 315, 279, 8631, 269, 13, 92208, 14093, 31854, 264, 13057, 1358, 315, 29219, 24717, 430, 539, 1193, 13023, 2849, 5674, 719, 1101, 3493, 9313, 505, 17876, 70530, 311, 810, 33318, 8631, 1105, 1981, 578, 53194, 14847, 311, 47958, 1005, 527, 49723, 4528, 311, 1884, 10534, 2391, 24070, 12, 311, 71920, 20653, 8127, 10368, 13, 763, 2144, 11, 47958, 1005, 706, 1027, 11223, 439, 459, 10778, 311, 10368, 369, 1274, 889, 527, 12153, 311, 16988, 304, 7106, 5820, 4245, 311, 21249, 8624, 477, 7106, 9669, 8032, 1032, 2595, 29815, 389, 279, 4652, 11, 1148, 1053, 387, 459, 3062, 3245, 369, 264, 1732, 311, 656, 
1306, 47958, 1005, 5380, 32, 13, 48471, 304, 9439, 3090, 627, 33, 13, 33918, 627, 34, 13, 45614, 264, 15496, 627, 35, 13, 1050, 87635, 819, 56406, 449, 18797, 3090, 627, 16533, 25, 128009, 128006, 78191, 128007, 271, 426], lora_request: LoRARequest(lora_name='meta-llama/Meta-Llama-3.1-8B-Instruct', lora_int_id=1, lora_path='/here_is_path_to_lora', lora_local_path=None, long_lora_max_len=None), prompt_adapter_request: None.
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO 09-18 02:18:58 logger.py:36] Received request cmpl-f15b280099c04bf7b666b102190147d8-0: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nThe following are multiple choice questions (with answers) about high school world history.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThis question refers to the following information.\nSource 1:\n"You may well ask: "Why direct action? Why sit-ins, marches and so forth? Isn\'t negotiation a better path?" You are quite right in calling, for negotiation. Indeed, this is the very purpose of direct action. Nonviolent direct action seeks to create such a crisis and 

INFO:     Shutting down
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 02:18:58 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
INFO: - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [8]
INFO 09-18 02:18:58 server.py:228] vLLM ZMQ RPC Server was interrupted.
Future exception was never retrieved
future: <Future finished exception=RuntimeError('LLMEngine should not be pickled!')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 112, in _wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1589, in execute_model
    output: SamplerOutput = self.model.sample(
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 466, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 273, in forward
    probs = torch.softmax(logits, dim=-1, dtype=torch.float)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.85 GiB. GPU 0 has a total capacity of 79.33 GiB of which 3.02 GiB is free. Process 308135 has 76.29 GiB memory in use. Of the allocated memory 75.34 GiB is allocated by PyTorch, with 31.38 MiB allocated in private pools (e.g., CUDA Graphs), and 79.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 859, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 733, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 673, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 340, in step_async
    outputs = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 185, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 125, in _wrapper
    pickle.dump(dumped_inputs, filep)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 563, in __reduce__
    raise RuntimeError("LLMEngine should not be pickled!")
RuntimeError: LLMEngine should not be pickled!

I would like to know whether this is a bug in vLLM, where requests are not queued and end up overloading the vLLM server, or whether the error comes from somewhere else. Could anyone help me with this case? Thank you very much in advance!

xiaobo-Chen commented 5 days ago

I hit the same error when I use lm-eval to evaluate LLMs. Have you solved it? My command and output are as follows: lm-eval --model vllm --model_args pretrained=/home/T3090U1/CZ/model/Qwen1.5-7B-Chat/,dtype=auto,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,max_model_len=4096 --tasks=leaderboard --batch_size=auto --output_path=/home/T3090U1/CZ/work3/output

error:

rank0: Traceback (most recent call last):
rank0:   File "/home/T3090U1/anaconda3/envs/work3/bin/lm-eval", line 8, in
rank0:   File "/home/T3090U1/CZ/work3/lm_eval/__main__.py", line 369, in cli_evaluate
rank0:     results = evaluator.simple_evaluate(
rank0:   File "/home/T3090U1/CZ/work3/lm_eval/utils.py", line 395, in _wrapper
rank0:     return fn(*args, **kwargs)
rank0:   File "/home/T3090U1/CZ/work3/lm_eval/evaluator.py", line 277, in simple_evaluate
rank0:     results = evaluate(
rank0:   File "/home/T3090U1/CZ/work3/lm_eval/utils.py", line 395, in _wrapper
rank0:     return fn(*args, **kwargs)
rank0:   File "/home/T3090U1/CZ/work3/lm_eval/evaluator.py", line 444, in evaluate
rank0:     resps = getattr(lm, reqtype)(cloned_reqs)
rank0:   File "/home/T3090U1/CZ/work3/lm_eval/api/model.py", line 370, in loglikelihood
rank0:     return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
rank0:   File "/home/T3090U1/CZ/work3/lm_eval/models/vllm_causallms.py", line 415, in _loglikelihood_tokens
rank0:     outputs = self._model_generate(requests=inputs, generate=False)
rank0:   File "/home/T3090U1/CZ/work3/lm_eval/models/vllm_causallms.py", line 248, in _model_generate
rank0:     outputs = self.model.generate(
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/utils.py", line 1036, in inner
rank0:     return fn(*args, **kwargs)
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 348, in generate
rank0:     outputs = self._run_engine(use_tqdm=use_tqdm)
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 715, in _run_engine
rank0:     step_outputs = self.llm_engine.step()
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 1223, in step
rank0:     outputs = self.model_executor.execute_model(
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 78, in execute_model
rank0:     driver_outputs = self._driver_execute_model(execute_model_req)
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 162, in _driver_execute_model
rank0:     return self.driver_worker.execute_model(execute_model_req)
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
rank0:     output = self.model_runner.execute_model(
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/worker/model_runner_base.py", line 125, in _wrapper
rank0:     pickle.dump(dumped_inputs, filep)
rank0:   File "/home/T3090U1/anaconda3/envs/work3/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 563, in __reduce__
rank0:     raise RuntimeError("LLMEngine should not be pickled!")
rank0: RuntimeError: LLMEngine should not be pickled!
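
Both tracebacks in this thread end in the same two frames: `pickle.dump(dumped_inputs, filep)` in `model_runner_base.py`, followed by `__reduce__` in `llm_engine.py` raising the `RuntimeError`. The mechanism can be reproduced without vLLM. Below is a minimal sketch, assuming only the standard library; `Engine` and `dump_inputs` are hypothetical stand-ins for illustration, not vLLM code. It shows that `pickle.dump` fails as soon as anything reachable from the dumped object defines a `__reduce__` that raises. Whether the dump in `_wrapper` is itself reacting to an earlier failure cannot be told from these tracebacks alone.

```python
import io
import pickle


class Engine:
    """Hypothetical stand-in for a class that refuses to be pickled."""

    def __reduce__(self):
        # pickle consults __reduce__ when serializing an instance,
        # so raising here aborts the whole dump.
        raise RuntimeError("Engine should not be pickled!")


def dump_inputs(inputs):
    """Serialize inputs to an in-memory buffer and return the raw bytes."""
    buffer = io.BytesIO()
    pickle.dump(inputs, buffer)
    return buffer.getvalue()


# Plain data pickles without trouble.
plain = dump_inputs({"prompt": "hello", "max_tokens": 16})

# The same kind of data fails once it references an Engine anywhere inside.
try:
    dump_inputs({"prompt": "hello", "owner": Engine()})
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)
```

If the same thing is happening here, the useful question is which field of the dumped inputs holds a reference back to the engine, since the `RuntimeError` reports only that pickling was attempted, not what triggered the dump.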