vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Can't load the model anymore #2521

Closed tom-doerr closed 10 months ago

tom-doerr commented 10 months ago

I can't load my model anymore, no matter which parameters I use when loading it.

import pprint
import random

import ray
from transformers import AutoTokenizer
from vllm import LLM

model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

pp = pprint.PrettyPrinter(indent=2)
pprint = pp.pprint
llm = None

def load_model():
    global llm
    del llm
    if ray.is_initialized():
        ray.shutdown()
    # Pick quantization and dtype from the checkpoint name.
    if 'AWQ' in model_name_or_path:
        quantization = "awq"
        dtype = "auto"
    elif 'GPTQ' in model_name_or_path:
        quantization = "gptq"
        dtype = "float16"
    else:
        quantization = None
        dtype = "auto"
    llm = LLM(model=model_name_or_path, quantization=quantization, dtype=dtype,
            # Also tried tensor_parallel_size=1.
            tensor_parallel_size=2,
            # Tried gpu_memory_utilization values 0.5, 0.6, 0.8 and 0.9 as well.
            gpu_memory_utilization=random.uniform(0.4, 1.0),
            # Tried swap_space values 10, 20 and 40 as well.
            swap_space=int(random.uniform(1, 50)),
            )
    print('Model loaded')

load_model()
nvidia-smi
Sat Jan 20 19:55:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:00:10.0 Off |                    0 |
| N/A   28C    P0              33W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:00:11.0 Off |                    0 |
| N/A   29C    P0              38W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     On  | 00000000:00:1B.0 Off |                    0 |
|  0%   63C    P0             249W / 300W |  10805MiB / 46068MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    2   N/A  N/A      2367      C   python3                                    8798MiB |
|    2   N/A  N/A     26564      C   .../.pyenv/versions/3.10.13/bin/python     1632MiB |
|    2   N/A  N/A     94181      C   python                                      352MiB |
+---------------------------------------------------------------------------------------+
WARNING 01-20 19:50:11 config.py:457] Casting torch.bfloat16 to torch.float16.                                                           
WARNING 01-20 19:50:11 config.py:175] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-20 19:50:13,358 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
INFO 01-20 19:50:14 llm_engine.py:70] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=gptq, enforce_eager=False, seed=0)
INFO 01-20 19:50:38 llm_engine.py:275] # GPU blocks: 5959, # CPU blocks: 33792                                                                                                                                                                                                    
INFO 01-20 19:51:03 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-20 19:51:03 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
(RayWorkerVllm pid=117639) INFO 01-20 19:51:03 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=117639) INFO 01-20 19:51:03 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
[E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: operation not permitted when stream is capturing        
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                                                                                                           
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                   
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                       

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc28c992617 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libc10.so)                                                                  
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc28c94d98d in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc2985b09f8 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fc2184b3af0 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fc2184b7918 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x7fc2184ce15b in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7fc2184ce468 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fc25ccb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fc33c441ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7fc33c4d3850 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-01-20 19:51:06,991 E 110448 121874] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                                                                                                           
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                   
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                       
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc28c992617 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc28c94d98d in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc2985b09f8 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fc2184b3af0 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fc2184b7918 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x7fc2184ce15b in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7fc2184ce468 in /home/conic/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fc25ccb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fc33c441ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7fc33c4d3850 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-01-20 19:51:07,000 E 110448 121874] logging.cc:104: Stack trace:                    
 /home/conic/.local/lib/python3.10/site-packages/ray/_raylet.so(+0xfebb5a) [0x7fc1e9635b5a] ray::operator<<()                            
/home/conic/.local/lib/python3.10/site-packages/ray/_raylet.so(+0xfee298) [0x7fc1e9638298] ray::TerminateHandler()
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fc25cc8220c]                                                                          
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fc25cc82277]                                                                          
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7fc25cc821fe]                                                                          
/home/conic/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xc86dc5) [0x7fc218239dc5] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc25ccb0253]                           
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc33c441ac3]                                
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fc33c4d3850]                                                                                                                                                                                                                       

*** SIGABRT received at time=1705780266 on cpu 30 ***                                                                                                                                                                                                                             
PC: @     0x7fc33c4439fc  (unknown)  pthread_kill                                                                                                                                                                                                                                 
    @     0x7fc33c3ef520  (unknown)  (unknown)                                                                                                                                                                                                                                    
[2024-01-20 19:51:07,000 E 110448 121874] logging.cc:361: *** SIGABRT received at time=1705780266 on cpu 30 ***                          
[2024-01-20 19:51:07,001 E 110448 121874] logging.cc:361: PC: @     0x7fc33c4439fc  (unknown)  pthread_kill                              
[2024-01-20 19:51:07,001 E 110448 121874] logging.cc:361:     @     0x7fc33c3ef520  (unknown)  (unknown)                              
Fatal Python error: Aborted                                                                                                              

Extension modules: zstandard.backend_c, simplejson._speedups, charset_normalizer.md, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, yaml._yaml, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, grpc._cython.cygrpc, pyarrow.lib, pyarrow._hdfsio, pyarrow._json (total: 66)
./main.sh: line 29: 110448 Aborted                 (core dumped) ./reddit_main.py                                                        
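
The crash happens while the workers are capturing CUDA graphs. The model_runner log above notes that graph capture can be skipped by enforcing eager mode; a minimal sketch of that variant (same model and tensor parallelism as in the script above, with only enforce_eager added) would look like:

from vllm import LLM

# Sketch only: same load as in the script above, but with CUDA graph capture
# disabled via enforce_eager=True, as suggested by the model_runner message.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype="float16",
    tensor_parallel_size=2,
    enforce_eager=True,
)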
tom-doerr commented 10 months ago

Managed to start it with

gpu_memory_utilization = 0.6219178764397234
swap_space = 27

Not sure whether the parameters actually matter or whether it just randomly worked again (which I think is more likely). It might also be an issue with loading from disk; my ZFS file system is showing issues with multiple drives.
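
For reference, a sketch of the load call with the values that happened to work, assuming the same model, quantization, and tensor_parallel_size=2 as in the script above:

from vllm import LLM

# Sketch with the exact values that happened to work in this run; they are
# listed for reference, not as a recommended configuration.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype="float16",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.6219178764397234,
    swap_space=27,
)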

flexwang commented 10 months ago

We encountered the exact same issue.