vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: torch.OutOfMemoryError: CUDA out of memory #7721

Closed Sandwiches97 closed 2 months ago

Sandwiches97 commented 2 months ago

Your current environment

vllm==0.5.4
GPU: L20, 46 GB memory
Model: MiniCPM-V2.6, from: https://huggingface.co/openbmb/MiniCPM-V-2_6

```text
Package Version
aiohappyeyeballs 2.3.7
aiohttp 3.10.4
aiosignal 1.3.1
anaconda-anon-usage 0.4.4
annotated-types 0.7.0
anyio 4.4.0
archspec 0.2.3
asttokens 2.0.5
astunparse 1.6.3
attrs 23.1.0
beautifulsoup4 4.12.3
boltons 23.0.0
Brotli 1.0.9
certifi 2024.7.4
cffi 1.16.0
chardet 4.0.0
charset-normalizer 2.0.4
click 8.1.7
cloudpickle 3.0.0
cmake 3.30.2
conda 24.5.0
conda-build 24.5.1
conda-content-trust 0.2.0
conda_index 0.5.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
cryptography 42.0.5
datasets 2.21.0
decorator 5.1.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.6.1
executing 0.8.3
expecttest 0.2.1
fastapi 0.112.1
filelock 3.13.1
frozendict 2.4.2
frozenlist 1.4.1
fsspec 2024.6.1
gmpy2 2.1.2
h11 0.14.0
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.24.5
hypothesis 6.108.4
idna 3.7
interegular 0.3.3
ipython 8.25.0
jedi 0.19.1
Jinja2 3.1.4
jiter 0.5.0
jsonpatch 1.33
jsonpointer 2.1
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
lark 1.2.2
libarchive-c 2.9
libmambapy 1.5.8
lintrunner 0.12.5
llvmlite 0.43.0
lm-format-enforcer 0.10.3
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
menuinst 2.1.1
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
more-itertools 10.1.0
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
numba 0.60.0
numpy 1.26.4
nvidia-ml-py 12.560.30
openai 1.41.0
opencv-python 4.10.0.84
optree 0.12.1
outlines 0.0.46
packaging 24.1
pandas 2.2.2
parso 0.8.3
pexpect 4.8.0
Pillow 10.1.0
pip 24.0
pkginfo 1.10.0
platformdirs 3.10.0
pluggy 1.0.0
prometheus_client 0.20.0
prometheus-fastapi-instrumentator 7.0.0
prompt-toolkit 3.0.43
protobuf 5.27.3
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 17.0.0
pycosat 0.6.6
pycountry 24.6.1
pycparser 2.21
pydantic 2.8.2
pydantic_core 2.20.1
Pygments 2.15.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-etcd 0.4.5
pytz 2024.1
PyYAML 6.0.1
pyzmq 26.1.1
ray 2.34.0
referencing 0.30.2
regex 2024.7.24
requests 2.32.3
rpds-py 0.10.6
ruamel.yaml 0.17.21
safetensors 0.4.4
sentencepiece 0.1.99
setuptools 69.5.1
six 1.16.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
stack-data 0.2.0
starlette 0.38.2
starlette_exporter 0.23.0
sympy 1.13.1
tiktoken 0.7.0
timm 0.9.10
tokenizers 0.19.1
torch 2.4.0
torchaudio 2.4.0
torchelastic 0.2.2
torchvision 0.19.0
tqdm 4.66.4
traitlets 5.14.3
transformers 4.44.0
triton 3.0.0
truststore 0.8.0
types-dataclasses 0.6.6
typing_extensions 4.11.0
tzdata 2024.1
urllib3 2.2.2
uvicorn 0.30.6
uvloop 0.20.0
vllm 0.5.4
vllm-flash-attn 2.6.1
watchfiles 0.23.0
wcwidth 0.2.5
websockets 12.0
wheel 0.43.0
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.9.4
zstandard 0.22.0
```

[attached screenshot: img_v3_02du_9ada6e58-9987-48f4-951c-267a26b083dg]

🐛 Describe the bug

My Python script is as follows:

from vllm import LLM, SamplingParams
from transformers import AutoModel, AutoTokenizer
import os
import torch
from PIL import Image

torch.cuda.empty_cache()

model_name = "./MiniCPM-V-2_6/"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_seq_len_to_capture=2048,
    enforce_eager=False,
)

After running the code, I get an OOM error:

2024-08-21 07:23:29,967 - root - INFO - I'm a message
2024-08-21 07:23:33,518 - datasets - INFO - PyTorch version 2.4.0 available.
2024-08-21 07:23:33,712 - root - INFO - app
2024-08-21 07:23:34,290 - transformers_modules.v1.configuration_minicpm - INFO - vision_config is None, using default vision config
INFO 08-21 07:23:34 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/opt/apps/models/miniCPM-v2.6/v1', speculative_config=None, tokenizer='/opt/apps/models/miniCPM-v2.6/v1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/opt/apps/models/miniCPM-v2.6/v1, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-21 07:23:34 model_runner.py:720] Starting to load model /opt/apps/models/miniCPM-v2.6/v1...
/opt/conda/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.07it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.52it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.30it/s]

INFO 08-21 07:23:38 model_runner.py:732] Loading model weights took 15.1930 GB
/opt/conda/lib/python3.11/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/apps/algo-kefu-vqa-server/app.py", line 22, in <module>
[rank0]:     vqa_model = VQA(model_name)
[rank0]:                 ^^^^^^^^^^^^^^^
[rank0]:   File "/opt/apps/algo-kefu-vqa-server/src/vllm_api.py", line 24, in __init__
[rank0]:     self.llm = LLM(
[rank0]:                ^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 158, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 445, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 940, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1363, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:                                     ^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/minicpmv.py", line 624, in forward
[rank0]:     vlm_embeddings, _ = self.get_embedding(input_ids, image_inputs)
[rank0]:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/minicpmv.py", line 530, in get_embedding
[rank0]:     vision_hidden_states = self.get_vision_hidden_states(image_inputs)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/minicpmv.py", line 980, in get_vision_hidden_states
[rank0]:     vision_embedding = self.vpm(
[rank0]:                        ^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/na_vit.py", line 785, in forward
[rank0]:     encoder_outputs = self.encoder(
[rank0]:                       ^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/na_vit.py", line 686, in forward
[rank0]:     layer_outputs = encoder_layer(
[rank0]:                     ^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/na_vit.py", line 585, in forward
[rank0]:     hidden_states, attn_weights = self.self_attn(
[rank0]:                                   ^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/na_vit.py", line 347, in forward
[rank0]:     attn_weights = nn.functional.softmax(attn_weights,
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/functional.py", line 1890, in softmax
[rank0]:     ret = input.softmax(dim, dtype=dtype)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 44.53 GiB of which 187.94 MiB is free. Process 3882360 has 44.34 GiB memory in use. Of the allocated memory 43.67 GiB is allocated by PyTorch, and 335.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I think my GPU memory is enough to load this model: when I use the command below,

vllm serve ./miniCPM-v2.6/ --dtype auto \
        --max-model-len 2048 \
        --gpu_memory_utilization 0.9 \
        --host 0.0.0.0  --port 8002 \
        --tensor-parallel-size 1 \
        --trust-remote-code

the vLLM server starts normally.

But I want to use the Python script. Is there a bug somewhere? What should I do? Thanks!

jeejeelee commented 2 months ago

Try reducing `gpu_memory_utilization`.
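
For example, something like this minimal sketch (untested suggestion; the values are illustrative, and `max_model_len=2048` mirrors the `vllm serve` command that works for you):

```python
from vllm import LLM

# Sketch: cap the context length (as the working CLI run does with
# --max-model-len 2048) and reserve a smaller fraction of GPU memory.
llm = LLM(
    model="./MiniCPM-V-2_6/",
    trust_remote_code=True,
    tensor_parallel_size=1,
    max_model_len=2048,          # the CLI run uses --max-model-len 2048
    gpu_memory_utilization=0.7,  # lowered from 0.9
    enforce_eager=True,          # optional: skip CUDA graph capture to save 1-3 GiB
)
```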

Sandwiches97 commented 2 months ago

Try reducing `gpu_memory_utilization`.

After changing `gpu_memory_utilization` to 0.8:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 44.53 GiB of which 187.94 MiB is free. Process 3909982 has 44.34 GiB memory in use. Of the allocated memory 43.67 GiB is allocated by PyTorch, and 335.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.

After changing `gpu_memory_utilization` to 0.6:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 44.53 GiB of which 187.94 MiB is free. Process 3910771 has 44.34 GiB memory in use. Of the allocated memory 43.67 GiB is allocated by PyTorch, and 335.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

After changing `gpu_memory_utilization` to 0.5:

 torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 44.53 GiB of which 187.94 MiB is free. Process 3914747 has 44.34 GiB memory in use. Of the allocated memory 43.67 GiB is allocated by PyTorch, and 335.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

It doesn't seem to work, and the error message is exactly the same every time.

Sandwiches97 commented 2 months ago

To supplement, here is some log information from the server started with the `vllm serve` command:

vllm serve ./miniCPM-v2.6/ --dtype auto \
        --max-model-len 2048 \
        --gpu_memory_utilization 0.9 \
        --host 0.0.0.0  --port 8002 \
        --tensor-parallel-size 1 \
        --trust-remote-code

Log output:

INFO 08-21 08:30:55 api_server.py:339] vLLM API server version 0.5.4
INFO 08-21 08:30:55 api_server.py:340] args: Namespace(model_tag='/opt/apps/models/miniCPM-v2.6/v1', host='0.0.0.0', port=8002, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/opt/apps/models/miniCPM-v2.6/v1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['dewu-vqa-chat'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fe279156b60>)
WARNING 08-21 08:30:56 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-21 08:30:56 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/opt/apps/models/miniCPM-v2.6/v1', speculative_config=None, tokenizer='/opt/apps/models/miniCPM-v2.6/v1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=dewu-vqa-chat, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-21 08:30:56 model_runner.py:720] Starting to load model /opt/apps/models/miniCPM-v2.6/v1...
/opt/conda/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.83it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.78it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.47it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.47it/s]

INFO 08-21 08:31:00 model_runner.py:732] Loading model weights took 15.1930 GB
/opt/conda/lib/python3.11/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
INFO 08-21 08:31:06 gpu_executor.py:102] # GPU blocks: 21054, # CPU blocks: 4681
INFO 08-21 08:31:11 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-21 08:31:11 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-21 08:31:21 model_runner.py:1225] Graph capturing finished in 10 secs.
WARNING 08-21 08:31:21 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-21 08:31:21 launcher.py:14] Available routes are:
INFO 08-21 08:31:21 launcher.py:22] Route: /openapi.json, Methods: GET, HEAD
INFO 08-21 08:31:21 launcher.py:22] Route: /docs, Methods: GET, HEAD
INFO 08-21 08:31:21 launcher.py:22] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-21 08:31:21 launcher.py:22] Route: /redoc, Methods: GET, HEAD
INFO 08-21 08:31:21 launcher.py:22] Route: /health, Methods: GET
INFO 08-21 08:31:21 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-21 08:31:21 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-21 08:31:21 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-21 08:31:21 launcher.py:22] Route: /version, Methods: GET
INFO 08-21 08:31:21 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-21 08:31:21 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-21 08:31:21 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [362]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
INFO 08-21 08:31:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
jeejeelee commented 2 months ago

Using the offline inference script you provided above, I can reproduce your OOM error. It appears that during `profile_run`, the forward pass of the `vpm` (the vision model) consumes too much GPU memory, reaching around 70 GB on my A800 device. That's crazy. Could you please take a look at this? @HwwwwwwwH
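
As a rough back-of-the-envelope check (this is my guess at how the profiler builds its dummy multimodal batch, with assumed vision-tower shapes, not a measured trace), the numbers are at least plausible:

```python
# Hypothetical estimate of the ViT attention-weight tensor during profile_run,
# assuming the dummy prompt is packed with image placeholders up to max_seq_len.
max_seq_len = 32768          # from the engine config in the log above
tokens_per_image = 64        # MiniCPM-V placeholder tokens per image
num_images = max_seq_len // tokens_per_image   # 512 dummy images

patches_per_image = 1024     # assumed ViT patch-sequence length per image
num_heads = 16               # assumed number of ViT attention heads
bytes_per_element = 2        # bf16

attn_bytes = num_images * num_heads * patches_per_image ** 2 * bytes_per_element
print(f"{attn_bytes / 2**30:.1f} GiB")  # 16.0 GiB
```

The vision-tower shapes here are guesses, but the order of magnitude matches the 16.00 GiB allocation in the traceback. Note that this happens inside `profile_run`, before the KV cache is allocated, which would also explain why lowering `gpu_memory_utilization` does not change the error.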

HwwwwwwwH commented 2 months ago

Emmm, that's because there are only 64 or 96 tokens per image in MiniCPM-V, so in `profile_run` there can be a very large number of images. This can be resolved in two ways: