Sandwiches97 closed this issue 3 months ago.
Try to reduce gpu_memory_utilization.

> Try to reduce gpu_memory_utilization

I changed gpu_memory_utilization to 0.8:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 44.53 GiB of which 187.94 MiB is free. Process 3909982 has 44.34 GiB memory in use. Of the allocated memory 43.67 GiB is allocated by PyTorch, and 335.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
I changed gpu_memory_utilization to 0.6:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 44.53 GiB of which 187.94 MiB is free. Process 3910771 has 44.34 GiB memory in use. Of the allocated memory 43.67 GiB is allocated by PyTorch, and 335.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I changed gpu_memory_utilization to 0.5:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 44.53 GiB of which 187.94 MiB is free. Process 3914747 has 44.34 GiB memory in use. Of the allocated memory 43.67 GiB is allocated by PyTorch, and 335.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
It doesn't seem to work, and the error message is exactly the same every time.
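For reference, when the OOM comes from a Python script rather than vllm serve, gpu_memory_utilization is passed to the LLM constructor; it only caps the memory vLLM budgets for weights and KV cache, so it would not be expected to shrink the peak of the profiling forward pass, which fits the identical errors above. A minimal sketch, with the local model path and the allocator setting from the error message treated as assumptions:

```python
import os

# The error message above suggests this allocator setting; it has to be in the
# environment before any CUDA allocation happens (assumption: the script sets it
# at the very top, before importing vllm/torch).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

from vllm import LLM

# Hypothetical local path; substitute the actual MiniCPM-V 2.6 checkpoint directory.
llm = LLM(
    model="./miniCPM-v2.6/",
    trust_remote_code=True,
    gpu_memory_utilization=0.8,  # fraction of GPU memory vLLM is allowed to plan for
)
```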
Here is some additional log output from the service started with the vllm serve command:
vllm serve ./miniCPM-v2.6/ --dtype auto \
--max-model-len 2048 \
--gpu_memory_utilization 0.9 \
--host 0.0.0.0 --port 8002 \
--tensor-parallel-size 1 \
--trust-remote-code
INFO 08-21 08:30:55 api_server.py:339] vLLM API server version 0.5.4
INFO 08-21 08:30:55 api_server.py:340] args: Namespace(model_tag='/opt/apps/models/miniCPM-v2.6/v1', host='0.0.0.0', port=8002, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/opt/apps/models/miniCPM-v2.6/v1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['dewu-vqa-chat'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fe279156b60>)
WARNING 08-21 08:30:56 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-21 08:30:56 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/opt/apps/models/miniCPM-v2.6/v1', speculative_config=None, tokenizer='/opt/apps/models/miniCPM-v2.6/v1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=dewu-vqa-chat, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-21 08:30:56 model_runner.py:720] Starting to load model /opt/apps/models/miniCPM-v2.6/v1...
/opt/conda/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 2.83it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.78it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:01<00:00, 1.47it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.47it/s]
INFO 08-21 08:31:00 model_runner.py:732] Loading model weights took 15.1930 GB
/opt/conda/lib/python3.11/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
INFO 08-21 08:31:06 gpu_executor.py:102] # GPU blocks: 21054, # CPU blocks: 4681
INFO 08-21 08:31:11 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-21 08:31:11 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-21 08:31:21 model_runner.py:1225] Graph capturing finished in 10 secs.
WARNING 08-21 08:31:21 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-21 08:31:21 launcher.py:14] Available routes are:
INFO 08-21 08:31:21 launcher.py:22] Route: /openapi.json, Methods: GET, HEAD
INFO 08-21 08:31:21 launcher.py:22] Route: /docs, Methods: GET, HEAD
INFO 08-21 08:31:21 launcher.py:22] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-21 08:31:21 launcher.py:22] Route: /redoc, Methods: GET, HEAD
INFO 08-21 08:31:21 launcher.py:22] Route: /health, Methods: GET
INFO 08-21 08:31:21 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-21 08:31:21 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-21 08:31:21 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-21 08:31:21 launcher.py:22] Route: /version, Methods: GET
INFO 08-21 08:31:21 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-21 08:31:21 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-21 08:31:21 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO: Started server process [362]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
INFO 08-21 08:31:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
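Since the route list above includes the OpenAI-compatible endpoints and the args dump shows served_model_name='dewu-vqa-chat' on port 8002, the running server can be exercised from Python with the openai package already installed in this environment. A minimal sketch, assuming the server from the log is reachable on localhost and that its chat template accepts OpenAI-style image_url parts for VQA:

```python
from openai import OpenAI

# Client pointed at the vllm serve instance from the log above.
# No api_key was configured on the server, so any placeholder value works.
client = OpenAI(base_url="http://localhost:8002/v1", api_key="EMPTY")

# The /v1/models route listed above should report the served model name.
print([m.id for m in client.models.list().data])

# Text-only request against /v1/chat/completions; for VQA, an OpenAI-style
# {"type": "image_url", ...} content part would be added to the message,
# assuming the server's chat template supports it for this model.
response = client.chat.completions.create(
    model="dewu-vqa-chat",  # served_model_name from the args dump above
    messages=[{"role": "user", "content": "Hello, can you see images?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```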
Using the offline_inference script you provided above, I can reproduce your OOM error. It appears that during the profile_run step, the forward pass of the vpm (vision module) consumes a huge amount of GPU memory (reaching around 70 GB on my A800). That seems excessive; could you please take a look at this? @HwwwwwwwH
Emmm, that's because there are only 64 or 96 tokens per image in MiniCPM-V, so during profile_run there can be a very large number of images. This can be resolved in two ways (a minimal sketch follows the list):

1. Reduce max_model_len (to 2048 or 4096).
2. Reduce max_num_seqs (to 32).
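Roughly speaking, the profiler fills the batch with dummy image placeholders, so the number of images pushed through the vision tower grows with both max_model_len and max_num_seqs, and either knob cuts the profiling workload. A minimal offline-inference sketch with the suggested values; the model path, prompt, and sampling settings are assumptions for illustration, not the reporter's original script:

```python
from vllm import LLM, SamplingParams

# Hypothetical local checkpoint path for MiniCPM-V 2.6.
llm = LLM(
    model="./miniCPM-v2.6/",
    trust_remote_code=True,
    max_model_len=2048,   # suggestion 1: shrink the context the profiler has to fill
    max_num_seqs=32,      # suggestion 2: fewer concurrent sequences during profiling
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

# Prompt/image handling is model-specific and omitted here; see the vLLM
# multimodal examples for MiniCPM-V's expected image placeholder format.
outputs = llm.generate("Describe the weather today.", sampling_params)
print(outputs[0].outputs[0].text)
```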
Your current environment
vllm==0.5.4; GPU: L20, 46 GB memory
```text
Package                           Version
--------------------------------- ------------
aiohappyeyeballs 2.3.7
aiohttp 3.10.4
aiosignal 1.3.1
anaconda-anon-usage 0.4.4
annotated-types 0.7.0
anyio 4.4.0
archspec 0.2.3
asttokens 2.0.5
astunparse 1.6.3
attrs 23.1.0
beautifulsoup4 4.12.3
boltons 23.0.0
Brotli 1.0.9
certifi 2024.7.4
cffi 1.16.0
chardet 4.0.0
charset-normalizer 2.0.4
click 8.1.7
cloudpickle 3.0.0
cmake 3.30.2
conda 24.5.0
conda-build 24.5.1
conda-content-trust 0.2.0
conda_index 0.5.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
cryptography 42.0.5
datasets 2.21.0
decorator 5.1.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.6.1
executing 0.8.3
expecttest 0.2.1
fastapi 0.112.1
filelock 3.13.1
frozendict 2.4.2
frozenlist 1.4.1
fsspec 2024.6.1
gmpy2 2.1.2
h11 0.14.0
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.24.5
hypothesis 6.108.4
idna 3.7
interegular 0.3.3
ipython 8.25.0
jedi 0.19.1
Jinja2 3.1.4
jiter 0.5.0
jsonpatch 1.33
jsonpointer 2.1
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
lark 1.2.2
libarchive-c 2.9
libmambapy 1.5.8
lintrunner 0.12.5
llvmlite 0.43.0
lm-format-enforcer 0.10.3
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
menuinst 2.1.1
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
more-itertools 10.1.0
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
numba 0.60.0
numpy 1.26.4
nvidia-ml-py 12.560.30
openai 1.41.0
opencv-python 4.10.0.84
optree 0.12.1
outlines 0.0.46
packaging 24.1
pandas 2.2.2
parso 0.8.3
pexpect 4.8.0
Pillow 10.1.0
pip 24.0
pkginfo 1.10.0
platformdirs 3.10.0
pluggy 1.0.0
prometheus_client 0.20.0
prometheus-fastapi-instrumentator 7.0.0
prompt-toolkit 3.0.43
protobuf 5.27.3
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 17.0.0
pycosat 0.6.6
pycountry 24.6.1
pycparser 2.21
pydantic 2.8.2
pydantic_core 2.20.1
Pygments 2.15.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-etcd 0.4.5
pytz 2024.1
PyYAML 6.0.1
pyzmq 26.1.1
ray 2.34.0
referencing 0.30.2
regex 2024.7.24
requests 2.32.3
rpds-py 0.10.6
ruamel.yaml 0.17.21
safetensors 0.4.4
sentencepiece 0.1.99
setuptools 69.5.1
six 1.16.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
stack-data 0.2.0
starlette 0.38.2
starlette_exporter 0.23.0
sympy 1.13.1
tiktoken 0.7.0
timm 0.9.10
tokenizers 0.19.1
torch 2.4.0
torchaudio 2.4.0
torchelastic 0.2.2
torchvision 0.19.0
tqdm 4.66.4
traitlets 5.14.3
transformers 4.44.0
triton 3.0.0
truststore 0.8.0
types-dataclasses 0.6.6
typing_extensions 4.11.0
tzdata 2024.1
urllib3 2.2.2
uvicorn 0.30.6
uvloop 0.20.0
vllm 0.5.4
vllm-flash-attn 2.6.1
watchfiles 0.23.0
wcwidth 0.2.5
websockets 12.0
wheel 0.43.0
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.9.4
zstandard 0.22.0

Model: MiniCPM-V2.6, from: https://huggingface.co/openbmb/MiniCPM-V-2_6
```

🐛 Describe the bug
My Python script is as follows:

After running the code, I get an OOM error.

I think my GPU memory is enough to load this model; when I use the command below, the vLLM server starts normally. But I want to use the Python script instead. Is there a bug in it? What should I do? Thanks!