Open AlphaINF opened 1 month ago
Indeed, you can see that llama.cpp loads and quantizes the model weights incrementally here:
This behavior is well documented at: https://docs.vllm.ai/en/latest/quantization/fp8.html .
Currently, we load the model at its original precision before quantizing down to 8 bits, so you need enough memory to load the whole model.
That said, processing the layers incrementally would be much preferable.
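To make the memory requirement concrete, here is a back-of-the-envelope sketch (the parameter counts are approximations, not figures from the vLLM docs): loading the weights at fp16 costs 2 bytes per parameter before FP8 quantization can even begin.

```python
# Rough arithmetic only -- assumed parameter count, not vLLM internals.
params = 72_000_000_000          # Qwen2-72B, approximate parameter count
fp16_bytes = 2 * params          # weights at original (fp16) precision
fp8_bytes = 1 * params           # weights after FP8 quantization

print(fp16_bytes / 1e9)  # 144.0 -- GB needed just to load; exceeds one 80 GB GPU
print(fp8_bytes / 1e9)   # 72.0  -- GB after quantization; would fit
```

So even though the quantized model fits on a single 80 GB card, the current load-then-quantize path never gets that far.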
Your current environment
```
(venv-vllm-54) (base) root@I1ba088648b009018e4:/hy-tmp# nvidia-smi
Tue Aug  6 10:29:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          Off | 00000000:6B:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
```
```
aiohappyeyeballs                  2.3.4
aiohttp                           3.10.1
aiosignal                         1.3.1
annotated-types                   0.7.0
anyio                             4.4.0
async-timeout                     4.0.3
attrs                             24.1.0
certifi                           2024.7.4
charset-normalizer                3.3.2
click                             8.1.7
cloudpickle                       3.0.0
cmake                             3.30.2
datasets                          2.20.0
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
exceptiongroup                    1.2.2
fastapi                           0.112.0
filelock                          3.15.4
frozenlist                        1.4.1
fsspec                            2024.5.0
h11                               0.14.0
httpcore                          1.0.5
httptools                         0.6.1
httpx                             0.27.0
huggingface-hub                   0.24.5
idna                              3.7
importlib-metadata                8.2.0
importlib-resources               6.4.0
interegular                       0.3.3
jinja2                            3.1.4
jsonschema                        4.23.0
jsonschema-specifications         2023.12.1
lark                              1.1.9
llvmlite                          0.41.1
lm-format-enforcer                0.10.3
MarkupSafe                        2.1.5
mpmath                            1.3.0
msgpack                           1.0.8
multidict                         6.0.5
multiprocess                      0.70.16
nest-asyncio                      1.6.0
networkx                          3.1
ninja                             1.11.1.1
numba                             0.58.1
numpy                             1.24.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.555.43
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.6.20
nvidia-nvtx-cu12                  12.1.105
openai                            1.39.0
outlines                          0.0.46
packaging                         24.1
pandas                            2.0.3
pillow                            10.4.0
pip                               21.1.1
pkgutil-resolve-name              1.3.10
prometheus-client                 0.20.0
prometheus-fastapi-instrumentator 7.0.0
protobuf                          5.27.3
psutil                            6.0.0
py-cpuinfo                        9.0.0
pyairports                        2.1.1
pyarrow                           17.0.0
pyarrow-hotfix                    0.6
pycountry                         24.6.1
pydantic                          2.8.2
pydantic-core                     2.20.1
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
pytz                              2024.1
PyYAML                            6.0.1
pyzmq                             26.1.0
ray                               2.10.0
referencing                       0.35.1
regex                             2024.7.24
requests                          2.32.3
rpds-py                           0.19.1
safetensors                       0.4.4
sentencepiece                     0.2.0
setuptools                        56.0.0
six                               1.16.0
sniffio                           1.3.1
starlette                         0.37.2
sympy                             1.13.1
tiktoken                          0.7.0
tokenizers                        0.19.1
torch                             2.4.0
torchvision                       0.19.0
tqdm                              4.66.5
transformers                      4.43.4
triton                            3.0.0
typing-extensions                 4.12.2
tzdata                            2024.1
urllib3                           2.2.2
uvicorn                           0.30.5
uvloop                            0.19.0
vllm                              0.5.4
vllm-flash-attn                   2.6.1
watchfiles                        0.22.0
websockets                        12.0
xformers                          0.0.27.post2
xxhash                            3.4.1
yarl                              1.9.4
zipp                              3.19.2
```
🐛 Describe the bug
I downloaded a Qwen2-72B-Instruct model onto my machine with 1x A100-80G. I tried to use the following command to start the server, but it reports out of memory.
When I test loading an fp16 model with quantization set to fp8, I see VRAM usage increase at first; once the model has finished loading, VRAM consumption drops sharply to about half of the peak (e.g., when loading Qwen2-7B, VRAM rises to 14 GB at first and falls to 7 GB after the model is quantized to fp8).
In my opinion, this bug is caused by the loading-and-quantization strategy: if the system quantized the parameters layer by layer, or loaded the full set of parameters into CPU RAM before quantizing, the problem would be solved.
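The layer-by-layer idea can be sketched with simple peak-memory arithmetic (this is an illustration of the proposal, not vLLM's actual loader; the layer count and parameter split are assumed to roughly match the Qwen2-7B observation above):

```python
# Sketch: peak GPU memory of two loading strategies when quantizing
# fp16 weights (2 bytes/param) down to fp8 (1 byte/param).

def peak_load_all_then_quantize(params_per_layer: int, n_layers: int) -> int:
    """Current behavior: every layer is resident at fp16 before quantization.
    Peak = the full fp16 model."""
    return 2 * params_per_layer * n_layers

def peak_layer_by_layer(params_per_layer: int, n_layers: int) -> int:
    """Proposed behavior: quantize each layer as it is loaded.
    Peak = all-but-one layer already at fp8, plus one fp16 layer in flight."""
    return 1 * params_per_layer * (n_layers - 1) + 2 * params_per_layer

# Hypothetical numbers: ~7e9 params spread evenly over 28 layers.
p, n = 7_000_000_000 // 28, 28
print(peak_load_all_then_quantize(p, n) / 1e9)  # 14.0 -- matches the observed 14 GB peak
print(peak_layer_by_layer(p, n) / 1e9)          # 7.25 -- close to the post-quantization 7 GB
```

With layer-wise quantization the peak is only one fp16 layer above the final fp8 footprint, which is why the observed transient spike to twice the steady-state VRAM would mostly disappear.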