pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

The memory occupied by the model becomes larger after it is loaded into the GPU #2022

Open koking0 opened 1 year ago

koking0 commented 1 year ago

🐛 Describe the bug

We have two GPT2 models. Model 1 has only 110 million parameters and Model 2 has 3.5 billion parameters; both store their weights as 16-bit floating-point numbers.

However, after loading both with the same handler into pytorch/torchserve:latest-gpu, Model 2 occupies about 7 GB of GPU memory, which is consistent with our calculation, but Model 1 occupies about 1 GB of GPU memory, which is far more than we expected.
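For reference, this is the rough calculation we used for the expected fp16 weight footprint (2 bytes per parameter, ignoring activations and any runtime overhead):

# Back-of-the-envelope fp16 weight sizes: 2 bytes per parameter.
params_model1 = 110_000_000
params_model2 = 3_500_000_000
print(f"Model 1: ~{params_model1 * 2 / 1024**2:.0f} MiB")  # ~210 MiB
print(f"Model 2: ~{params_model2 * 2 / 1024**3:.1f} GiB")  # ~6.5 GiB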

Error logs

Model 1 size:

$ ll -h conversation/
total 250M
-rw-rw-r-- 1 alex alex  848 Dec  5 12:45 config.json
-rw-rw-r-- 1 alex alex 250M Dec  5 12:45 pytorch_model.bin

Model 2 size:

$ ll -h gpt2/
total 6.7G
-rw-rw-r-- 1 alex alex  858 Nov 14 18:33 config.json
-rw-rw-r-- 1 alex alex 6.7G Nov 14 18:34 pytorch_model.bin

watch -n 1 nvidia-smi

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    155043      C   /home/venv/bin/python            1227MiB |
|    4   N/A  N/A    396432      C   /home/venv/bin/python            7507MiB |
|    6   N/A  N/A    155043      C   /home/venv/bin/python            1211MiB |
|    7   N/A  N/A    514629      C   /home/venv/bin/python             945MiB |
+-----------------------------------------------------------------------------+

Model 1 is on GPU 7 and Model 2 is on GPU 4. The other two processes, on GPU 0 and GPU 6, can be ignored.

Installation instructions

Using the following command:

docker run -it -d --name torchserve --gpus '"device=0,1,2,3,4,5,6,7"' -p 18080:8080 -p 18081:8081 -p 18082:8082 -p 17070:7070 -p 17071:7071 -v ./config.properties:/home/model-server/config.properties -v ./torchserve/model-store:/home/model-server/model-store pytorch/torchserve:latest-gpu

Model Packaging

Model 1: https://huggingface.co/IDEA-CCNL/Wenzhong-GPT2-110M
Model 2: https://huggingface.co/IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese

Both models were fine-tuned before packaging.

Packaging Model 1:

torch-model-archiver --model-name insurance_chat    --force --version 1.0 --serialized-file /home/AppealGenerate/saved_model/conversation/pytorch_model.bin  --handler /data/nlg_pipeline/gpt2/dialog/handler.py    --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/home/AppealGenerate/saved_model/conversation/config.json"

Packaging Model 2:

torch-model-archiver --model-name insurance_chat_3.5B    --force --version 1.0 --serialized-file /data/nlg_pipeline/gpt2/dialog/models/finetune/gpt2/pytorch_model.bin  --handler /data/nlg_pipeline/gpt2/dialog/handler.py    --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/data/nlg_pipeline/gpt2/dialog/models/finetune/gpt2/config.json"

The two commands are nearly identical; the only difference is the model paths.

config.properties

No response

Versions

$ docker exec torchserve  pip list
Package                Version
---------------------- ------------
accelerate             0.15.0
captum                 0.5.0
diffusers              0.9.0
huggingface-hub        0.11.1
jieba                  0.42.1
matplotlib             3.5.2
matplotlib-inline      0.1.6
numpy                  1.22.4
oss2                   2.16.0
packaging              21.3
pandas                 1.4.2
Pillow                 9.0.1
pip                    22.3.1
pycryptodome           3.16.0
requests               2.28.0
rouge                  1.0.1
scikit-learn           1.1.3
scipy                  1.9.3
sentence-transformers  2.2.2
sentencepiece          0.1.97
tokenizers             0.13.2
torch                  1.11.0+cu102
torch-model-archiver   0.6.0
torchserve             0.6.0
torchtext              0.12.0
torchvision            0.12.0+cu102
transformers           4.25.1

Repro instructions

The steps are similar for both models; they are shown here for Model 1.

git pull
curl -X DELETE http://localhost:18081/models/insurance_chat
rm -rf /data/liuzhaofeng/torchserve/model-store/insurance_chat.mar

torch-model-archiver --model-name insurance_chat    --force --version 1.0 --serialized-file /home/AppealGenerate/saved_model/conversation/pytorch_model.bin  --handler /data/nlg_pipeline/gpt2/dialog/handler.py    --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/home/AppealGenerate/saved_model/conversation/config.json"

curl -X POST "http://localhost:18081/models?url=insurance_chat.mar"
curl -X PUT "http://localhost:18081/models/insurance_chat?min_worker=1"

Possible Solution

Shouldn't Model 1 use only a little more than 200 MB of GPU memory?

mreso commented 1 year ago

Hi, I think what you're seeing is the memory overhead of the CUDA context, which is created when the first CUDA-related op is executed (see e.g. here). The size of the context is device dependent, but on a V100 it's about 807 MB for me. For the bigger model it is still there, but it does not stand out as much. You can find out the size of the CUDA context for your device by creating a small one-element tensor on the GPU and then looking at the nvidia-smi output.
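For example, a minimal sketch like this should show the context overhead in isolation (the exact number depends on your GPU, driver, and CUDA version):

import torch

# Creating the first tensor on a GPU initializes the CUDA context for that device.
x = torch.zeros(1, device="cuda:0", dtype=torch.float16)
torch.cuda.synchronize()

# The tensor itself is tiny, so nvidia-smi's reported usage for this process
# is almost entirely the CUDA context plus the allocator's initial reservation.
print(torch.cuda.memory_allocated())  # only a few hundred bytes for the tensor
input("Check nvidia-smi in another terminal, then press Enter to exit.")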

To measure the actual memory occupied by your model's tensors, you can use torch.cuda.memory_summary() or torch.cuda.memory_allocated(). See here for more details.
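A minimal sketch of how those counters can be used, assuming the small model loads with transformers' GPT2LMHeadModel (outside of TorchServe, just to isolate the measurement):

import torch
from transformers import GPT2LMHeadModel

# Illustrative only: load the 110M-parameter model in fp16 on the GPU.
# In TorchServe this would happen inside the handler instead.
model = GPT2LMHeadModel.from_pretrained("IDEA-CCNL/Wenzhong-GPT2-110M").half().cuda()

# memory_allocated() reports only what the caching allocator holds for tensors,
# i.e. roughly the ~210 MiB of fp16 weights, excluding the CUDA context.
print(f"{torch.cuda.memory_allocated() / 1024**2:.0f} MiB allocated by tensors")
print(torch.cuda.memory_summary(abbreviated=True))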