koking0 opened this issue 1 year ago
Hi, I think what you're seeing is the memory overhead of the CUDA context, which is created after you execute the first CUDA-related op (see e.g. here). The size of the context is device dependent, but on a V100 it's about 807 MB for me. For the bigger model it's still there, but it does not stand out as much. You can find out the size of the CUDA context for your device by creating a small one-element tensor on the GPU and then looking at the nvidia-smi output.
To measure the actual memory occupied by the tensors of your model, you can use torch.cuda.memory_summary() or torch.cuda.memory_allocated(). See here for more details.
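For reference, a minimal sketch of that measurement (nothing here is specific to this issue's models):

```python
import torch

# The first CUDA op creates the CUDA context; a one-element tensor is enough.
x = torch.ones(1, device="cuda")
torch.cuda.synchronize()

# Memory held by live tensors and by PyTorch's caching allocator -- only a few
# bytes / one small block here.
print(torch.cuda.memory_allocated())  # bytes occupied by live tensors
print(torch.cuda.memory_reserved())   # bytes reserved by the caching allocator
print(torch.cuda.memory_summary())    # detailed per-pool breakdown

# nvidia-smi will still report several hundred MB for this process: that gap is
# the CUDA context, which none of the calls above includes.
```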
🐛 Describe the bug
We have two GPT-2 models. Model 1 has only 110 million parameters, stored as 16-bit floating-point numbers. Model 2 has 3.5 billion parameters, also stored as 16-bit floating-point numbers.
However, after loading both with the same handler into the torchserve:latest-gpu image, Model 2 occupies about 7 GB of GPU memory, which is consistent with our calculation, but Model 1 occupies about 1 GB, which is far more than we expected.
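For reference, the weight memory implied by those parameter counts at 2 bytes per fp16 parameter (a back-of-the-envelope sketch using the nominal sizes quoted above):

```python
# Lower bound on GPU memory needed just for the fp16 weights.
for name, n_params in [("Model 1", 110e6), ("Model 2", 3.5e9)]:
    weight_bytes = n_params * 2  # fp16 -> 2 bytes per parameter
    print(f"{name}: {weight_bytes / 1024**2:.0f} MiB "
          f"({weight_bytes / 1024**3:.2f} GiB) of weights")
# -> roughly 210 MiB for Model 1 and 6.5 GiB for Model 2
```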
Error logs
Model 1 size:
Model 2 size:
watch -n 1 nvidia-smi
Model 1 is on GPU 7 and Model 2 is on GPU 4; ignore the other two models on GPU 0 and GPU 6.
Installation instructions
using command:
Model Packaging
Model 1: https://huggingface.co/IDEA-CCNL/Wenzhong-GPT2-110M Model 2: https://huggingface.co/IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese
Both were fine-tuned before packaging.
Packaging model 1:
Packaging model 2:
The two commands are very similar. The only difference is the path of the model.
config.properties
No response
Versions
Repro instructions
The repro steps are essentially the same for both models.
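A hedged sketch of exercising each served model through TorchServe's inference API; the registered model names, port, and payload below are assumptions, not the exact requests used here:

```python
import requests

# TorchServe serves inference on port 8080 by default; "model1"/"model2" and the
# prompt payload are hypothetical placeholders for the two registered models.
for model_name in ("model1", "model2"):
    resp = requests.post(
        f"http://localhost:8080/predictions/{model_name}",
        data="sample prompt",
    )
    print(model_name, resp.status_code, resp.text[:200])

# GPU memory can be watched in another terminal with `watch -n 1 nvidia-smi`.
```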
Possible Solution
Shouldn't Model 1 occupy only a bit over 200 MB of GPU memory?