Hi @BingxuZhu ,
Thx for your analysis of the output log. Actually it is not a bug in our code, but rather the coarse measurement of CPU virtual memory usage.
Basically, we report CPU virtual memory usage using the psutil Python package, as in the code line here. psutil.virtual_memory() monitors global CPU virtual memory usage, not just the memory usage of our single DeepSpeed process, so the measurement is too coarse for collecting the DeepSpeed process's own CPU virtual memory usage. That is why, in your log above, there is already around 20% CPU virtual memory usage at "Before initializing optimizer states".
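If you want a finer-grained view for debugging, a minimal sketch using psutil's per-process API could look like the following. This is just an illustration of the global vs. per-process difference, not what our logging code currently does:

import os
import psutil

def report_memory(tag: str) -> None:
    # Machine-wide view: this is what the coarse log line reflects.
    vm = psutil.virtual_memory()
    # Per-process view: resident memory of this one process only.
    proc = psutil.Process(os.getpid())
    rss_gb = proc.memory_info().rss / 2**30
    print(f"[{tag}] global: {vm.percent:.1f}% of {vm.total / 2**30:.1f} GB | "
          f"this process: {rss_gb:.2f} GB RSS ({proc.memory_percent():.1f}%)")

report_memory("Before initializing optimizer states")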
Hope it answers your questions.
I really appreciate your reply, @GuanhuaWang, thank you so much!
So is there a better solution, or a way to monitor the DeepSpeed processes at a finer granularity?
Hi @GuanhuaWang,
I found an odd behavior while testing the ratio parameter, using the most basic example provided by Hugging Face transformers (which documents the DeepSpeed integration link here). When the ratio parameter is set anywhere from 0.0 to 0.9, each GPU fills its full memory and training takes about 2 minutes. When the ratio parameter is set to 1.0, each GPU only uses about 60% of its memory and training takes about 12 minutes. Here is the script I run:
#!/bin/bash
deepspeed --hostfile=hostfile --num_nodes=1 --num_gpus 8 examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path t5-3b --per_device_train_batch_size 1 \
--output_dir output_dir1 --overwrite_output_dir --fp16 \
--do_train --max_train_samples 300 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " \
--source_lang en --target_lang ro \
--learning_rate 5e-7
Here is tests/deepspeed/ds_config_zero3.json; the only field I change between runs is "ratio" under "offload_optimizer":
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 5e-7,
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true,
"ratio": 0.0 ##only change it
}
},
"prescale_gradients": false,
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
This behavior does not make sense for the offload ratio parameter. Even granting your point above that the psutil Python package is only a coarse-grained measure of the node's CPU virtual memory, the training-time difference is still contradictory. Why does the offload ratio parameter not work? The expected effect should be that different ratios correspond to different CPU virtual memory usage and different GPU memory footprints.
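For reference, here is the roughly linear behavior I would expect from a partial-offload ratio, as a back-of-envelope sketch. The assumptions are mine, not taken from the DeepSpeed source: that ratio is the fraction of optimizer state held on CPU, and that fp32 Adam state costs about 16 bytes per parameter (master weights, momentum, variance, fp32 gradients):

# Expected CPU/GPU split of optimizer state as a function of ratio.
# Assumptions (mine, not from the DeepSpeed code): ratio = fraction of
# optimizer state on CPU; ~16 bytes of fp32 Adam state per parameter.
BYTES_PER_PARAM = 16
N_PARAMS = 3e9  # t5-3b

for ratio in [0.0, 0.25, 0.5, 0.75, 1.0]:
    total_gb = N_PARAMS * BYTES_PER_PARAM / 2**30
    cpu_gb = ratio * total_gb
    gpu_gb = total_gb - cpu_gb
    print(f"ratio={ratio:.2f}: ~{cpu_gb:5.1f} GB on CPU, ~{gpu_gb:5.1f} GB left on the GPUs")

Under this model, every step in ratio should shift several GB between CPU and GPUs, instead of the all-or-nothing jump between 0.9 and 1.0 that I observe.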
I'm guessing that the Hugging Face DeepSpeed integration doesn't work well with the latest version of the DeepSpeed library. Can Twin-Offload only be used through Megatron-DeepSpeed?
Hello, thank you for your contribution to twin-offload. When I tried to run ds_pretrain_gpt_2.7B.sh on Megatron-DeepSpeed with the latest "offload_optimizer": "ratio" parameter, I tried setting the parameter value from 0.0 to 1.0. I found that during training, the CPU virtual memory was the same when the ratio parameter was set anywhere from 0.0 to 0.4, and likewise the same when it was set anywhere from 0.5 to 1.0. Here is what happened with the scripts and arguments I used, and the CPU usage.
If I set the ratio parameter to 0.0, 0.1, 0.2, 0.3, or 0.4, the CPU virtual memory in the output log is about 51 GB, and the percentage is about 27%. It also seems that CPU memory decreases after "initializing optimizer states". Why?
Similarly, when I set the ratio parameter to 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0, the CPU virtual memory in the output log is about 89 GB, and the percentage is about 47%.
For the ds_pretrain_gpt_2.7B.sh script: compared with the 350M.sh script from the Zero-offload++ Tutorials in the offload_pp directory, I only changed the model size and some necessary dataset configuration. I don't know why this happened. I am eager to use the Twin-Flow partial offload function; I hope you can answer me, thank you.
This is my lab environment: Tesla V100-SXM2-16GB * 8, Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz, 187 GB total CPU memory, DeepSpeed 0.12.4.
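Back-of-envelope, in case it helps pinpoint the issue (same hypothetical 16-bytes-per-parameter assumption as above, not a number from the DeepSpeed source):

# Rough expectation for the 2.7B run under my assumptions.
N_PARAMS = 2.7e9
BYTES_PER_PARAM = 16  # assumed fp32 Adam state per parameter

full_offload_gb = N_PARAMS * BYTES_PER_PARAM / 2**30  # ratio = 1.0
print(f"all optimizer state on CPU: ~{full_offload_gb:.0f} GB")
for ratio in [0.0, 0.2, 0.4, 0.5, 0.8, 1.0]:
    print(f"ratio={ratio:.1f}: expected extra CPU memory ~{ratio * full_offload_gb:.0f} GB")

If this accounting is roughly right, the ~38 GB jump between the two plateaus (51 GB vs. 89 GB) is close to the full ~40 GB of optimizer state, which makes the behavior look like an all-or-nothing switch around ratio 0.5 rather than a linear split.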