microsoft / DeepSpeedExamples

Example models using DeepSpeed

My deepspeed code is very slow #172

Open zhaowei-wang-nlp opened 2 years ago

zhaowei-wang-nlp commented 2 years ago

2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time

Hi everyone, I am using ZeRO stage 3 and I see the above message at every step. The training process is very slow. How can I change my config to speed it up? My config:

    {
      "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
      },
      "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 5e8,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 5e8,
        "stage3_max_reuse_distance": 5e8,
        "stage3_gather_fp16_weights_on_model_save": true
      },
      "gradient_accumulation_steps": "auto",
      "gradient_clipping": "auto",
      "steps_per_print": 2000,
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "wall_clock_breakdown": false
    }
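For reference, the warning's suggestion amounts to calling torch.cuda.empty_cache() at a fixed point in every rank's training loop. A minimal sketch of where such a call could go; the engine and data loader names here are placeholders, not taken from the original post:

```python
import torch

# Hypothetical loop: `engine` stands for the object returned by
# deepspeed.initialize() and `train_loader` for an ordinary DataLoader.
for step, batch in enumerate(train_loader):
    loss = engine(batch)      # forward pass through the wrapped model
    engine.backward(loss)     # ZeRO-aware backward pass
    engine.step()             # optimizer step and gradient zeroing

    # Flush the caching allocator at the same point on every rank, as the
    # warning suggests when the flushes cannot otherwise be avoided.
    torch.cuda.empty_cache()
```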

CaralHsi commented 1 year ago

same problem

tjruwase commented 1 year ago

PyTorch allocator cache flushes are very expensive, and they indicate severe memory pressure. Can you try reducing the batch size?
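For what it's worth, applied to the config above that advice would mean replacing the "auto" batch-size entries with explicit, smaller values. A rough sketch, with illustrative numbers rather than values recommended in this thread:

```python
import deepspeed

# Illustrative only: a smaller per-GPU micro-batch relieves memory pressure,
# while gradient accumulation preserves the effective batch size.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # start small, raise if stable
    "gradient_accumulation_steps": 16,     # keeps the effective batch size up
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# `model` and `params` are placeholders for the user's own objects:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=params, config=ds_config)
```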

YiAthena commented 1 year ago

same problem

zhangyanbo2007 commented 1 year ago

same problem

joanrod commented 1 year ago

Same issue here, any updates?

lusongshuo-mt commented 1 year ago

same problem

iamlockelightning commented 1 year ago

👀

teaguexiao commented 1 year ago

same

dnaihao commented 1 year ago

Any update on this issue? I am using PyTorch Lightning. Originally I thought it was because I was passing too many things at each step, but after changing those, the problem is still there.

I have also tried reducing the batch size and setting pin_memory to False according to https://discuss.pytorch.org/t/when-to-set-pin-memory-to-true/19723 (some PyTorch versions have that issue), but with no luck.
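For context, the pin_memory change mentioned above is a DataLoader argument; a self-contained sketch of that change (the toy dataset and batch size are placeholders) looks roughly like this:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset just to keep the example self-contained.
train_dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))

# pin_memory=False is the change described in the linked forum thread;
# the remaining arguments are illustrative defaults.
train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,
    pin_memory=False,
)
```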

teaguexiao commented 1 year ago

I used 8x A100 with the same settings and the message went away.

dnaihao commented 1 year ago

Thanks @teaguexiao, I will try using more GPUs (though ours are A40s with 48 GB of memory each) to see if that helps. Thanks for sharing!

bingwork commented 11 months ago

I used 8x A100 40G, reduced the batch size, and waited about 20 minutes; there's no such message now.

wulaoshi commented 10 months ago

same problem

ghost commented 8 months ago

same problem here

Using deepspeedtorchdistributor on Databricks, loading the model with flash-attn 2.

ed-00 commented 8 months ago

Same issue here. I am running on 8 AMD MI250X GPUs with 128 GB VRAM.

Sander-houqi commented 7 months ago

same problem.

Chenhong-Zhang commented 4 months ago

For those who are still concerned about this issue, try setting your train_batch_size lower. It worked for me.

zimenglan-sysu-512 commented 4 months ago

same problem using 8 V100 GPUs

heya5 commented 4 months ago

Same problem. I would like to know whether this issue degrades the model's quality or only affects training efficiency.

absorbguo commented 3 months ago

same issue here: torch 2.1.0.dev20230424+cu117, deepspeed 0.11.0

minuenergy commented 2 months ago

> 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. [...]

Is there any solution here? I have the same problem.

Se-Hun commented 1 month ago

Same problem using 2 H100 GPUs.

Joe-Hall-Lee commented 1 month ago

Same problem.

likejazz commented 1 month ago

Same here.