srikanthmalla opened this issue 1 year ago
Hi @HeyangQin and @tjruwase, could you help me resolve this issue? It happens irrespective of training or inference, but only when offloading parameters to NVMe.
@srikanthmalla, how many GPUs are you running? Also, please share the impact of the following ds_config adjustments (a sketch of these settings follows this list):

1. Set `stage3_max_reuse_distance` to 0.
2. Set `buffer_count` in `offload_param` to 10, 15, and 20.
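For concreteness, a minimal sketch of what those adjustments could look like in a ds_config JSON. The `nvme_path` and the particular `buffer_count` value shown are illustrative placeholders, not values from this thread:

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_max_reuse_distance": 0,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "buffer_count": 10
    }
  }
}
```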
@srikanthmalla, did either of these changes help?
Hi @tjruwase, neither of them helped.
Thanks for the update. I was able to reproduce the problem. My initial look suggests that this is due to the optimization of prefetching and caching layer parameters to reduce offload overheads. The following error message shows that the `buffer_count: 5` setting of `offload_param` is eventually exceeded:

The meaning of the above message is that we are unable to add param 258 to the offload cache because it is full, containing params 259, 233, 207, 246, 220. A simplified sketch of this bookkeeping follows.
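This failure mode can be illustrated with a small Python sketch, assuming a deliberately simplified model of the buffer pool. The names `BUFFER_COUNT`, `cached_params`, and `free_buffers` are hypothetical, not DeepSpeed internals: a fixed pool of `buffer_count` swap buffers is shared between cached and incoming params, so once the cache holds `buffer_count` params, no buffer is free for the next swap-in.

```python
# Illustrative sketch only -- not DeepSpeed's actual implementation.
BUFFER_COUNT = 5  # mirrors "buffer_count": 5 in offload_param

# Params currently pinned in the reuse cache (from the message above).
cached_params = [259, 233, 207, 246, 220]

def free_buffers() -> int:
    """Buffers not held by cached params, hence free for a swap-in."""
    return BUFFER_COUNT - len(cached_params)

incoming = 258  # the param that needs to be swapped in from NVMe
if free_buffers() < 1:
    # This is the condition behind "Not enough buffers 0 for swapping 1".
    print(f"cannot add param {incoming}: cache is full with {cached_params}")
```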
I was able to work around this issue by disabling caching (i.e., `"stage3_max_reuse_distance": 0`). But since that did not work for you, I am concerned that perhaps I have reproduced a different problem. Can you please try the following to help this investigation: disable caching (`"stage3_max_reuse_distance": 0`) and disable prefetching (`"stage3_prefetch_bucket_size": 0`) to see if that combination avoids the error. A sketch of this combination is shown below.

@srikanthmalla, just to clarify, we recognize that there is an underlying issue that needs to be fixed. My requests above are only meant to help further understand this issue. Thanks!
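A sketch of that combination in ds_config form. The `nvme_path` is a placeholder, and only the keys relevant to this experiment are shown:

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_max_reuse_distance": 0,
    "stage3_prefetch_bucket_size": 0,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```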
Hi, I'm getting the same issue when using DeepSpeed 0.10.0 with Hugging Face Transformers.
```
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 297, in swap_in
    assert len(swap_in_paths) <= len(
AssertionError: Not enough buffers 0 for swapping 1
```
DeepSpeed config:

```yaml
zero_optimization:
  stage: 3
  offload_optimizer:
    device: nvme
    nvme_path: /tmp/nvme_offoad
  offload_param:
    device: nvme
    nvme_path: /tmp/nvme_offoad
```
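For what it's worth, a sketch of the earlier suggestions applied to this YAML config. Here `buffer_count: 10` is just one of the trial values suggested above and `stage3_max_reuse_distance: 0` is the caching workaround, not a confirmed fix; the `nvme_path` is kept verbatim from the config above:

```yaml
zero_optimization:
  stage: 3
  stage3_max_reuse_distance: 0  # workaround from above: disable caching
  offload_optimizer:
    device: nvme
    nvme_path: /tmp/nvme_offoad
  offload_param:
    device: nvme
    nvme_path: /tmp/nvme_offoad
    buffer_count: 10  # one of the suggested trial values (the default is 5)
```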
Hi, I am currently trying an off-the-shelf Transformers example with DeepSpeed:
The config file ds_config_zero3_nvme_offload.json has the ZeRO-3 params from the main documentation website (https://huggingface.co/docs/transformers/main_classes/deepspeed#zero3-example), like this:
I get the following error:
I don't get this error if the offload_param device is set to cpu instead of nvme. I am curious why this is happening and how to fix it. Also, this happens regardless of whether I add the aio params or remove all of them. Please let me know.
Thank you!