Closed · ccruttjr closed this issue 5 months ago
Just to be sure I understand correctly, you want to use DDP and you run out of memory? Could you please paste the full error message you get?
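For context on why plain DDP can OOM where ZeRO or FSDP don't: DDP keeps a full replica of the model weights, gradients, and optimizer state on every GPU, while ZeRO/FSDP shard them across ranks. A rough sketch of the arithmetic (parameter count and byte sizes below are illustrative assumptions, not numbers taken from this issue):

```python
# Back-of-the-envelope per-GPU memory for plain DDP (the multi-GPU default):
# every rank holds a full copy of the weights, gradients, and optimizer
# state, unlike ZeRO/FSDP which shard them across ranks. Activations are
# NOT included, so real usage is higher. Defaults assume fp16 weights and
# gradients plus fp32 Adam moments (4 + 4 bytes per parameter).
def ddp_per_gpu_gib(n_params: float, param_bytes: int = 2,
                    grad_bytes: int = 2, optim_bytes: int = 8) -> float:
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 2**30

if __name__ == "__main__":
    # A 1.1B-parameter model (TinyLlama-sized, for illustration):
    print(f"~{ddp_per_gpu_gib(1.1e9):.1f} GiB per GPU before activations")
    # prints: ~12.3 GiB per GPU before activations
```

If the per-GPU number from a sketch like this already approaches your card's VRAM, plain DDP will OOM even though the sharded setups fit.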
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
One of the `no_trainer` scripts in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
Accelerate works when I use non-distributed training, any DeepSpeed setup, and FSDP. It does not, however, work when I just select multi-GPU and leave every setting at its default. The processes seem to be running out of VRAM, even though there should be PLENTY of space. Here are the YAML config files that worked/didn't work, followed by the code and the error message. I tried it with and without `NCCL_P2P_DISABLE=1` to see if that changed anything, but to no avail. Also, jeez, running it solo is so much faster, haha. I'd love to find out what the issue is. I don't seem to be using up all my CPU RAM or processing power, and running it solo doesn't even use half of what I need according to `nvidia-smi` and `accelerate estimate-memory` with TinyLlama.

non-distributed (works)
base-distributed (doesn't work)
DeepSpeed ZeRO stage 0 (works)
FSDP (works)
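For readers without the attached files: a default multi-GPU config produced by `accelerate config` looks roughly like the sketch below. The values here (`num_processes`, `gpu_ids`, etc.) are illustrative assumptions, not the author's actual file.

```yaml
# Illustrative default multi-GPU (DDP) config, as `accelerate config`
# would generate it -- not the config attached to this issue.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
```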
Here's the code.
I'd run it via:
Here's an example of my `example/preprocessed_data.json` (not real data).

Expected behavior