Open ojh31 opened 1 month ago
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Run the following on a 2*H100 node:
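Presumably launched with something along the lines of:

```bash
accelerate launch --config_file accelerate_config.yaml foo.py
```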
accelerate_config.yaml:
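A minimal two-process multi-GPU config of roughly this shape is assumed here (values are illustrative, not the exact file):

```yaml
# Assumed values: a plain 2-process MULTI_GPU setup matching the 2*H100 node.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2
gpu_ids: all
machine_rank: 0
mixed_precision: 'no'
rdzv_backend: static
same_network: true
main_training_function: main
use_cpu: false
```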
foo.py:
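Likewise, a sketch of the kind of script described, assuming one ~7B and one ~14B causal-LM checkpoint loaded in full precision (the model names below are placeholders, not the ones from the report):

```python
# Sketch only: placeholder checkpoints standing in for the ~7B and ~14B models
# that the report says are loaded together on the same node.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

# ~7B parameters -> roughly 28 GB in fp32
model_7b = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder 7B checkpoint
    torch_dtype=torch.float32,
).to(accelerator.device)

# ~14B parameters -> roughly 56 GB in fp32
model_14b = AutoModelForCausalLM.from_pretrained(
    "some-org/some-14b-model",  # placeholder 14B checkpoint
    torch_dtype=torch.float32,
).to(accelerator.device)
```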
Expected behavior
The machine should not throw a CUDA OOM error. The two models take up roughly 4 * (14 + 7) = 84 GB, which should comfortably fit on a 2x80 GB machine. I can load the 7B model on its own after setting torch.cuda.set_per_process_memory_fraction(0.25), for example. Somehow trying to load both models causes a massive memory spike.
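A sketch of that workaround, with a placeholder checkpoint name:

```python
# Cap this process's CUDA allocator at 25% of the device before loading;
# with this in place the 7B model loads on its own without OOM.
import torch
from transformers import AutoModelForCausalLM

torch.cuda.set_per_process_memory_fraction(0.25, device=0)
model_7b = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf"  # placeholder 7B checkpoint
).to("cuda:0")
```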