VincentVanNF opened this issue 1 month ago
For the 72B text model, I used 4 × 4 H100 96GB (16 GPUs in total). CPT was fine, but SFT went OOM (cutoff = 4096 with DeepSpeed ZeRO-3).
It is fine for me to use LoRA for SFT, so this doesn't block me.
If we are seeing OOM on the 72B text model with 256 GB more VRAM than your setup, your VL model will likely not work either. (Correct me if I'm wrong.)
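As a rough sanity check (standard mixed-precision AdamW accounting, not figures from this thread): full-parameter SFT keeps about 16 bytes of model state per parameter (2 B bf16 weights + 2 B bf16 gradients + 12 B for the fp32 master weights, momentum, and variance), so a 72B model needs roughly 72e9 × 16 B ≈ 1.15 TB before any activations. Even sharded with ZeRO-3 across 16 GPUs, that is ≈72 GB per card, leaving little headroom for activations at cutoff 4096 on 96 GB devices.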
Could you share your config for 72B LoRA SFT? @jedcheng
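In the meantime, here is a hypothetical sketch of what such a launch could look like with LLaMA-Factory's CLI; the model path, dataset, template, LoRA hyperparameters, and output directory are placeholder assumptions, not jedcheng's actual settings:

```bash
# Hypothetical sketch, not jedcheng's config: LoRA SFT of a 72B model
# with LLaMA-Factory. ZeRO-3 still shards the frozen base weights.
llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path Qwen/Qwen2-72B-Instruct \
    --dataset alpaca_en_demo \
    --template qwen \
    --finetuning_type lora \
    --lora_rank 8 \
    --lora_target all \
    --cutoff_len 4096 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --num_train_epochs 3 \
    --bf16 true \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --output_dir saves/qwen2-72b-lora-sft
```

With LoRA, only the adapter weights carry gradients and optimizer states, so the ~16 B/param model-state cost above drops to roughly 2 B/param for the frozen base plus a small adapter overhead.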
Reminder
System Info
According to the GPU-memory estimate in the documentation, the 70B model needs roughly 600 GB. I am already using two nodes of A100 80GB, 16 GPUs in total, which is far more than 600 GB, yet every GPU still goes OOM.
Training script:
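(The script itself is not reproduced above. Purely as an illustrative sketch, a two-node, 16-GPU full-parameter SFT launch of this kind typically looks like the following; the master address, model path, dataset, and output directory are placeholder assumptions.)

```bash
# Illustrative sketch only -- not the poster's actual script.
# Two nodes x 8 A100 80GB; run once per node with the matching --node_rank
# (0 on the master node, 1 on the other).
torchrun --nnodes 2 --nproc_per_node 8 \
    --node_rank 0 --master_addr 192.168.0.1 --master_port 29500 \
    src/train.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/70b_model \
    --dataset my_sft_data \
    --template default \
    --finetuning_type full \
    --cutoff_len 4096 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --bf16 true \
    --deepspeed ds_z3_config.json \
    --output_dir saves/70b-full-sft
```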
The ds_z3_config.json used:
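(The file contents are likewise not shown above. For reference, a stock ZeRO-3 config of the kind LLaMA-Factory ships under examples/deepspeed looks roughly like the sketch below; when plain ZeRO-3 still OOMs, the usual next step is adding offload_optimizer / offload_param sections that target the CPU.)

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```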
Error:
Reproduction
Expected behavior
No response
Others
No response