asesorov opened this issue 3 weeks ago
Llama2 70B doesn't fit on 2 GPUs. Please use all 8 GPUs in the DGX H100 node.
The configuration file for DGX H100 is: https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/config_DGXH100_1x8x4xtp4pp1cp1.sh
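A rough back-of-envelope sketch of why 2 GPUs is not enough (this ignores activation, gradient, and optimizer memory, and the exact framework overhead — it only counts bf16 base weights, which is the dominant term for a frozen-base LoRA run):

```python
# Back-of-envelope memory estimate: Llama2 70B base weights in bf16
# versus per-GPU HBM on a DGX H100 (80 GB per GPU).
PARAMS = 70e9          # parameter count
BYTES_PER_PARAM = 2    # bf16
H100_MEM_GB = 80       # HBM per H100

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~140 GB for weights alone

# 2-way split (2 GPUs) vs the 4-way tensor-parallel sharding implied by
# the 8-GPU config's filename (tp4).
for shards in (2, 4):
    per_gpu = weights_gb / shards
    headroom = H100_MEM_GB - per_gpu
    print(f"{shards}-way shard: {per_gpu:.0f} GB weights/GPU, "
          f"{headroom:.0f} GB headroom")
```

On 2 GPUs the weights alone leave only ~10 GB of headroom per GPU, which is not enough for activations, gradients, LoRA optimizer states, and CUDA/framework overhead — hence the OOM-style CUDA failures.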
Hello, I'm currently trying to reproduce NVIDIA's Llama2 70B LoRA results on DGX H100. I applied the fixes from https://github.com/mlcommons/training_results_v4.0/issues/5, but I'm hitting a CUDA error:
Full logs: mlcommons_lora.log
I'm running on a single node with 2 GPUs; here's the environment config: config_1_node_DGXH100.txt
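For anyone landing here later: the linked config's filename (`1x8x4xtp4pp1cp1`) encodes 1 node x 8 GPUs x micro-batch 4, with tensor parallelism 4, pipeline parallelism 1, and context parallelism 1. A sketch of the relevant settings is below — the variable names are assumptions based on typical MLPerf NVIDIA submission configs, so defer to the actual file for the authoritative names and values:

```shell
# Assumed contents of config_DGXH100_1x8x4xtp4pp1cp1.sh (illustrative only;
# check the linked file for the real variable names and values).
export DGXNNODES=1   # single DGX H100 node
export DGXNGPU=8     # all 8 GPUs, not 2
export TP=4          # tensor parallelism: shards the 70B weights 4 ways
export PP=1          # no pipeline parallelism
export CP=1          # no context parallelism
```

With a 2-GPU setup there is no parallelism layout that fits the bf16 base weights plus training state in 160 GB total, so switching to the 8-GPU config (rather than patching the 2-GPU one) is the intended fix.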