mlcommons / training_results_v4.0

This repository contains the results and code for the MLPerf™ Training v4.0 benchmark.
https://mlcommons.org/benchmarks/training
Apache License 2.0

Cuda error when reproducing Llama2 70B finetune #7

Open asesorov opened 3 weeks ago

asesorov commented 3 weeks ago

Hello, I'm currently trying to reproduce NVIDIA's Llama2 70B results on a DGX H100. I applied the fixes from https://github.com/mlcommons/training_results_v4.0/issues/5, but I'm hitting the following CUDA error:

Failed: Cuda error /workspace/ft-llm/TransformerEngine/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:300 'operation not supported'

Full logs: mlcommons_lora.log

I'm running on a single node with 2x GPUs; here's the environment config: config_1_node_DGXH100.txt

matthew-frank commented 6 days ago

Llama2 70B doesn't fit on 2 GPUs. Please use all 8 GPUs in the DGX H100 node.

The configuration file for DGX H100 is: https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/config_DGXH100_1x8x4xtp4pp1cp1.sh
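The sizing argument can be illustrated with rough arithmetic. This is only a sketch, assuming bf16 weights (2 bytes/parameter) and 80 GB per H100; it counts weight memory alone and ignores optimizer states, gradients, activations, and LoRA overhead, all of which make the 2-GPU case even worse. The `tp4` in the config filename indicates tensor parallelism of degree 4:

```python
# Back-of-the-envelope memory estimate (sketch; assumes bf16 weights, 80 GB H100)
PARAMS = 70e9          # Llama2 70B parameter count
BYTES_PER_PARAM = 2    # bf16
H100_MEM_GB = 80

def weight_gb_per_gpu(tp: int) -> float:
    """Weight memory per GPU when weights are sharded across tp tensor-parallel ranks."""
    return PARAMS * BYTES_PER_PARAM / tp / 1e9

# Sharding over only 2 GPUs: 70 GB of weights per GPU, nearly the whole card
print(weight_gb_per_gpu(2))  # 70.0

# The reference config's TP=4 sharding: 35 GB per GPU, leaving room for
# activations, gradients, and optimizer state
print(weight_gb_per_gpu(4))  # 35.0
```

With TP=2 the weights alone consume 70 of the 80 GB before any training state is allocated, which is why the reference configuration spreads the model across all 8 GPUs.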