mlcommons / training_results_v4.0

This repository contains the results and code for the MLPerf™ Training v4.0 benchmark.
https://mlcommons.org/benchmarks/training
Apache License 2.0

Cuda error when reproducing Llama2 70B finetune #7

Open asesorov opened 3 weeks ago

asesorov commented 3 weeks ago

Hello, I'm currently trying to reproduce NVIDIA's Llama2 70B results on a DGX H100. I applied the fixes from https://github.com/mlcommons/training_results_v4.0/issues/5, but I'm hitting the following CUDA error:

Failed: Cuda error /workspace/ft-llm/TransformerEngine/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:300 'operation not supported'

Full logs: mlcommons_lora.log

I'm running on a single node with 2x GPUs; here's the environment config: config_1_node_DGXH100.txt

matthew-frank commented 6 days ago

Llama2 70B doesn't fit on 2 GPUs. Please use all 8 GPUs in the DGX H100 node.

The configuration file for DGX H100 is: https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/config_DGXH100_1x8x4xtp4pp1cp1.sh
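The sizing argument can be illustrated with rough arithmetic. This is only a sketch, assuming bf16 weights (2 bytes/parameter) and 80 GB per H100; it counts weight memory alone and ignores optimizer states, gradients, activations, and LoRA overhead, all of which make the 2-GPU case even worse. The `tp4` in the config filename indicates tensor parallelism of degree 4:

```python
# Back-of-the-envelope memory estimate (sketch; assumes bf16 weights, 80 GB H100)
PARAMS = 70e9          # Llama2 70B parameter count
BYTES_PER_PARAM = 2    # bf16
H100_MEM_GB = 80

def weight_gb_per_gpu(tp: int) -> float:
    """Weight memory per GPU when weights are sharded across tp tensor-parallel ranks."""
    return PARAMS * BYTES_PER_PARAM / tp / 1e9

# Sharding over only 2 GPUs: 70 GB of weights per GPU, nearly the whole card
print(weight_gb_per_gpu(2))  # 70.0

# The reference config's TP=4 sharding: 35 GB per GPU, leaving room for
# activations, gradients, and optimizer state
print(weight_gb_per_gpu(4))  # 35.0
```

With TP=2 the weights alone consume 70 of the 80 GB before any training state is allocated, which is why the reference configuration spreads the model across all 8 GPUs.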