mlcommons / training_results_v4.0

This repository contains the results and code for the MLPerf™ Training v4.0 benchmark.
https://mlcommons.org/benchmarks/training

Error When Reproducing Nvidia's Llama2-70B-LoRA Results #5

Open mrmhodak opened 2 months ago

mrmhodak commented 2 months ago

Hello,

When trying to reproduce Nvidia's results on a DGX H100, the code fails with a segmentation fault - see the attached file.

We have found that the error disappears when TP_COMM_OVERLAP is set to FALSE, but then the run takes 38 minutes instead of the expected ~28 minutes.
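
For context, a minimal sketch of the toggle we mean, assuming the flag is exported as an environment variable in the single-node config script used for the run (the exact file name comes from the submission and is not repeated here):

# Assumed config-script toggle: disabling tensor-parallel communication overlap
# avoids the segfault, at the cost of roughly 10 minutes of runtime (38 vs ~28 min).
export TP_COMM_OVERLAP=False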

Please help us resolve this. mpi_error_message_1 2.txt

mmarcinkiewicz commented 2 months ago

Are you using slurm+enroot+pyxis? Or docker?

EDIT: I see "docker exec". You need slurm+enroot+pyxis to enable TP_COMM_OVERLAP and get the perf
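
For reference, a minimal sketch of what a slurm+enroot+pyxis launch looks like; the image name, mounts, and script below are placeholders rather than the submission's actual launch scripts:

# Hypothetical single-node pyxis launch; --container-image and --container-mounts
# are provided by the pyxis slurm plugin, with enroot as the container runtime.
srun --nodes=1 --ntasks-per-node=8 --mpi=pmix \
     --container-image=./mlperf_llama2_lora.sqsh \
     --container-mounts=/raid/data:/data,/raid/model:/model \
     ./run_and_time.sh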

mmarcinkiewicz commented 2 months ago

How did you even get the docker run script? We deprecated that file and did not include it in our submission. Did you dig it up from one of our previous submissions?

mrmhodak commented 2 months ago

Hi, we have a single-node system, so we looked at Dell's submission, which has directions for single-node execution using docker: https://github.com/mlcommons/training_results_v4.0/tree/main/Dell/benchmarks/llama2_70b_lora/implementations/pytorch

My assumption is that they worked with Nvidia.

mmarcinkiewicz commented 2 months ago

Ok, I see. Dell also ran with the overlap off, and it indeed cost them ~2 minutes.

We are pretty busy with our new submission, so I don't think we have the time to chase it this late. Two quick questions:

  1. Is the data on RAID? Is the RAID array configured as RAID0? (A quick check is sketched after these questions.)
  2. How did you get the container running? There were issues with floating libraries in one of our dependencies. We'll sort it out for the next submission but sadly it's non-trivial to repro old results. Maybe you've made some changes there to make it work?
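
For what it's worth, a quick way to confirm the RAID0 layout on the node (a sketch assuming a standard Linux software RAID; the /dev/md0 device name is a placeholder):

# Show the block-device topology and the RAID level of the data array (run as root).
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT
cat /proc/mdstat               # e.g. "md0 : active raid0 ..."
mdadm --detail /dev/md0        # placeholder device; should report "Raid Level : raid0"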

blevai commented 2 months ago

The docker approach won't work. The docker code in the Dell 4.0 submission is most likely leftover from a failed attempt to bypass slurm and run the same training code with docker instead.

However, when we tried to run the submitted code with slurm (slurm+enroot+pyxis), we ran into some permission issues.

slurm-32-1.txt

@mmarcinkiewicz could you please check the logs? Maybe we made some trivial mistake.

nehmathe commented 2 months ago

Hi @mmarcinkiewicz,

  1. Yes, the data is on RAID and the array is configured as RAID0.

balazslevai-htec commented 1 month ago

Hi @mmarcinkiewicz,

To be able to try slurm + enroot + pyxis, we had to make some changes to the submission:

After these modifications, we ran into another Python error that we cannot figure out.

slurm-219.txt

Could you please take a look at it?

mmarcinkiewicz commented 1 month ago

Here's how to build a working container (tested on our side):

Replace the upstream NeMo https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/Dockerfile#L25

RUN git clone https://github.com/NVIDIA/NeMo.git && \
    cd NeMo && \
    echo NEMO_REVISION=${NEMO_REVISION} && \
    git checkout ${NEMO_REVISION} && \
    echo NEMO_COMMIT_HASH=$(git rev-parse HEAD) && \
    pip install --no-build-isolation -e ".[nlp]"

with

RUN git clone https://github.com/ggruza/NeMo.git && \
    cd NeMo && \
    echo NEMO_REVISION=${NEMO_REVISION} && \
    git checkout v2.0.0.rc0.beta_modified && \
    pip install --no-build-isolation -e ".[nlp]"

(please mind that the fork won't be there forever)

Also, please go to https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/requirements.txt and add:

botocore==1.34.104
datasets==2.19.1
huggingface-hub==0.23.0
inflect==7.2.1
more-itertools==10.2.0
numcodecs==0.12.1
portalocker==2.8.2
pretty-errors==1.2.25
pytorch-lightning==2.2.4
requests==2.31.0
s3transfer==0.10.1
safetensors==0.4.3
sentry-sdk==2.1.1
torchmetrics==1.4.0
tqdm==4.66.2
transformers==4.40.2
typeguard==4.2.1
wandb==0.17.0

Sorry, I know it's a hassle; we've fixed that in 4.1.
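
Once both changes are in place, rebuilding is just a standard image build from the implementation directory; a sketch with a placeholder tag:

# After editing the Dockerfile and requirements.txt as described above,
# rebuild the image and rerun.
cd NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo
docker build -t mlperf-llama2-lora:repro .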

mmarcinkiewicz commented 1 month ago

@blevai there's

0: g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
0: /usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system
0: collect2: error: ld returned 1 exit status

in your log. Can you make /usr/bin writable?

mmarcinkiewicz commented 1 month ago

@balazslevai-htec please build a new container according to the recipe provided above

matthew-frank commented 1 month ago

Note that the only change between https://github.com/NVIDIA/NeMo.git@v2.0.0.rc0.beta and git@github.com:ggruza/NeMo.git@v2.0.0.rc0.beta_modified is to pin the versions in the requirements/requirements_nlp.txt file:

$ diff -r NeMo/ ggruza-NeMo/
diff -r NeMo/requirements/requirements_nlp.txt ggruza-NeMo/requirements/requirements_nlp.txt
1,12c1,12
< boto3
< einops
< faiss-cpu
< fasttext
< flask_restful
< ftfy
< gdown
< h5py
< ijson
< jieba
< markdown2
< matplotlib>=3.3.2
---
> boto3==1.34.104
> einops==0.7.0
> faiss-cpu==1.8.0
> fasttext==0.9.2
> flask_restful==0.3.10
> ftfy==6.2.0
> gdown==5.2.0
> h5py==3.11.0
> ijson==3.2.3
> jieba==0.42.1
> markdown2==2.4.13
> matplotlib==3.8.4
14,22c14,22
< nltk>=3.6.5
< opencc<1.1.7
< pangu
< rapidfuzz
< rouge_score
< sacrebleu  # manually install sacrebleu[ja] for Japanese support; MeCab is unsupported in Python 3.11+
< sentence_transformers
< tensorstore<0.1.46
< zarr
---
> nltk==3.8.1
> opencc==1.1.6
> pangu==4.0.6.1
> rapidfuzz==3.9.0
> rouge_score==0.1.2
> sacrebleu==2.4.2
> sentence_transformers==2.7.0
> tensorstore==0.1.45
> zarr==2.18.0

balazslevai-htec commented 1 month ago

Hi @matthew-frank and @mmarcinkiewicz,

Thank you for the support. Regarding the error message "/usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system": /usr/bin/ld is the C++ linker, and the actual permission issue is in NeMo/nemo/collections/nlp/data/language_modeling/megatron, which has no write permission after cloning; the training tries to compile helpers.cpp there at runtime. I added that compilation step to the Dockerfile, so it's not a problem anymore.
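
For anyone else hitting the same linker error, a sketch of that Dockerfile workaround; the g++ invocation is copied from the log above, while the in-image NeMo path is an assumption and may need adjusting:

# Pre-compile the Megatron dataset helpers at build time so nothing has to be
# written into the (read-only) NeMo source tree at run time.
# Path assumed; adjust to wherever NeMo is cloned in your image.
RUN cd /workspace/ft-llm/NeMo/nemo/collections/nlp/data/language_modeling/megatron && \
    g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color \
        -I/usr/include/python3.10 \
        -I/usr/local/lib/python3.10/dist-packages/pybind11/include \
        helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so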

Besides the above, I followed the docker recipe modifications to the letter but received the same error message, only in a different format:

 0: attention.py 2399 forward
 0: out_fp8, aux_ctx_tensors = fused_attn_fwd(
 0: 
 0: fused_attn.py 853 fused_attn_fwd
 0: output_tensors = tex.fused_attn_fwd(
 0: 
 0: RuntimeError:
 0: /workspace/ft-llm/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu:2066 in function operator(): cuDNN Error: Tensor 'sdpa_fp8::Amax_O' strides not set.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

The complete log is log-236.txt
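
As the RuntimeError itself suggests, cuDNN error logging can be enabled for the failing run to get more detail; a minimal sketch of the two environment variables, exported before launching training:

# Enable verbose cuDNN error logging, as suggested by the RuntimeError above.
export CUDNN_LOGERR_DBG=1
export CUDNN_LOGDEST_DBG=stderr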

mmarcinkiewicz commented 1 month ago

Can you dump printenv and attach it as a file?
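
A sketch of one way to grab it from inside the same container environment the job uses (the image name is a placeholder; mirror whatever flags the failing srun command already passes):

# Dump the in-container environment to a file for comparison.
srun --nodes=1 --ntasks=1 \
     --container-image=./mlperf_llama2_lora.sqsh \
     printenv > env_output.txt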

nehmathe commented 1 month ago

Hi @mmarcinkiewicz,

Here is the printenv dump: env_output.txt

Thanks.

mrmhodak commented 1 month ago

Any suggestions, @matthew-frank @mmarcinkiewicz?

mmarcinkiewicz commented 1 month ago

I don't see anything suspicious. Is there a way you can share the container with us? Either push it to Docker Hub or share it as a sqsh file?

Also, a random idea - does your node have python installed? Sometimes enroot, for some reason, uses the host's python instead of the container's. Adding --no-container-mount-home to your srun command sometimes helps.
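
For the sqsh route and the srun flag, a minimal sketch (image names are placeholders; enroot import reads directly from the local docker daemon):

# Export the local docker image to a squashfs file that enroot/pyxis can consume.
enroot import -o mlperf_llama2_lora.sqsh dockerd://mlperf-llama2-lora:repro

# Re-run with the home directory not mounted into the container,
# so the container's own python is the one that gets used.
srun --container-image=./mlperf_llama2_lora.sqsh \
     --no-container-mount-home \
     ./run_and_time.sh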

mrmhodak commented 1 month ago

@mmarcinkiewicz: I have sent the container location to @ShriyaPalsamudram over email - we do not want to share it publicly and I do not have your email. Please share it internally and let us know.

mrmhodak commented 1 month ago

@mmarcinkiewicz : Any update?

mmarcinkiewicz commented 1 month ago

We were able to repro. Trying to understand what the difference is.

matthew-frank commented 1 month ago

Those "gdrcopy open failed" lines are really suspicious. I have no idea where those are coming from.

zhenghuanbo commented 1 month ago

@mmarcinkiewicz: I get the same error: /workspace/ft-llm/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu:2066 in function operator(): cuDNN Error: Tensor 'sdpa_fp8::Amax_O' strides not set.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

New issue: https://github.com/mlcommons/training_results_v4.0/issues/6

mmarcinkiewicz commented 1 month ago

@mrmhodak @blevai @zhenghuanbo it seems that TE had submodules that had not been frozen. Here's the recipe to fix it. Please modify the TE install block in the Dockerfile to the following:

ARG TE_REVISION=v1.6rc2
ENV CUSTOM_TE_REVISION ${TE_REVISION}

ARG CUDNN_FRONTEND_REVISION=1b0b5eac540b7f8fd19b18f1e6b8427c95503348
ENV CUSTOM_CUDNN_FRONTEND_REVISION ${CUDNN_FRONTEND_REVISION}

ARG GTEST_REVISION=f8d7d77c06936315286eb55f8de22cd23c188571
ENV CUSTOM_GTEST_REVISION ${GTEST_REVISION}

RUN if [ "${TE_REVISION}" != SKIP ]; then \
      git clone https://github.com/NVIDIA/TransformerEngine.git && \
      cd TransformerEngine && \
      git submodule init && git submodule update && \
      echo TE_REVISION=${TE_REVISION} && \
      git checkout ${CUSTOM_TE_REVISION} && \
      # Checkout specific commit for cudnn-frontend submodule
      cd 3rdparty/cudnn-frontend && \
      git checkout ${CUSTOM_CUDNN_FRONTEND_REVISION} && \
      echo CUDNN_FRONTEND_COMMIT_HASH=$(git rev-parse HEAD) && \
      cd - && \
      # Checkout specific commit for googletest submodule
      cd 3rdparty/googletest && \
      git checkout ${CUSTOM_GTEST_REVISION} && \
      echo GTEST_COMMIT_HASH=$(git rev-parse HEAD) && \
      cd - && \
      echo TE_COMMIT_HASH=$(git rev-parse HEAD) && \
      NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install --force-reinstall --no-deps . \
    ; fi

Please retry.
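
After rebuilding, a quick sanity check that the pins took effect; the TransformerEngine path is taken from the traceback earlier in this thread and may differ in other builds:

# Inside the rebuilt container: both commands should print the pinned hashes
# from the Dockerfile block above.
git -C /workspace/ft-llm/TransformerEngine/3rdparty/cudnn-frontend rev-parse HEAD
# expected: 1b0b5eac540b7f8fd19b18f1e6b8427c95503348
git -C /workspace/ft-llm/TransformerEngine/3rdparty/googletest rev-parse HEAD
# expected: f8d7d77c06936315286eb55f8de22cd23c188571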

mrmhodak commented 1 month ago

@mmarcinkiewicz : This works for us - thanks!

One more thing: @matthew-frank pointed out that we still see errors with "gdrcopy open failed". We have gdrcopy installed - any idea what is going on there?
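
On the gdrcopy side, a few basic host-level checks that the library has a working gdrdrv kernel module and device node to talk to (a sketch; "gdrcopy open failed" typically means /dev/gdrdrv could not be opened even though the userspace library is installed):

# Run on the host, and make sure the device is also visible inside the container.
lsmod | grep gdrdrv      # is the gdrcopy kernel module loaded?
ls -l /dev/gdrdrv        # is the device node present and accessible?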

zhenghuanbo commented 1 month ago

@mmarcinkiewicz Thank you very much, the error is resolved.