mrmhodak opened this issue 2 months ago
Are you using slurm+enroot+pyxis? Or docker?
EDIT: I see "docker exec". You need slurm+enroot+pyxis to enable TP_COMM_OVERLAP and get the expected performance.
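For reference, the intended launch path looks roughly like this (a sketch only; the config filename, paths, and image name below are placeholders, and the exact variables are documented in the benchmark README):

```bash
# Minimal sketch of a slurm+enroot+pyxis launch (placeholder paths and names):
export CONT=/path/to/mlperf-nvidia+llama2_70b_lora.sqsh   # enroot image built from the Dockerfile
export DATADIR=/path/to/dataset                           # preprocessed dataset
export LOGDIR=/path/to/results                            # where result logs are written
source config_DGXH100_1x8x1xtp4pp1cp1.sh                  # illustrative config name - use the one shipped with the submission
sbatch -N "${DGXNNODES}" -t "${WALLTIME}" run.sub         # run.sub starts the container via pyxis
```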
How did you even get the docker run script? We deprecated that file and did not include it in our submission. Did you dig it up from one of our previous submissions?
Hi, we have a single-node system, so we looked at Dell's submission, which has directions for single-node execution using docker: https://github.com/mlcommons/training_results_v4.0/tree/main/Dell/benchmarks/llama2_70b_lora/implementations/pytorch
My assumption is that they worked with Nvidia on it.
Ok, I see. Dell also ran with the overlap off, and it indeed cost them ~2 minutes.
We are pretty busy with our new submission, so I don't think we have the time to chase this down this late. Two quick questions:
The docker approach won't work. The docker scripts in the Dell 4.0 submission are most likely leftover code from a failed attempt to bypass the slurm setup and run the same training code under docker instead.
However, when we tried to run the submitted code with slurm (slurm+enroot+pyxis), we ran into some permission issues.
@mmarcinkiewicz, could you please check the logs? Maybe we made some trivial mistake.
Hi @mmarcinkiewicz,
So, to be able to give slurm + enroot + pyxis a try, we had to make some changes to the submission:
After these modifications, we ran into another Python error that we cannot figure out.
Could you please take a look at it?
Here's how to build a working container (tested on our side):
Replace the upstream NeMo clone at https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/Dockerfile#L25
RUN git clone https://github.com/NVIDIA/NeMo.git && \
cd NeMo && \
echo NEMO_REVISION=${NEMO_REVISION} && \
git checkout ${NEMO_REVISION} && \
echo NEMO_COMMIT_HASH=$(git rev-parse HEAD) && \
pip install --no-build-isolation -e ".[nlp]"
with
RUN git clone https://github.com/ggruza/NeMo.git && \
cd NeMo && \
echo NEMO_REVISION=${NEMO_REVISION} && \
git checkout v2.0.0.rc0.beta_modified && \
pip install --no-build-isolation -e ".[nlp]"
(please mind that the fork won't be there forever)
Also, please go to https://github.com/mlcommons/training_results_v4.0/blob/main/NVIDIA/benchmarks/llama2_70b_lora/implementations/nemo/requirements.txt and add
botocore==1.34.104
datasets==2.19.1
huggingface-hub==0.23.0
inflect==7.2.1
more-itertools==10.2.0
numcodecs==0.12.1
portalocker==2.8.2
pretty-errors==1.2.25
pytorch-lightning==2.2.4
requests==2.31.0
s3transfer==0.10.1
safetensors==0.4.3
sentry-sdk==2.1.1
torchmetrics==1.4.0
tqdm==4.66.2
transformers==4.40.2
typeguard==4.2.1
wandb==0.17.0
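With both changes in place, rebuilding the image and converting it for enroot is roughly the following (the image tag and output path are placeholders):

```bash
# Build the patched image from the benchmark's implementation directory and
# convert it into a squashfs file that enroot/pyxis can run:
docker build -t mlperf-nvidia:llama2_70b_lora .
enroot import --output ./mlperf-nvidia+llama2_70b_lora.sqsh \
    dockerd://mlperf-nvidia:llama2_70b_lora
```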
Sorry, I know it's a hassle; we've fixed that in 4.1.
@blevai there's
0: g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
0: /usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system
0: collect2: error: ld returned 1 exit status
in your log. Can you make /usr/bin writable?
@balazslevai-htec please build a new container according to the recipe provided above
Note that the only change between https://github.com/NVIDIA/NeMo.git@v2.0.0.rc0.beta and git@github.com:ggruza/NeMo.git@v2.0.0.rc0.beta_modified is to pin the versions in the requirements/requirements_nlp.txt file:
$ diff -r NeMo/ ggruza-NeMo/
diff -r NeMo/requirements/requirements_nlp.txt ggruza-NeMo/requirements/requirements_nlp.txt
1,12c1,12
< boto3
< einops
< faiss-cpu
< fasttext
< flask_restful
< ftfy
< gdown
< h5py
< ijson
< jieba
< markdown2
< matplotlib>=3.3.2
---
> boto3==1.34.104
> einops==0.7.0
> faiss-cpu==1.8.0
> fasttext==0.9.2
> flask_restful==0.3.10
> ftfy==6.2.0
> gdown==5.2.0
> h5py==3.11.0
> ijson==3.2.3
> jieba==0.42.1
> markdown2==2.4.13
> matplotlib==3.8.4
14,22c14,22
< nltk>=3.6.5
< opencc<1.1.7
< pangu
< rapidfuzz
< rouge_score
< sacrebleu # manually install sacrebleu[ja] for Japanese support; MeCab is unsupported in Python 3.11+
< sentence_transformers
< tensorstore<0.1.46
< zarr
---
> nltk==3.8.1
> opencc==1.1.6
> pangu==4.0.6.1
> rapidfuzz==3.9.0
> rouge_score==0.1.2
> sacrebleu==2.4.2
> sentence_transformers==2.7.0
> tensorstore==0.1.45
> zarr==2.18.0
Hi @matthew-frank and @mmarcinkiewicz,
thank you for the support. Regarding the error message "/usr/bin/ld: cannot open output file helpers.cpython-310-x86_64-linux-gnu.so: Read-only file system": /usr/bin/ld is the C++ linker; the actual permission issue is in NeMo/nemo/collections/nlp/data/language_modeling/megatron, which has no write permission after cloning and is where the training tries to compile helpers.cpp at runtime. Anyway, I added this compilation to the Dockerfile, so that's not a problem anymore.
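Something along these lines in the Dockerfile is enough (a sketch; the g++ invocation is the one from the log above, and the path assumes the NeMo clone location used by the Dockerfile):

```dockerfile
# Build the megatron dataset helpers at image-build time so nothing has to be
# written into the (read-only at runtime) NeMo source tree. The g++ command is
# the one from the failing log; the NeMo path assumes the clone in the Dockerfile.
RUN cd NeMo/nemo/collections/nlp/data/language_modeling/megatron && \
    g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color \
        -I/usr/include/python3.10 \
        -I/usr/local/lib/python3.10/dist-packages/pybind11/include \
        helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
```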
Besides the above, I followed the docker recipe modifications to the letter but received the same error message, only in a different format:
0: attention.py 2399 forward
0: out_fp8, aux_ctx_tensors = fused_attn_fwd(
0:
0: fused_attn.py 853 fused_attn_fwd
0: output_tensors = tex.fused_attn_fwd(
0:
0: RuntimeError:
0: /workspace/ft-llm/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu:2066 in function operator(): cuDNN Error: Tensor 'sdpa_fp8::Amax_O' strides not set.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
The complete log is log-236.txt
Can you dump printenv and attach as a file?
Any suggestions, @matthew-frank @mmarcinkiewicz?
I don't see anything suspicious. Is there a way you can share the container with us? Either push to dockerhub or as a sqsh file?
Also, a random idea: does your node have Python installed? Sometimes enroot, for some reason, uses the host's Python instead of the one in the container. Adding --no-container-mount-home to your srun command sometimes helps.
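For example (the sqsh path below is a placeholder; "which python3" is just a quick way to see which interpreter the job actually picks up):

```bash
# Quick check of which Python the containerized job sees (placeholder sqsh path):
srun -N1 \
     --container-image=./mlperf-nvidia+llama2_70b_lora.sqsh \
     --no-container-mount-home \
     which python3
```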
@mmarcinkiewicz: I have sent a container location to @ShriyaPalsamudram over email - we do not want to share it publicly, and I do not have your email. Please share it internally and let us know.
@mmarcinkiewicz: Any update?
We were able to repro. We're trying to understand what the difference is.
Those "gdrcopy open failed" lines are really suspicious. I have no idea where those are coming from.
@mmarcinkiewicz: I get the same error:
/workspace/ft-llm/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu:2066 in function operator(): cuDNN Error: Tensor 'sdpa_fp8::Amax_O' strides not set.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
New issue: https://github.com/mlcommons/training_results_v4.0/issues/6
@mrmhodak @blevai @zhenghuanbo it seems that TE had submodules that had not been frozen. Here's the recipe to fix it. Please modify the TE install block in the Dockerfile to the following:
ARG TE_REVISION=v1.6rc2
ENV CUSTOM_TE_REVISION ${TE_REVISION}
ARG CUDNN_FRONTEND_REVISION=1b0b5eac540b7f8fd19b18f1e6b8427c95503348
ENV CUSTOM_CUDNN_FRONTEND_REVISION ${CUDNN_FRONTEND_REVISION}
ARG GTEST_REVISION=f8d7d77c06936315286eb55f8de22cd23c188571
ENV CUSTOM_GTEST_REVISION ${GTEST_REVISION}
RUN if [ "${TE_REVISION}" != SKIP ]; then \
git clone https://github.com/NVIDIA/TransformerEngine.git && \
cd TransformerEngine && \
git submodule init && git submodule update && \
echo TE_REVISION=${TE_REVISION} && \
git checkout ${CUSTOM_TE_REVISION} && \
# Checkout specific commit for cudnn-frontend submodule
cd 3rdparty/cudnn-frontend && \
git checkout ${CUSTOM_CUDNN_FRONTEND_REVISION} && \
echo CUDNN_FRONTEND_COMMIT_HASH=$(git rev-parse HEAD) && \
cd - && \
# Checkout specific commit for googletest submodule
cd 3rdparty/googletest && \
git checkout ${CUSTOM_GTEST_REVISION} && \
echo GTEST_COMMIT_HASH=$(git rev-parse HEAD) && \
cd - && \
echo TE_COMMIT_HASH=$(git rev-parse HEAD) && \
NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install --force-reinstall --no-deps . \
; fi
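Then rebuild the image. The build args default to the values above, so a plain rebuild is enough; shown below with the pins passed explicitly just to make them visible (the image tag is a placeholder):

```bash
# Rebuild with the pinned TE and submodule revisions made explicit:
docker build \
    --build-arg TE_REVISION=v1.6rc2 \
    --build-arg CUDNN_FRONTEND_REVISION=1b0b5eac540b7f8fd19b18f1e6b8427c95503348 \
    --build-arg GTEST_REVISION=f8d7d77c06936315286eb55f8de22cd23c188571 \
    -t mlperf-nvidia:llama2_70b_lora .
```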
please retry
@mmarcinkiewicz: This works for us - thanks!
One more thing @matthew-frank pointed out: we still see errors with "gdrcopy open failed". We have gdrcopy installed - any ideas what is going on there?
@mmarcinkiewicz Thank you very much, the error is resolved.
Hello,
when trying to reproduce Nvidia's results on DGX H100, the code cannot be executed and fails with a segmentation fault - see the attached file.
We have found that the error disappears when TP_COMM_OVERLAP is set to FALSE, but then the run takes about 38 minutes instead of ~28 minutes.
Please help us resolve this. Attached: mpi_error_message_1 2.txt
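For reference, the workaround was applied roughly like this (a sketch; whether an environment override like this is enough depends on how the config and run scripts consume the variable):

```bash
# Workaround sketch: force the tensor-parallel communication overlap off before
# submitting the job (TP_COMM_OVERLAP is the variable referenced above).
export TP_COMM_OVERLAP=False
```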