mlcommons / training_results_v3.1

This repository contains the results and code for the MLPerf™ Training v3.1 benchmark.
https://mlcommons.org/benchmarks/training
Apache License 2.0

Docker build issue in NVIDIA DLRM DCNv2 #3

Open rgandikota opened 11 months ago

rgandikota commented 11 months ago

Benchmark

High-level error message

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for mpi4py
Successfully built mlperf-logging mlperf-common
Failed to build mpi4py
ERROR: Could not build wheels for mpi4py, which is required to install pyproject.toml-based projects

Environment

Docker build logs:

DEPRECATED: The legacy builder is deprecated and will be removed in a future release. Install the buildx component to build images with BuildKit: https://docs.docker.com/go/buildx/

Sending build context to Docker daemon 89.6kB Step 1/24 : ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.09-py3 Step 2/24 : FROM ${FROM_IMAGE_NAME} ---> c61ed1549935 Step 3/24 : ARG SM="80;90" ---> Using cache ---> a6af93610cc5 Step 4/24 : ARG ENABLE_MULTINODES=ON ---> Using cache ---> 9a4c20343726 Step 5/24 : ARG HWLOC_VERSION=2.4.1 ---> Using cache ---> d22102c266d5 Step 6/24 : ARG RELEASE=true ---> Using cache ---> 9f3aa7cdc8a9 Step 7/24 : RUN apt-get update -y && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends clang-format libboost-serialization-dev libtbb-dev libaio-dev libgflags-dev zlib1g-dev libbz2-dev libsnappy-dev liblz4-dev libzstd-dev zlib1g-dev libzstd-dev libssl-dev libsasl2-dev && rm -rf /var/lib/apt/lists/* ---> Using cache ---> 9938695d9093 Step 8/24 : ENV PATH=/usr/local/bin:$PATH ---> Using cache ---> 361beb4a8cd3 Step 9/24 : RUN cd /opt/hpcx/ompi/include/openmpi/opal/mca/hwloc/hwloc201 && rm -rfv hwloc201.h hwloc/include/hwloc.h ---> Using cache ---> ff9b56a37f54 Step 10/24 : RUN mkdir -p /var/tmp && wget -q -nc --no-check-certificate -P /var/tmp https://download.open-mpi.org/release/hwloc/v2.4/hwloc-${HWLOC_VERSION}.tar.gz && mkdir -p /var/tmp && tar -x -f /var/tmp/hwloc-${HWLOC_VERSION}.tar.gz -C /var/tmp && cd /var/tmp/hwloc-${HWLOC_VERSION} && ./configure CPPFLAGS="-I/usr/local/cuda/include/ -L/usr/local/cuda/lib64/" LDFLAGS="-L/usr/local/cuda/lib64" --enable-cuda && make -j$(nproc) && make install && rm -rf /var/tmp/hwloc-${HWLOC_VERSION} /var/tmp/hwloc-${HWLOC_VERSION}.tar.gz ---> Using cache ---> 12f9044dd0fe Step 11/24 : ENV CPATH=/usr/local/include:$CPATH ---> Using cache ---> f23ea8b4ae84 Step 12/24 : ENV NCCL_LAUNCH_MODE=PARALLEL ---> Using cache ---> 33ebca8d749d Step 13/24 : ENV SHARP_COLL_NUM_COLL_GROUP_RESOURCE_ALLOC_THRESHOLD=0 SHARP_COLL_LOCK_ON_COMM_INIT=1 SHARP_COLL_LOG_LEVEL=3 HCOLL_ENABLE_MCAST=0 ---> Using cache ---> 235a7d075f60 Step 14/24 : RUN ln -s 
/usr/lib/x86_64-linux-gnu/libibverbs.so.1.14.39.0 /usr/lib/x86_64-linux-gnu/libibverbs.so ---> Using cache ---> 391154037cc3 Step 15/24 : WORKDIR /workspace/dlrm ---> Using cache ---> 0ec34207b58c Step 16/24 : COPY . . ---> Using cache ---> 16d7e048986a Step 17/24 : RUN pip3 install --no-cache-dir -r requirements.txt ---> Running in cac77ff3f392 Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Collecting git+https://github.com/mlcommons/logging.git@3.1.0-rc1 (from -r requirements.txt (line 1)) Cloning https://github.com/mlcommons/logging.git (to revision 3.1.0-rc1) to /tmp/pip-req-build-llz7jws6 Running command git clone --filter=blob:none --quiet https://github.com/mlcommons/logging.git /tmp/pip-req-build-llz7jws6 Running command git checkout -q b32424904879020a47c8d9813b439e4e3017f8d5 Resolved https://github.com/mlcommons/logging.git to commit b32424904879020a47c8d9813b439e4e3017f8d5 Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'done' Collecting git+https://github.com/NVIDIA/mlperf-common.git (from -r requirements.txt (line 2)) Cloning https://github.com/NVIDIA/mlperf-common.git to /tmp/pip-req-build-wn305jtb Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/mlperf-common.git /tmp/pip-req-build-wn305jtb Resolved https://github.com/NVIDIA/mlperf-common.git to commit 779c29968d9dd08feaa099bf916439558a62a45c Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'done' Collecting mpi4py (from -r requirements.txt (line 3)) Downloading mpi4py-3.1.5.tar.gz (2.5 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.5/2.5 MB 9.9 MB/s eta 0:00:00 Installing build dependencies: started Installing build dependencies: finished with status 'done' Getting requirements to build wheel: started Getting requirements to build wheel: finished with status 'done' Preparing metadata (pyproject.toml): started Preparing metadata (pyproject.toml): finished with 
status 'done' Requirement already satisfied: pandas>=1.0 in /usr/local/lib/python3.10/dist-packages (from mlperf-logging==3.0.0->-r requirements.txt (line 1)) (1.5.3) Requirement already satisfied: pyyaml>=5.4.1 in /usr/local/lib/python3.10/dist-packages (from mlperf-logging==3.0.0->-r requirements.txt (line 1)) (6.0.1) Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from mlperf-logging==3.0.0->-r requirements.txt (line 1)) (1.22.2) Requirement already satisfied: scipy>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from mlperf-logging==3.0.0->-r requirements.txt (line 1)) (1.11.1) Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0->mlperf-logging==3.0.0->-r requirements.txt (line 1)) (2.8.2) Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0->mlperf-logging==3.0.0->-r requirements.txt (line 1)) (2023.3) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas>=1.0->mlperf-logging==3.0.0->-r requirements.txt (line 1)) (1.16.0) Building wheels for collected packages: mlperf-logging, mlperf-common, mpi4py Building wheel for mlperf-logging (setup.py): started Building wheel for mlperf-logging (setup.py): finished with status 'done' Created wheel for mlperf-logging: filename=mlperf_logging-3.0.0-py3-none-any.whl size=238649 sha256=501be8dee5fba47f9b21d6bdcebba7c219f6afa00b4df9a85ba6735cda9847e0 Stored in directory: /tmp/pip-ephem-wheel-cache-zeu1_sv9/wheels/28/99/ec/54d1122b8daf8ece8026fcc2d28ef65d12ca3cb461d325fd30 Building wheel for mlperf-common (setup.py): started Building wheel for mlperf-common (setup.py): finished with status 'done' Created wheel for mlperf-common: filename=mlperf_common-0.3-py3-none-any.whl size=23720 sha256=79a57fb9ab91667b17c96500ebba5771375e0e4ee43b0bf754e46cecf77de6b5 Stored in directory: 
/tmp/pip-ephem-wheel-cache-zeu1_sv9/wheels/9b/bb/32/dd53ce122fd18a798e25c0afba97467ffb555bde95bc40cad1 Building wheel for mpi4py (pyproject.toml): started Building wheel for mpi4py (pyproject.toml): finished with status 'error' error: subprocess-exited-with-error

× Building wheel for mpi4py (pyproject.toml) did not run successfully. │ exit code: 1 ╰─> [263 lines of output] running bdist_wheel running build running build_src running build_py creating build creating build/lib.linux-x86_64-3.10 creating build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/run.py -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/main.py -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/init.py -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/bench.py -> build/lib.linux-x86_64-3.10/mpi4py creating build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/pool.py -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/aplus.py -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/main.py -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/_base.py -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/init.py -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/_core.py -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/_lib.py -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/server.py -> build/lib.linux-x86_64-3.10/mpi4py/futures creating build/lib.linux-x86_64-3.10/mpi4py/util copying src/mpi4py/util/init.py -> build/lib.linux-x86_64-3.10/mpi4py/util copying src/mpi4py/util/pkl5.py -> build/lib.linux-x86_64-3.10/mpi4py/util copying src/mpi4py/util/dtlib.py -> build/lib.linux-x86_64-3.10/mpi4py/util copying src/mpi4py/dl.pyi -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/run.pyi -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/main.pyi -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/init.pyi -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/MPI.pyi -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/bench.pyi -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/py.typed -> build/lib.linux-x86_64-3.10/mpi4py copying 
src/mpi4py/libmpi.pxd -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/init.pxd -> build/lib.linux-x86_64-3.10/mpi4py copying src/mpi4py/MPI.pxd -> build/lib.linux-x86_64-3.10/mpi4py creating build/lib.linux-x86_64-3.10/mpi4py/include creating build/lib.linux-x86_64-3.10/mpi4py/include/mpi4py copying src/mpi4py/include/mpi4py/mpi4py.h -> build/lib.linux-x86_64-3.10/mpi4py/include/mpi4py copying src/mpi4py/include/mpi4py/mpi4py.MPI.h -> build/lib.linux-x86_64-3.10/mpi4py/include/mpi4py copying src/mpi4py/include/mpi4py/mpi4py.MPI_api.h -> build/lib.linux-x86_64-3.10/mpi4py/include/mpi4py copying src/mpi4py/include/mpi4py/mpi4py.i -> build/lib.linux-x86_64-3.10/mpi4py/include/mpi4py copying src/mpi4py/include/mpi4py/mpi.pxi -> build/lib.linux-x86_64-3.10/mpi4py/include/mpi4py copying src/mpi4py/futures/server.pyi -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/pool.pyi -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/main.pyi -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/aplus.pyi -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/init.pyi -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/_lib.pyi -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/futures/_core.pyi -> build/lib.linux-x86_64-3.10/mpi4py/futures copying src/mpi4py/util/pkl5.pyi -> build/lib.linux-x86_64-3.10/mpi4py/util copying src/mpi4py/util/dtlib.pyi -> build/lib.linux-x86_64-3.10/mpi4py/util copying src/mpi4py/util/init.pyi -> build/lib.linux-x86_64-3.10/mpi4py/util running build_clib MPI configuration: [mpi] from 'mpi.cfg' MPI C compiler: /usr/local/mpi/bin/mpicc MPI C++ compiler: /usr/local/mpi/bin/mpicxx MPI F compiler: /usr/local/mpi/bin/mpifort MPI F90 compiler: /usr/local/mpi/bin/mpif90 MPI F77 compiler: /usr/local/mpi/bin/mpif77 checking for library 'lmpe' ... 
/usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -llmpe -o _configtest /usr/bin/ld: cannot find -llmpe: No such file or directory collect2: error: ld returned 1 exit status failure. removing: _configtest.c _configtest.o building 'mpe' dylib library creating build/temp.linux-x86_64-3.10 creating build/temp.linux-x86_64-3.10/src creating build/temp.linux-x86_64-3.10/src/lib-pmpi /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c src/lib-pmpi/mpe.c -o build/temp.linux-x86_64-3.10/src/lib-pmpi/mpe.o creating build/lib.linux-x86_64-3.10/mpi4py/lib-pmpi /usr/local/mpi/bin/mpicc -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 -Wl,-Bsymbolic-functions -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--no-as-needed build/temp.linux-x86_64-3.10/src/lib-pmpi/mpe.o -o build/lib.linux-x86_64-3.10/mpi4py/lib-pmpi/libmpe.so checking for library 'vt-mpi' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -lvt-mpi -o _configtest /usr/bin/ld: cannot find -lvt-mpi: No such file or directory collect2: error: ld returned 1 exit status failure. 
removing: _configtest.c _configtest.o checking for library 'vt.mpi' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -lvt.mpi -o _configtest /usr/bin/ld: cannot find -lvt.mpi: No such file or directory collect2: error: ld returned 1 exit status failure. removing: _configtest.c _configtest.o building 'vt' dylib library /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c src/lib-pmpi/vt.c -o build/temp.linux-x86_64-3.10/src/lib-pmpi/vt.o /usr/local/mpi/bin/mpicc -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 -Wl,-Bsymbolic-functions -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--no-as-needed build/temp.linux-x86_64-3.10/src/lib-pmpi/vt.o -o build/lib.linux-x86_64-3.10/mpi4py/lib-pmpi/libvt.so checking for library 'vt-mpi' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -lvt-mpi -o _configtest /usr/bin/ld: cannot find -lvt-mpi: No such file or directory collect2: error: ld returned 1 exit status failure. removing: _configtest.c _configtest.o checking for library 'vt.mpi' ... 
/usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -lvt.mpi -o _configtest /usr/bin/ld: cannot find -lvt.mpi: No such file or directory collect2: error: ld returned 1 exit status failure. removing: _configtest.c _configtest.o building 'vt-mpi' dylib library /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c src/lib-pmpi/vt-mpi.c -o build/temp.linux-x86_64-3.10/src/lib-pmpi/vt-mpi.o /usr/local/mpi/bin/mpicc -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 -Wl,-Bsymbolic-functions -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--no-as-needed build/temp.linux-x86_64-3.10/src/lib-pmpi/vt-mpi.o -o build/lib.linux-x86_64-3.10/mpi4py/lib-pmpi/libvt-mpi.so checking for library 'vt-hyb' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -lvt-hyb -o _configtest /usr/bin/ld: cannot find -lvt-hyb: No such file or directory collect2: error: ld returned 1 exit status failure. removing: _configtest.c _configtest.o checking for library 'vt.ompi' ... 
/usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -lvt.ompi -o _configtest /usr/bin/ld: cannot find -lvt.ompi: No such file or directory collect2: error: ld returned 1 exit status failure. removing: _configtest.c _configtest.o building 'vt-hyb' dylib library /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c src/lib-pmpi/vt-hyb.c -o build/temp.linux-x86_64-3.10/src/lib-pmpi/vt-hyb.o /usr/local/mpi/bin/mpicc -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 -Wl,-Bsymbolic-functions -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--no-as-needed build/temp.linux-x86_64-3.10/src/lib-pmpi/vt-hyb.o -o build/lib.linux-x86_64-3.10/mpi4py/lib-pmpi/libvt-hyb.so running build_ext MPI configuration: [mpi] from 'mpi.cfg' MPI C compiler: /usr/local/mpi/bin/mpicc MPI C++ compiler: /usr/local/mpi/bin/mpicxx MPI F compiler: /usr/local/mpi/bin/mpifort MPI F90 compiler: /usr/local/mpi/bin/mpif90 MPI F77 compiler: /usr/local/mpi/bin/mpif77 checking for dlopen() availability ... checking for header 'dlfcn.h' ... x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o success! removing: _configtest.c _configtest.o success! 
checking for library 'dl' ... x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o x86_64-linux-gnu-gcc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -ldl -o _configtest success! removing: _configtest.c _configtest.o _configtest checking for function 'dlopen' ... x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o x86_64-linux-gnu-gcc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -ldl -o _configtest success! removing: _configtest.c _configtest.o _configtest building 'mpi4py.dl' extension x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DHAVE_DLFCN_H=1 -DHAVE_DLOPEN=1 -I/usr/include/python3.10 -c src/dynload.c -o build/temp.linux-x86_64-3.10/src/dynload.o x86_64-linux-gnu-gcc -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 -Wl,-Bsymbolic-functions -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.10/src/dynload.o -Lbuild/temp.linux-x86_64-3.10 -ldl -o build/lib.linux-x86_64-3.10/mpi4py/dl.cpython-310-x86_64-linux-gnu.so checking for MPI compile and link ... 
/usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o success! removing: _configtest.c _configtest.o /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -o _configtest success! removing: _configtest.c _configtest.o _configtest checking for missing MPI functions/symbols ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o success! removing: _configtest.c _configtest.o checking for function 'MPI_Type_create_f90_integer' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -o _configtest success! removing: _configtest.c _configtest.o _configtest checking for function 'MPI_Type_create_f90_real' ... 
/usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -o _configtest success! removing: _configtest.c _configtest.o _configtest checking for function 'MPI_Type_create_f90_complex' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -o _configtest success! removing: _configtest.c _configtest.o _configtest checking for function 'MPI_Status_c2f' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -o _configtest success! removing: _configtest.c _configtest.o _configtest checking for function 'MPI_Status_f2c' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -o _configtest success! 
removing: _configtest.c _configtest.o _configtest checking for symbol 'MPI_LB' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -o _configtest success! removing: _configtest.c _configtest.o _configtest checking for symbol 'MPI_UB' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -o _configtest success! removing: _configtest.c _configtest.o _configtest checking for dlopen() availability ... checking for header 'dlfcn.h' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o success! removing: _configtest.c _configtest.o success! checking for library 'dl' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -ldl -o _configtest success! 
removing: _configtest.c _configtest.o _configtest checking for function 'dlopen' ... /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c _configtest.c -o _configtest.o /usr/local/mpi/bin/mpicc _configtest.o -Lbuild/temp.linux-x86_64-3.10 -ldl -o _configtest success! removing: _configtest.c _configtest.o _configtest building 'mpi4py.MPI' extension /usr/local/mpi/bin/mpicc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DHAVE_DLFCN_H=1 -DHAVE_DLOPEN=1 -I/usr/include/python3.10 -c src/MPI.c -o build/temp.linux-x86_64-3.10/src/MPI.o /usr/local/mpi/bin/mpicc -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 -Wl,-Bsymbolic-functions -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.10/src/MPI.o -Lbuild/temp.linux-x86_64-3.10 -ldl -o build/lib.linux-x86_64-3.10/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so writing build/lib.linux-x86_64-3.10/mpi4py/mpi.cfg Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in main() File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main json_out['return_val'] = hook(hook_input['kwargs']) File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel return _build_backend().build_wheel(wheel_directory, config_settings, File 
"/tmp/pip-build-env-_s4_k_me/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 404, in build_wheel return self._build_with_temp_dir( File "/tmp/pip-build-env-_s4_k_me/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 389, in _build_with_temp_dir self.run_setup() File "/tmp/pip-build-env-_s4_k_me/overlay/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 311, in run_setup exec(code, locals()) File "", line 644, in File "", line 641, in main File "", line 492, in run_setup File "/tmp/pip-install-v2r6okyl/mpi4py_c6cddf4bfc8c4ebd9b5de61b26253ebf/conf/mpidistutils.py", line 541, in setup return fcn_setup(attrs) File "/tmp/pip-build-env-_s4_k_me/overlay/local/lib/python3.10/dist-packages/setuptools/init.py", line 103, in setup return distutils.core.setup(**attrs) File "/usr/lib/python3.10/distutils/core.py", line 148, in setup dist.run_commands() File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands self.run_command(cmd) File "/tmp/pip-build-env-_s4_k_me/overlay/local/lib/python3.10/dist-packages/setuptools/dist.py", line 963, in run_command super().run_command(command) File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command cmd_obj.run() File "/tmp/pip-build-env-_s4_k_me/overlay/local/lib/python3.10/dist-packages/wheel/bdist_wheel.py", line 370, in run install = self.reinitialize_command("install", reinit_subcommands=True) File "/tmp/pip-build-env-_s4_k_me/overlay/local/lib/python3.10/dist-packages/setuptools/init.py", line 216, in reinitialize_command cmd = _Command.reinitialize_command(self, command, reinit_subcommands) File "/usr/lib/python3.10/distutils/cmd.py", line 305, in reinitialize_command return self.distribution.reinitialize_command(command, File "/usr/lib/python3.10/distutils/dist.py", line 938, in reinitialize_command command = self.get_command_obj(command_name) File "/usr/lib/python3.10/distutils/dist.py", line 858, in get_command_obj cmd_obj = 
self.command_obj[command] = klass(self) File "/tmp/pip-build-env-_s4_k_me/overlay/local/lib/python3.10/dist-packages/setuptools/init.py", line 174, in init super().init(dist) File "/usr/lib/python3.10/distutils/cmd.py", line 62, in init self.initialize_options() File "/tmp/pip-build-env-_s4_k_me/overlay/local/lib/python3.10/dist-packages/setuptools/command/install.py", line 50, in initialize_options orig.install.initialize_options(self) File "/usr/lib/python3.10/_distutils_system_mod.py", line 33, in initialize_options super().initialize_options() TypeError: super(type, obj): obj must be an instance or subtype of type [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for mpi4py
Successfully built mlperf-logging mlperf-common
Failed to build mpi4py
ERROR: Could not build wheels for mpi4py, which is required to install pyproject.toml-based projects

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: python -m pip install --upgrade pip
The command '/bin/sh -c pip3 install --no-cache-dir -r requirements.txt' returned a non-zero code: 1
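In a log of this length the useful signal is the package that failed and the last Python exception raised. The following is a minimal sketch for pulling those two items out of a saved build log. It assumes the log is available as plain text; the function name `summarize_pip_failure` is hypothetical and not part of pip or Docker.

```python
import re


def summarize_pip_failure(log_text):
    """Return the packages pip failed to build and the last Python exception line."""
    failed = []
    # pip reports each failed wheel as "Failed building wheel for <name>".
    for match in re.finditer(r"Failed building wheel for (\S+)", log_text):
        name = match.group(1)
        if name not in failed:
            failed.append(name)
    # Exception lines look like "SomeError: message"; stop at a newline or at
    # pip's "[end of output]" marker, and keep the last one seen.
    exceptions = re.findall(r"\b([A-Z]\w*(?:Error|Exception): [^\n\[]+)", log_text)
    last_exception = exceptions[-1].strip() if exceptions else None
    return {"failed_packages": failed, "last_exception": last_exception}
```

Run against the log above, this should surface `mpi4py` as the failed package and the `TypeError` raised from `_distutils_system_mod.py` as the last exception. The linker messages such as `cannot find -llmpe` come from configure-time probes and are followed by successful builds of the corresponding libraries.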

jndinesh commented 11 months ago

We also attempted to build DLRM using nvcr.io/nvidia/pytorch:23.10-py3 as the base image, but the build failed as well.

After reviewing other submissions, we tried the [nvcr.io/nvdlfwea/pytorch:23.09-py3](http://nvcr.io/nvdlfwea/pytorch:23.09-py3) variant used in the Supermicro submission, but unfortunately we do not have access to it. Here is the URL.

Could someone help determine whether this issue is related to a base-layer override? If so, could you please point us to the correct image?
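When comparing base images, it can help to record what each one provides for building Python packages. Below is a minimal sketch that could be run inside a container started from each candidate image. The function name `distutils_diagnostics` is hypothetical, and the idea that these values matter for this failure is an assumption based only on the traceback, which passes through setuptools and `/usr/lib/python3.10/_distutils_system_mod.py`.

```python
import importlib.util
import os


def distutils_diagnostics():
    """Collect facts about the setuptools/distutils setup of this interpreter."""
    try:
        import setuptools
        setuptools_version = setuptools.__version__
    except ImportError:
        setuptools_version = None
    return {
        "setuptools_version": setuptools_version,
        # This module appears in the traceback above; check whether the image has it.
        "has_distutils_system_mod": importlib.util.find_spec("_distutils_system_mod") is not None,
        # setuptools consults this variable when choosing which distutils to use.
        "SETUPTOOLS_USE_DISTUTILS": os.environ.get("SETUPTOOLS_USE_DISTUTILS"),
    }


if __name__ == "__main__":
    for key, value in distutils_diagnostics().items():
        print(f"{key}: {value}")
```

Note that pip built mpi4py in an isolated environment (the `/tmp/pip-build-env-…/overlay` paths in the traceback), so the setuptools version reported here may differ from the one that actually ran during the failed build.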

jndinesh commented 11 months ago

Thanks very much, Shirya. Verified locally. The changes look good.