Open izaitsevfb opened 1 week ago
I logged in to one runner i-0c25b12d27c6cfb13 and it's on AL2023, so it should support Node 20. However, the Nova build job uses a GH container, which needs to be updated. The failing checkout step runs inside the container: https://github.com/pytorch/test-infra/blob/main/.github/workflows/build_wheels_linux.yml#L128
Hmm, why are we running the runner inside the container rather than on the host OS?
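For context, a minimal sketch of why the checkout step ends up inside the container (hypothetical job name and runner label, not the actual Nova workflow): once a job declares `container:`, every step, including `actions/checkout` and the node runtime it needs, executes inside that image, so the image's glibc has to be new enough for Node 20.

```yaml
jobs:
  build:
    runs-on: linux.4xlarge                   # hypothetical self-hosted runner label
    container:
      image: pytorch/manylinux-builder:cpu   # CentOS 7 based image, glibc 2.17
    steps:
      # checkout (and its bundled node) runs inside the container, not on the host AMI
      - uses: actions/checkout@v4
```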
Landed PyTorch PR: https://github.com/pytorch/pytorch/pull/138732 TestInfra PR: https://github.com/pytorch/test-infra/pull/5909
@atalman I don't see aarch64 failures on HUD. Can you please elaborate on what you are referring to?
@malfet it shows up here: https://hud2.pytorch.org/hud/pytorch/vision/nightly/1?per_page=50&mergeLF=true A fix is coming.
@atalman those are part of the Nova workflows, aren't they? I.e., PyTorch CI/CD is totally unaffected by this SEV? (Changing the subject to Domains only, then.)
We're also seeing it in our FBGEMM CI jobs - https://github.com/pytorch/FBGEMM/actions/runs/11843064011/job/33003165717#step:5:28
Hi @q10: the fix for the aarch64 failures was deployed to test-infra; the latest run looks good: https://github.com/pytorch/FBGEMM/actions/runs/11843064011/job/33017996827
Failures still exist for all ROCm jobs: https://github.com/pytorch/vision/actions/runs/11891587010/job/33132656552
See the bash script below, which demonstrates that it's possible to propagate libc-2.31 from the host OS (running Ubuntu 20.04 in my case) into the CentOS 7 based docker container:
# Copy the host's glibc 2.31 and its companion libraries into a directory we can mount into the container
mkdir -p lib-2.31
for lib in libstdc++.so.6 libdl.so.2 libm.so.6 libpthread.so.0 libc.so.6 librt.so.1 ld-2.31.so libdl-2.31.so libstdc++.so.6.0.32 libm-2.31.so libgcc_s.so.1 libpthread-2.31.so libc-2.31.so librt-2.31.so; do
  cp -a /usr/lib/x86_64-linux-gnu/$lib lib-2.31
done
# In the container: download the runner tarball, try its bundled node20, and on failure re-run it via the host's ld-2.31.so against the mounted glibc
docker run --rm -it -v ./lib-2.31:/lib-2.31 pytorch/manylinux-builder:cpu \
  bash -c "curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz; tar zxf actions-runner-linux-x64-2.320.0.tar.gz; /externals/node20/bin/node || LD_LIBRARY_PATH=/lib-2.31 /lib-2.31/ld-2.31.so /externals/node20/bin/node"
And its output looks as follows:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 208M 100 208M 0 0 446M 0 --:--:-- --:--:-- --:--:-- 446M
/externals/node20/bin/node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by /externals/node20/bin/node)
Welcome to Node.js v20.13.1.
Type ".help" for more information.
>
Though one doesn't even need to copy anything; it's sufficient to mount the folder containing libc-2.31 and its dependencies into the container:
docker run --rm -it -v /usr/lib/x86_64-linux-gnu/:/lib-2.31 pytorch/manylinux-builder:cpu \
  bash -c "curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz; tar zxf actions-runner-linux-x64-2.320.0.tar.gz; /externals/node20/bin/node || LD_LIBRARY_PATH=/lib-2.31 /lib-2.31/ld-2.31.so /externals/node20/bin/node"
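As a quick sanity check of the mismatch, a diagnostic sketch one could run inside the container (assuming binutils is available in the image; `/externals/node20/bin/node` is the path from the runner tarball extracted above):

```bash
# GLIBC symbol versions required by the runner's bundled node binary
objdump -T /externals/node20/bin/node | grep -oE 'GLIBC_[0-9.]+' | sort -uV | tail -3
# glibc version the container actually ships (2.17 on the CentOS 7 based images)
ldd --version | head -1
```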
In the FBGEMM CI, the base image is pytorch/manylinux-builder:rocm6.1. It has GLIBC 2.17 and does not satisfy GLIBC_2.27. Moreover, the guest OS of pytorch/manylinux-builder:rocm6.1 is CentOS 7; is it too old, and could it be a potential problem in the future?
@malfet I can build glibc on CentOS 7, but this does not solve the problem with GitHub Actions:
- name: Build and install Glibc 2.28
  if: ${{ env.IS_MANYLINUX2_28 != 'true' }}
  shell: bash -l {0}
  run: |
    # Install Node 20 via tj/n
    curl -L https://raw.githubusercontent.com/tj/n/master/bin/n -o n
    bash n 20
    # Build glibc 2.28 from source and install it under /opt/glibc-2.28
    curl -O https://ftp.gnu.org/gnu/glibc/glibc-2.28.tar.gz
    tar -zxf glibc-2.28.tar.gz
    cd glibc-2.28
    pwd
    mkdir glibc-build
    cd glibc-build
    mkdir -p /opt/glibc-2.28/etc
    touch /opt/glibc-2.28/etc/ld.so.conf
    ../configure --prefix=/opt/glibc-2.28 --disable-werror
    echo "running make -j 4"
    make -j 4
    echo "running make install"
    make install
    cd ../..
    rm -fr glibc-2.28 glibc-2.28.tar.gz
    echo "running ldd"
    ldd /opt/glibc-2.28/lib/ld-linux-x86-64.so.2
    export LD_LIBRARY_PATH=/opt/glibc-2.28/lib:$LD_LIBRARY_PATH
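A hedged aside on the last line of that step: an `export` inside a `run` step only affects that step's shell, so even a successful glibc build would not be visible to the checkout step that fails. Persisting the variable would need `$GITHUB_ENV`, roughly like the sketch below, and even then pointing the CentOS 7 userland at a different glibc via `LD_LIBRARY_PATH` is fragile:

```yaml
- name: Expose glibc 2.28 to later steps
  if: ${{ env.IS_MANYLINUX2_28 != 'true' }}
  shell: bash -l {0}
  run: |
    # Step-local exports are lost when the step exits; appending to $GITHUB_ENV
    # makes the variable available to all subsequent steps of the job.
    echo "LD_LIBRARY_PATH=/opt/glibc-2.28/lib:${LD_LIBRARY_PATH}" >> "${GITHUB_ENV}"
```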
Still an error: https://github.com/pytorch/test-infra/actions/runs/11927482863/job/33242922402?pr=5941#step:6:27
Possible mitigation PR: https://github.com/pytorch/test-infra/pull/5941
Current Status
ongoing
Error looks like
failure example
Incident timeline (all times pacific)
started: Wed Nov 13 ≈12pm
detected: Wed Nov 13 ≈3pm
User impact
Some Nova workflows may fail with the error above. Domain libraries affected: torchvision, torchaudio, data, torchtune.
Root cause
GitHub removed Node 16 in the 2.321.0 runner release.
Mitigation
ongoing
Prevention/followups
TBD
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd