pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

[DomainsOnly] Jobs fail with GLIBC version not found #140631

Open izaitsevfb opened 1 week ago

izaitsevfb commented 1 week ago

Current Status

ongoing

Error looks like

Run actions/checkout@v3
/__e/node20/bin/node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by /__e/node20/bin/node)

failure example
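
To see at a glance which glibc the bundled node20 binary expects, the loader errors above can be reduced to the highest `GLIBC_x.y` symbol version they mention. This is a minimal sketch, not part of the original report: the function name `max_required_glibc` and the file `job.log` are hypothetical, and it assumes GNU `grep`, `sed`, and `sort -V` are available.

```shell
# Hypothetical helper: read dynamic-loader errors on stdin and print the
# highest GLIBC_x.y symbol version they mention (GLIBCXX_* lines are ignored).
max_required_glibc() {
  grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -V | tail -n 1
}

# Usage (job.log is a hypothetical file holding the failed step's output):
#   max_required_glibc < job.log
```

For the errors quoted above the result is 2.28, so any container image with an older glibc cannot start the runner's node20.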

Incident timeline (all times pacific)

started: Wed Nov 13 ≈12pm
detected: Wed Nov 13 ≈3pm

User impact

Some Nova workflows may fail with the error above. Affected domain libraries: torchvision, torchaudio, data, torchtune.

Root cause

GitHub removed Node 16 in the v2.321.0 runner release. From the release announcement:

> FYI, a new runner release is created v2.321.0. I am going to rollout this runner slowly through all the rings. What's special about this runner release is we removed node16 from this runner package and upgrade the dotnet sdk from dotnet 6 to dotnet 8. We need to fully remove node16 from our actions runner to secure our ecosystems and upgrade .net to version 8. Both of these break support with older clib versions of linux, mainly centos7

Mitigation

ongoing

Prevention/followups

TBD

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

huydhn commented 1 week ago

I logged in to one runner, i-0c25b12d27c6cfb13; it's on AL2023 and should support Node 20. However, the Nova build job uses a GH container, which needs to be updated. The failed checkout step runs inside the container: https://github.com/pytorch/test-infra/blob/main/.github/workflows/build_wheels_linux.yml#L128

malfet commented 1 week ago

> I logged in to one runner, i-0c25b12d27c6cfb13; it's on AL2023 and should support Node 20. However, the Nova build job uses a GH container, which needs to be updated. The failed checkout step runs inside the container: https://github.com/pytorch/test-infra/blob/main/.github/workflows/build_wheels_linux.yml#L128

Hmm, why are we running the runner inside the container rather than on the host OS?

atalman commented 1 week ago

Landed PyTorch PR: https://github.com/pytorch/pytorch/pull/138732
TestInfra PR: https://github.com/pytorch/test-infra/pull/5909

malfet commented 1 week ago

@atalman I don't see aarch64 failures on HUD. Can you please elaborate on what you are referring to?

atalman commented 1 week ago

@malfet it shows up here: https://hud2.pytorch.org/hud/pytorch/vision/nightly/1?per_page=50&mergeLF=true A fix is coming.

malfet commented 1 week ago

> It shows up here: https://hud2.pytorch.org/hud/pytorch/vision/nightly/1?per_page=50&mergeLF=true A fix is coming.

@atalman those are part of Nova workflows, aren't they? I.e. PyTorch CI/CD is totally unaffected by this SEV? (Changing subject to Domains only then)

q10 commented 1 week ago

We're also seeing it in our FBGEMM CI jobs - https://github.com/pytorch/FBGEMM/actions/runs/11843064011/job/33003165717#step:5:28

atalman commented 1 week ago

> We're also seeing it in our FBGEMM CI jobs - pytorch/FBGEMM/actions/runs/11843064011/job/33003165717#step:5:28

Hi @q10: the fix for the aarch64 failures was deployed to test-infra, and the latest run looks good: https://github.com/pytorch/FBGEMM/actions/runs/11843064011/job/33017996827

atalman commented 3 days ago

The failure still exists for all ROCm jobs: https://github.com/pytorch/vision/actions/runs/11891587010/job/33132656552

malfet commented 2 days ago

The bash script below demonstrates that it's possible to propagate libc-2.31 from the host OS (running Ubuntu 20.04 in my case) into the CentOS 7 based docker container:

mkdir -p lib-2.31
for lib in libstdc++.so.6 libdl.so.2 libm.so.6 libpthread.so.0 libc.so.6 librt.so.1 ld-2.31.so libdl-2.31.so libstdc++.so.6.0.32 libm-2.31.so libgcc_s.so.1 libpthread-2.31.so libc-2.31.so librt-2.31.so; do
  cp -a /usr/lib/x86_64-linux-gnu/$lib lib-2.31
done

docker run --rm -it -v ./lib-2.31:/lib-2.31 pytorch/manylinux-builder:cpu bash -c "curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz; tar zxf actions-runner-linux-x64-2.320.0.tar.gz; /externals/node20/bin/node  || LD_LIBRARY_PATH=/lib-2.31 /lib-2.31/ld-2.31.so /externals/node20/bin/node"

And its output looks as follows:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  208M  100  208M    0     0   446M      0 --:--:-- --:--:-- --:--:--  446M
/externals/node20/bin/node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /externals/node20/bin/node)
/externals/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by /externals/node20/bin/node)
Welcome to Node.js v20.13.1.
Type ".help" for more information.
>

malfet commented 2 days ago

Though one doesn't even need to copy anything; it's sufficient to mount the folder containing libc-2.31 and its dependencies into the container:

 docker run --rm -it -v /usr/lib/x86_64-linux-gnu/:/lib-2.31 pytorch/manylinux-builder:cpu bash -c "curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz; tar zxf actions-runner-linux-x64-2.320.0.tar.gz; /externals/node20/bin/node  || LD_LIBRARY_PATH=/lib-2.31 /lib-2.31/ld-2.31.so /externals/node20/bin/node"
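
The invocation pattern used in both commands (start the program through a newer dynamic loader and point it at a matching library directory) can be wrapped in a small function. This is a sketch rather than the fix that was deployed: `run_with_loader` is a hypothetical name, and it passes the directory through the loader's `--library-path` option in place of the `LD_LIBRARY_PATH` variable used above, so the search path is not inherited by child processes.

```shell
# Hypothetical wrapper: run a program through an alternate dynamic loader.
#   $1    directory holding the newer libc and its dependencies
#   $2    path to the matching loader (for example /lib-2.31/ld-2.31.so)
#   $3... program to run, followed by its arguments
run_with_loader() {
  local libdir=$1 loader=$2
  shift 2
  "$loader" --library-path "$libdir" "$@"
}

# Usage inside the container from the commands above:
#   run_with_loader /lib-2.31 /lib-2.31/ld-2.31.so /externals/node20/bin/node --version
```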

liligwu commented 2 days ago

> We're also seeing it in our FBGEMM CI jobs - https://github.com/pytorch/FBGEMM/actions/runs/11843064011/job/33003165717#step:5:28

In the FBGEMM CI, the base image is pytorch/manylinux-builder:rocm6.1. It has GLIBC 2.17, which does not satisfy the GLIBC_2.27 requirement. Moreover, the guest OS of pytorch/manylinux-builder:rocm6.1 is CentOS 7. Is that too old, and could it become a problem in the future?
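
The mismatch described here (glibc 2.17 in the image versus the `GLIBC_2.27` and `GLIBC_2.28` symbols that node20 asks for) can be checked with a short script. This is a minimal sketch, assuming `getconf` from glibc and GNU `sort -V`; both function names are hypothetical.

```shell
# Print the glibc version of the current system, for example "2.17".
# getconf prints a line such as "glibc 2.17"; keep only the second field.
glibc_version() {
  getconf GNU_LIBC_VERSION | awk '{print $2}'
}

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Usage, run inside the image, with 2.28 taken from the error log:
#   if version_ge "$(glibc_version)" 2.28; then
#     echo "glibc is new enough"
#   else
#     echo "glibc is too old for node20"
#   fi
```

Note that glibc is only one of the requirements in the error log; the `GLIBCXX_*` and `CXXABI_*` lines refer to libstdc++, which this check does not cover.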

atalman commented 2 days ago

@malfet I can build glibc on CentOS 7, but this does not solve the problem with GitHub Actions:

- name: Build and install Glibc 2.28
  if: ${{ env.IS_MANYLINUX2_28 != 'true' }}
  shell: bash -l {0}
  run: |
    # Install Node 20 with the "n" version manager
    curl -L https://raw.githubusercontent.com/tj/n/master/bin/n -o n
    bash n 20
    # Download and unpack the glibc 2.28 sources
    curl -O https://ftp.gnu.org/gnu/glibc/glibc-2.28.tar.gz
    tar -zxf glibc-2.28.tar.gz
    cd glibc-2.28
    pwd
    # Build out of tree and install under /opt/glibc-2.28
    mkdir glibc-build
    cd glibc-build
    mkdir -p /opt/glibc-2.28/etc
    touch /opt/glibc-2.28/etc/ld.so.conf
    ../configure --prefix=/opt/glibc-2.28 --disable-werror
    echo "running make -j 4"
    make -j 4 # Build with 4 parallel jobs
    echo "running make install"
    make install
    # Return to the directory holding the tarball before cleaning up
    cd ../..
    rm -fr glibc-2.28 glibc-2.28.tar.gz
    echo "running ldd"
    ldd /opt/glibc-2.28/lib/ld-linux-x86-64.so.2
    # Only affects the rest of this step's shell
    export LD_LIBRARY_PATH=/opt/glibc-2.28/lib:$LD_LIBRARY_PATH

Still an error: https://github.com/pytorch/test-infra/actions/runs/11927482863/job/33242922402?pr=5941#step:6:27
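
One detail of the step above may matter when debugging it: `export` only changes the environment of that step's own shell, so later steps do not see the new `LD_LIBRARY_PATH`. GitHub Actions persists variables across steps through the file named by `GITHUB_ENV`. The sketch below shows that mechanism only. `persist_env` is a hypothetical helper, and nothing in this thread establishes that persisting the variable would make the runner's node20 binary load; pointing every later process at a different libc without the matching loader can also break them.

```shell
# Hypothetical helper: make NAME=VALUE visible to later steps of the same job
# by appending it to the file GitHub Actions exposes as $GITHUB_ENV.
persist_env() {
  printf '%s=%s\n' "$1" "$2" >> "$GITHUB_ENV"
}

# Usage inside a workflow "run:" step:
#   persist_env LD_LIBRARY_PATH "/opt/glibc-2.28/lib:${LD_LIBRARY_PATH:-}"
```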

atalman commented 21 hours ago

Possible mitigation PR: https://github.com/pytorch/test-infra/pull/5941