google / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

[ROCM] Error: wheel file is invalid #21816

Open PhilipVinc opened 2 weeks ago

PhilipVinc commented 2 weeks ago

Description

The wheel files distributed at https://github.com/ROCm/jax/releases are invalid. See the error below:

(jax-env) [cad14908] fvicentini@login5:~$ python3 -m pip install --verbose -U https://github.com/ROCm/jax/releases/download/rocm-jaxlib-v0.4.28/jaxlib-0.4.28+rocm611-cp311-cp311-manylinux2014_x86_64.whl
Using pip 24.0 from /lus/home/CT5/cad14908/fvicentini/jax-env/lib/python3.11/site-packages/pip (python 3.11)
Looking in indexes: https://gorgone.cines.fr//root/pypi/+simple/
Collecting jaxlib==0.4.28+rocm611
  Downloading https://github.com/ROCm/jax/releases/download/rocm-jaxlib-v0.4.28/jaxlib-0.4.28+rocm611-cp311-cp311-manylinux2014_x86_64.whl (112.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 112.7/112.7 MB 85.1 MB/s eta 0:00:00
ERROR: Wheel 'jaxlib' located at /tmp/pip-unpack-qej7pdlb/jaxlib-0.4.28+rocm611-cp311-cp311-manylinux2014_x86_64.whl is invalid.

System info (python version, jaxlib version, accelerator, etc.)

(jax-env) [cad14908] fvicentini@login5:~$ pip --version
pip 24.0 from /lus/home/CT5/cad14908/fvicentini/jax-env/lib/python3.11/site-packages/pip (python 3.11)
(jax-env) [cad14908] fvicentini@login5:~$ python --version
Python 3.11.5

PhilipVinc commented 2 weeks ago

Downloading the file and trying to unzip it manually also fails. I think they are corrupted.
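For anyone who wants to reproduce the manual check without `unzip`: a wheel is a zip archive, so Python's standard `zipfile` module can tell whether the download is readable. This is only a minimal sketch; the helper name `wheel_archive_problem` is made up for illustration and is not part of pip or jaxlib.

```python
import zipfile


def wheel_archive_problem(path):
    """Return None if the archive at `path` reads cleanly, else a short description.

    Hypothetical helper: it only checks the zip container, not wheel metadata.
    """
    if not zipfile.is_zipfile(path):
        return "not a zip archive (truncated or corrupted download?)"
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the first name whose
            # CRC check fails, or None when all members are intact.
            bad_member = zf.testzip()
    except zipfile.BadZipFile as exc:
        return f"unreadable archive: {exc}"
    if bad_member is not None:
        return f"corrupted member: {bad_member}"
    return None
```

Calling it on the downloaded `.whl` path and getting anything other than `None` back would point at a broken upload rather than a pip problem.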

hawkinsp commented 1 week ago

@rahulbatra85 FYI.

Also, while I'm pinging you, I'll note that we're going to drop support for monolithic CUDA jaxlib wheels in the next release, in favor of plugin wheels. ROCM should switch or your build will break...

rahulbatra85 commented 1 week ago

@hawkinsp Thanks for the ping. Yeah, I saw it in the release notes. We are working on pushing out changes for the ROCm PJRT plugin.

Thanks!

PhilipVinc commented 1 week ago

@rahulbatra85 if you are changing your build infrastructure, can I give one more piece of feedback?

Your wheels ARE NOT manylinux2014 compliant, even if you tag them as such! Manylinux2014 means the wheels must not require anything newer than the CentOS 7 baseline (glibc 2.17), but instead your wheels link to the relatively recent GLIBCXX_3.4.26 and GLIBC_2.29. (I tested the most recent working one, jaxlib-0.4.26+rocm610-cp311-cp311-manylinux2014_x86_64.whl.)

This means 1) your wheels are not compliant, and 2) they are very hard to run in HPC environments.
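A rough way to see which symbol versions a shared object asks for, without extra tooling, is to scan its bytes for `GLIBC_x.y` and `GLIBCXX_x.y.z` strings. The sketch below is a heuristic, not what auditwheel does (auditwheel reads the ELF version tables properly); the function names are invented for illustration.

```python
import re

# Version strings as they appear in an ELF file's string tables.
_GLIBC = re.compile(rb"GLIBC_(\d+)\.(\d+)")
_GLIBCXX = re.compile(rb"GLIBCXX_(\d+)\.(\d+)\.(\d+)")


def max_symbol_versions(data: bytes):
    """Return (highest GLIBC, highest GLIBCXX) found in `data` as int tuples.

    Either element is None when no matching string is present.
    Heuristic only: it matches any occurrence of the text, referenced or not.
    """
    glibc = max(
        ((int(a), int(b)) for a, b in _GLIBC.findall(data)), default=None
    )
    glibcxx = max(
        ((int(a), int(b), int(c)) for a, b, c in _GLIBCXX.findall(data)),
        default=None,
    )
    return glibc, glibcxx


def exceeds_glibc_ceiling(data: bytes, ceiling=(2, 17)):
    """True if `data` mentions a GLIBC version above `ceiling` (2.17 = manylinux2014)."""
    glibc, _ = max_symbol_versions(data)
    return glibc is not None and glibc > ceiling
```

To use it, read the bytes of any `.so` unpacked from the wheel (for example with `open(path, "rb").read()`) and pass them in.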

I've been struggling for the last few months to run JAX on (France's) Cray HPC hardware with AMD GPUs, and it's really a pain. A few releases ago you bumped the GLIBC and GLIBCXX requirements and now it's hard to get it running at all.

mrodden commented 1 week ago

> Your wheels ARE NOT manylinux2014 compliant, even if you tag them as such!

We are aware of this and are working to fix it for the JAX builds. Currently these wheels are most likely Ubuntu 20.04+ compliant, since I believe that is what they are being built on.

> Downloading the file and trying to unzip it manually also fails. I think they are corrupted.

It appears that only two of the 6 wheels in the release have this issue:

I have checked the others and they seem to work properly, so those could be used instead as a workaround for now.

I may need to rebuild the busted wheels and re-upload as I am not sure what caused them to become unusable.

hawkinsp commented 1 week ago

I recommend running auditwheel in CI and verifying the tag you expect is the tag you get.
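A CI step along these lines could compare the tag in the wheel's filename with the tag auditwheel reports. The following is only a sketch under assumptions: the helper names are hypothetical, and the parsing assumes `auditwheel show` prints a line of the form `... is consistent with the following platform tag: "<tag>".`, which may differ between auditwheel versions.

```python
import re
import subprocess

# Legacy manylinux names and their PEP 600 equivalents.
LEGACY_ALIASES = {
    "manylinux1": "manylinux_2_5",
    "manylinux2010": "manylinux_2_12",
    "manylinux2014": "manylinux_2_17",
}


def normalize(tag):
    """Rewrite a legacy tag such as manylinux2014_x86_64 in PEP 600 form."""
    for old, new in LEGACY_ALIASES.items():
        if tag.startswith(old + "_"):
            return new + tag[len(old):]
    return tag


def claimed_tags(wheel_filename):
    """Platform tags claimed by the filename (the last dash-separated field)."""
    stem = wheel_filename[: -len(".whl")]
    return [normalize(t) for t in stem.split("-")[-1].split(".")]


def reported_tag(report):
    """First platform tag quoted in an `auditwheel show` report, or None."""
    m = re.search(r'platform tag: "([^"]+)"', report)
    return normalize(m.group(1)) if m else None


def tag_matches(wheel_filename, report):
    """True if the tag auditwheel reports is one the filename claims."""
    return reported_tag(report) in claimed_tags(wheel_filename)


def check_wheel(wheel_path):
    """Run auditwheel (must be installed) and compare against the filename."""
    result = subprocess.run(
        ["auditwheel", "show", wheel_path],
        capture_output=True, text=True, check=True,
    )
    return tag_matches(wheel_path.rsplit("/", 1)[-1], result.stdout)
```

A CI job could then fail the build whenever `check_wheel` returns `False` for any produced wheel.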

PhilipVinc commented 1 week ago

Thanks! In order to plan ahead, do you have a timeline for fixing the manylinux compliance?

mrodden commented 1 week ago

Re-uploaded working versions of:

> Thanks! In order to plan ahead, do you have a timeline for fixing the manylinux compliance?

I'm not sure what the ETA on that would be yet. We are doing that as part of the update to manylinux_2_28, since manylinux2014 is out of support in a couple months I believe.

> I recommend running auditwheel in CI and verifying the tag you expect is the tag you get.

auditwheel is definitely part of our real manylinux builds with other frameworks, but our JAX stuff came into that work late :/ It's definitely something in the pipeline however.

hawkinsp commented 4 days ago

@PhilipVinc Out of curiosity, what glibc/libstdc++ version can you support?

We're wondering to what standard to bump JAX's main releases, and two options are:
a) manylinux_2_28 (glibc 2.28, glibcxx 3.4.24)
b) manylinux_2_31 (glibc 2.31, glibcxx 3.4.28)

You noted that AMD's wheels were too new for you, so I'm curious what standards you can accept and whether it's possible to upgrade. Is 2_28 possible? I think that's the likely outcome.
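For anyone else who wants to report a data point from their own cluster: Python's standard `platform.libc_ver()` gives the glibc version on a login node, and a manylinux tag encodes the glibc it needs. A small sketch, with made-up helper names, assuming a glibc-based Linux system:

```python
import platform
import re

# glibc versions behind the legacy manylinux names (per PEP 600).
LEGACY_GLIBC = {
    "manylinux1": (2, 5),
    "manylinux2010": (2, 12),
    "manylinux2014": (2, 17),
}


def required_glibc(tag):
    """Minimum glibc (major, minor) implied by a manylinux platform tag."""
    m = re.match(r"manylinux_(\d+)_(\d+)", tag)
    if m:
        return int(m.group(1)), int(m.group(2))
    for legacy, version in LEGACY_GLIBC.items():
        if tag == legacy or tag.startswith(legacy + "_"):
            return version
    raise ValueError(f"not a manylinux tag: {tag!r}")


def system_glibc():
    """glibc version of the running interpreter's libc, or None if not glibc."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        return None
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def can_install(tag, glibc):
    """True if a system with the given glibc satisfies the tag's requirement."""
    return required_glibc(tag) <= glibc
```

On a node with glibc 2.28, `can_install("manylinux_2_28_x86_64", system_glibc())` would be true while the `manylinux_2_31` variant would not.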

PhilipVinc commented 4 days ago

Hey @hawkinsp. I will double check tomorrow but IIRC the Cray HPC system with AMD GPUs (France's 2nd largest HPC) has glibc 2.28 at most.

If you want, I will double check other big European HPC clusters as well to give you some datapoints, but in general do expect them to be outdated.

I'm sorry to state that, but they are on average slow to upgrade, so it's unfortunately on you in some sense to be conservative.

hawkinsp commented 4 days ago

manylinux_2_28 roughly corresponds to a release from Aug 2018, and I'm tempted to say "6 years is a long enough support window". It also happens to be the next newest version for which the manylinux project has docker images: https://github.com/pypa/manylinux, so I'd expect wide adoption as soon as CentOS 7, the base of manylinux2014, reaches end of life.

PhilipVinc commented 4 days ago

So I checked on the clusters I have access to:

From a quick chat with the support teams, it seems that the problem is that Cray is very slow to release updated versions of their custom software stack, while the other vendor places fewer constraints on them.

I agree with you that 6 years is a long enough support window. I think if you go with manylinux_2_28 it should be ok for the vast majority of users.