pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Invalid version identifier in filenames of nightly builds #7697

Closed fellhorn closed 3 weeks ago

fellhorn commented 1 month ago

🐛 Bug

pip 24.1 deprecated legacy version identifiers and no longer allows installing the current nightly wheels directly. Other Python package managers, such as uv, never supported these identifiers and have always required renaming the wheel.

Additionally, the version identifier inside the wheel differs from the one in the filename.
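
For context, both modern pip and uv validate versions against PEP 440. A minimal check with the packaging library (which pip itself vendors) shows which of the identifiers involved here parse:

from packaging.version import Version, InvalidVersion

# "nightly" is the filename placeholder; the other two are the kinds of
# versions that actually appear in the wheel metadata and in torch nightlies.
for candidate in ["nightly", "2.5.0+git41d998d", "2.5.0.dev20240809"]:
    try:
        Version(candidate)
        print(f"{candidate}: valid PEP 440 version")
    except InvalidVersion:
        print(f"{candidate}: invalid under PEP 440")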

To Reproduce

Steps to reproduce the behavior:

uv

> uv pip install torch_xla@https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
error: The wheel filename "torch_xla-nightly-cp311-cp311-linux_x86_64.whl" has an invalid version part: expected version to start with a number, but no leading ASCII digits were found

pip

Broken:

> pip install pip==24.1.2
...
> pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
ERROR: Invalid requirement: 'torch-xla==nightly': Expected end or semicolon (after name and no valid version specifier)
    torch-xla==nightly
             ^

For others who find this issue and need a workaround:

:green_circle: Works with the torch_xla@ format

> pip install pip==24.1.2
...
> pip install torch_xla@https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
...
Installing collected packages: torch_xla
Successfully installed torch_xla-2.5.0+git41d998d

:green_circle: Works with older pip versions

> pip install "pip<24"
...
> pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
...
Installing collected packages: torch_xla
Successfully installed torch_xla-2.5.0+git41d998d
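
Another possible workaround for uv (a sketch, assuming the wheel has already been downloaded locally): since the wheel's internal metadata carries the real version, read it out and rename the file to match, so that uv and pip 24.1+ accept it:

import re
import zipfile
from pathlib import Path

# Path is an assumption: the nightly wheel, already downloaded locally.
wheel = Path("torch_xla-nightly-cp311-cp311-linux_x86_64.whl")

# The wheel's .dist-info/METADATA records the actual build version.
with zipfile.ZipFile(wheel) as zf:
    metadata = next(n for n in zf.namelist() if n.endswith(".dist-info/METADATA"))
    text = zf.read(metadata).decode()

version = re.search(r"^Version: (.+)$", text, re.MULTILINE).group(1)

# Rename e.g. torch_xla-nightly-... -> torch_xla-2.5.0+git41d998d-...
renamed = wheel.rename(wheel.with_name(wheel.name.replace("nightly", version, 1)))
print(f"now installable with: pip install {renamed}")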

Expected behavior

I would expect the version identifier in the filename to match the one in the wheel and be a valid identifier. This should allow installation with uv and modern pip versions.

Potential solutions

Ideas:

JackCaoG commented 1 month ago

@will-cromar FYI; @wonjoolee95 too, since you are fixing a similar issue for our GPU wheels

wonjoolee95 commented 1 month ago

This is helpful, thanks for the info! I'm able to reproduce:

# Fails
wonjoo@t1v-n-b72eb559-w-0:~$ pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
Defaulting to user installation because normal site-packages is not writeable
ERROR: Invalid requirement: 'torch-xla==nightly': Expected end or semicolon (after name and no valid version specifier)
    torch-xla==nightly
             ^
# Works
wonjoo@t1v-n-b72eb559-w-0:~$ pip install "pip<24"
Defaulting to user installation because normal site-packages is not writeable
Collecting pip<24
  Downloading pip-23.3.2-py3-none-any.whl.metadata (3.5 kB)
Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 34.0 MB/s eta 0:00:00
WARNING: Error parsing dependencies of distro-info: Invalid version: '1.1build1'
WARNING: Error parsing dependencies of python-debian: Invalid version: '0.1.43ubuntu1'
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
  WARNING: The scripts pip, pip3 and pip3.10 are installed in '/home/wonjoo/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pip-23.3.2

I think it's better if we do pip install "pip<24" to fix our GPU wheels ASAP, and then come up with a longer-term solution. @will-cromar, do you know where the correct place would be to add this pip install "pip<24" command in our /infra files?

will-cromar commented 1 month ago

Is this issue actually what's causing our build breakage? Why are the TPU builds passing but not the GPU builds? The most recent failure I see there is this:

Step #2 - "build_xla_docker_image":     ERROR: An error occurred during the fetch of repository 'go_sdk':
Step #2 - "build_xla_docker_image":        Traceback (most recent call last):
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 101, column 16, in _go_download_sdk_impl
Step #2 - "build_xla_docker_image":                     _remote_sdk(ctx, [url.format(filename) for url in ctx.attr.urls], ctx.attr.strip_prefix, sha256)
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 209, column 21, in _remote_sdk
Step #2 - "build_xla_docker_image":                     ctx.download(
Step #2 - "build_xla_docker_image":     Error in download: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.4.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/go_sdk/go_sdk.tar.gz: Bytes read 127925296 but wanted 141812725
Step #2 - "build_xla_docker_image":     ERROR: /src/pytorch/xla/WORKSPACE:136:15: fetching _go_download_sdk rule //external:go_sdk: Traceback (most recent call last):
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 101, column 16, in _go_download_sdk_impl
Step #2 - "build_xla_docker_image":                     _remote_sdk(ctx, [url.format(filename) for url in ctx.attr.urls], ctx.attr.strip_prefix, sha256)
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 209, column 21, in _remote_sdk
Step #2 - "build_xla_docker_image":                     ctx.download(
Step #2 - "build_xla_docker_image":     Error in download: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.4.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/go_sdk/go_sdk.tar.gz: Bytes read 127925296 but wanted 141812725
Step #2 - "build_xla_docker_image":     ERROR: Analysis of target '//:_XLAC.so' failed; build aborted: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.4.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/go_sdk/go_sdk.tar.gz: Bytes read 127925296 but wanted 141812725

Even if we can hack our build, this is a client-side issue: nobody who has updated pip recently can install our wheels, because the rename we're doing is no longer valid.

The build version we set is defined by some combination of these environment variables: https://github.com/pytorch/xla/blob/master/infra/ansible/config/env.yaml

I think TORCH_XLA_VERSION and GIT_VERSIONED_XLA_BUILD are the important ones, but you'll have to review setup.py to see exactly how we set the version. That version is probably still valid, e.g. torch_xla-2.5.0+git41d998d. The problem is that we rename the wheels with the nightly date here: https://github.com/pytorch/xla/blob/44f88a9d6135abe5cbb533485b40e19d11b88b23/infra/ansible/roles/build_srcs/tasks/main.yaml#L74-L89

We need to at least change that rename to one of the valid patterns @fellhorn suggested, or copy the pattern used by torch (e.g. torch-X.Y.Z.devYYYYMMDD).
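
For illustration, a rough Python sketch of that rename (the real change belongs in the Ansible task linked above; the base-version constant is a placeholder, not what env.yaml actually exports):

from datetime import datetime, timezone
from pathlib import Path

# Placeholder: in practice this would come from env.yaml / setup.py.
TORCH_XLA_BASE_VERSION = "2.5.0"
dev_version = f"{TORCH_XLA_BASE_VERSION}.dev{datetime.now(timezone.utc):%Y%m%d}"

for whl in Path("dist").glob("torch_xla-*.whl"):
    # Wheel filenames are {name}-{version}-{python}-{abi}-{platform}.whl,
    # so the version is the second hyphen-separated field.
    parts = whl.name.split("-")
    parts[1] = dev_version
    whl.rename(whl.with_name("-".join(parts)))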

You can dry run the ansible workflow with a command like this one:

https://github.com/pytorch/xla/blob/44f88a9d6135abe5cbb533485b40e19d11b88b23/.github/workflows/_build_torch_xla.yml#L50

Anything that gets written to /dist is what we will upload to GCS.

JackCaoG commented 1 month ago

@zpcore can you make the rename change that @will-cromar mentioned above, since you are on call this week? It should just be a one-line change, but then we need to update the README to reflect the new format.

mfatih7 commented 1 month ago

Hello all

As a general comment:

When users find errors in pytorch-xla, the developers fix them in nightly releases and ask the users to test the fixes. But generating an environment with compatible torch_xla, torch, and torchvision is not straightforward, as described here.

This issue is one example of that. I hope you provide a better way for users to test nightly updates easily.

zpcore commented 1 month ago

> Hello all
>
> As a general comment:
>
> When users find errors in pytorch-xla, the developers fix them in nightly releases and ask the users to test the fixes. But generating an environment with compatible torch_xla, torch, and torchvision is not straightforward, as described here.
>
> This issue is one example of that. I hope you provide a better way for users to test nightly updates easily.

Thanks for the feedback. I think we are missing example commands to install compatible torch, torch{vision,audio}, and torch_xla for CUDA. We will update the documentation. For now, you can use, e.g.:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp310-cp310-linux_x86_64.whl

In general, this should be compatible.
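
One quick way to confirm the installed pair actually loads together (a suggestion, not an official check) is to import both packages and print their versions:

import torch
import torch_xla

# Both packages expose __version__; the suffixes (+cu121, +git...) show
# which nightly variants actually got installed.
print("torch:    ", torch.__version__)
print("torch_xla:", torch_xla.__version__)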

mfatih7 commented 4 weeks ago

@zpcore

Thank you for the answer

I was able to look into your answer now. The lines you provided are for CUDA, but I was trying to generate an environment with nightly releases of torch, torch{vision,audio}, and torch_xla on a TPU VM.

mfatih7 commented 4 weeks ago

I think the updated lines on the main page

pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html

are OK. Now I get the output below with pip list:

...
tomli                        2.0.1
torch                        2.5.0.dev20240809+cpu
torch-xla                    2.5.0+git9fbd64a
torchmetrics                 1.4.1
torchsummary                 1.5.1
torchvision                  0.20.0.dev20240809+cpu
traitlets                    5.14.3
...