huggingface / accelerate

πŸš€ A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Hotfix PyTorch Version Installation in CI Workflow for Minimum Version Matrix #2889

Open yhna940 opened 1 week ago

yhna940 commented 1 week ago

What does this PR do?

ci-workflow

This PR corrects the CI workflow to ensure that the minimum specified PyTorch version (1.10.0) is installed when the matrix.pytorch-version is set to minimum. Previously, the installation condition incorrectly referenced matrix.test-kind, leading to the installation of the latest PyTorch version instead.

Changes Made

Before

if [[ ${{ matrix.test-kind }} = minimum ]]; then pip install torch==1.10.0; fi

After

if [[ ${{ matrix.pytorch-version }} = minimum ]]; then pip install torch==1.10.0; fi
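
For context, this conditional lives in a step of the GitHub Actions test job. A minimal sketch of how the version matrix and the conditional install fit together is shown below; the workflow name, job name, matrix values, and surrounding steps are illustrative assumptions rather than a copy of the actual workflow file:

name: CI (illustrative sketch, not the real workflow file)
on: [pull_request]
jobs:
  run_tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        pytorch-version: [latest, minimum]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          pip install -e .
          # The fix: gate the pin on matrix.pytorch-version, not matrix.test-kind
          if [[ ${{ matrix.pytorch-version }} = minimum ]]; then pip install torch==1.10.0; fi
      - name: Run tests
        run: make test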

Motivation and Context

This change ensures that the CI workflow correctly installs PyTorch 1.10.0 when the matrix.pytorch-version is set to minimum. This avoids any inconsistencies and ensures that tests are run against the intended version of PyTorch.

Related Issues

N/A

Before submitting

HuggingFaceDocBuilderDev commented 1 week ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BenjaminBossan commented 1 week ago

I was afraid that something like this would happen: since CI did not correctly install the old PyTorch version, there is apparently some code that fails in CI because it requires a more recent version. So for this fix to pass, we would also have to fix all the failing tests (which may or may not be easy).

However, I wonder if we really want to support PyTorch 1.10.0, which is from Oct 2021. When I talked to PyTorch devs, their opinion was that supporting the last 4 versions (not counting patch releases) is reasonable, so that would mean setting the min version to 2.0.1. We probably want to be a bit more generous here, but the last change to the min version was a year ago, so I think we can bump it. Depending on the chosen version, we may not have to fix any tests after all.

Let's wait for Zach's return to decide this.

yhna940 commented 1 week ago

Thank you for the review!

I understand the concern regarding the PyTorch version support. Since Zach's input is pending, I'm happy to wait for his review.

However, I would like to emphasize that this PR specifically addresses the issue of the pytorch-version not being correctly referenced in the CI configuration. It ensures that the specified minimum version is actually installed.

Given that this PR is focused on fixing the reference issue, I suggest merging it with the installation set to torch==2.X (the minimum version that allows the CI to pass). This way, we can ensure the CI pipeline is correctly referencing the pytorch-version.

Subsequently, huggingface maintainers can decide on the appropriate minimum PyTorch version policy internally. They can also address any failing tests and gradually adjust the CI minimum version as needed. Modifying the test cases within this PR would make it significantly larger and more complex, so I propose handling those changes separately.

Looking forward to your thoughts and Zach's input.

if [[ ${{ matrix.pytorch-version }} = minimum ]]; then pip install torch==2.x; fi

BenjaminBossan commented 6 days ago

Given that this PR is focused on fixing the reference issue, I suggest merging it with the installation set to torch==2.X (the minimum version that allows the CI to pass). This way, we can ensure the CI pipeline is correctly referencing the pytorch-version.

This could be a good compromise and would be a strict improvement on the current status. Do you know what that version would be? If not, we could just try 2.0. Regardless though, this PR would need to wait for Zach's approval to be merged, so a bit of waiting is inevitable.

Modifying the test cases within this PR would make it significantly larger and more complex, so I propose handling those changes separately.

I agree.

yhna940 commented 1 day ago

Do you know what that version would be? If not, we could just try 2.0. Regardless though, this PR would need to wait for Zach's approval to be merged, so a bit of waiting is inevitable.

It seems that CI only works on the latest version of torch. I've tried v2.2.2 and v2.1.2 with the make test command and got the following error messages.

FAILED tests/test_grad_sync.py::SyncScheduler::test_gradient_sync_gpu_multi - RuntimeError: 'accelerate launch --num_processes=8 --monitor_interval=0.1 /workspaces/accelerate/src/accelerate/test_utils/scripts/test_sync.py' failed with returncode 1
FAILED tests/test_multigpu.py::MultiDeviceTester::test_multi_device_merge_fsdp_weights - RuntimeError: 'accelerate launch --num_processes=8 --monitor_interval=0.1 /workspaces/accelerate/src/accelerate/test_utils/scripts/test_merge_weights.py' failed with returncode 1
FAILED tests/test_multigpu.py::MultiDeviceTester::test_pippy - RuntimeError: 'accelerate launch --multi_gpu --num_processes=8 /workspaces/accelerate/src/accelerate/test_utils/scripts/external_deps/test_pippy.py' failed with returncode 1
FAILED tests/test_utils.py::UtilsTester::test_dynamo - torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
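
For anyone wanting to reproduce these failures locally, a minimal sketch (assuming a dev install of the repo and a machine with enough GPUs for the multi-GPU tests) is simply to pin the older torch and rerun the suite:

pip install torch==2.1.2  # or torch==2.2.2; both produced the failures listed above
make test
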
muellerzr commented 1 day ago

Thanks for your patience while I was out on holiday! Our general rule of thumb with Accelerate, to keep things stable, is to support the last 2 years of PyTorch releases, which for us right now means 1.12.0.

We should definitely fix whatever is broken; thank you for pointing out that minimum isn't doing its job!

If you feel up to working on fixing all these issues, or tackling a few of them, that'd be most welcome. The other solution is to merge this, since we know this "fixes" the CI, and I can work on getting these passing ASAP.

Which are you more open to contributing to? πŸ€—

yhna940 commented 1 day ago

Thank you for getting back to me, and I hope you had a wonderful holiday!

For this PR, I suggest merging it as it corrects the issue with the pytorch-version reference in the CI workflow. Once this is addressed, we can move forward with fixing any broken CI issues separately.

I'm open to contributing further by addressing specific failing tests or issues that come up.

Looking forward to your thoughts on this plan :)

BenjaminBossan commented 22 hours ago

For this PR, I suggest merging it as it corrects the issue with the pytorch-version reference in the CI workflow. Once this is addressed, we can move forward with fixing any broken CI issues separately.

We can't really merge the PR as is, since it would mean that CI is broken until all the issues with older PyTorch versions have been fixed. Instead, we should make this change, right?

- if [[ ${{ matrix.pytorch-version }} = minimum ]]; then pip install torch==1.10.0; fi
+ if [[ ${{ matrix.pytorch-version }} = minimum ]]; then pip install torch==2.3.1; fi

This way, all the tests pass and we're basically testing the same thing as we already do in accelerate. When PyTorch 2.4 is released, we also make sure that 2.3 keeps on working. Next, we can slowly try to move the version down to 1.12 while fixing the issues that come up.

yhna940 commented 22 hours ago

- if [[ ${{ matrix.pytorch-version }} = minimum ]]; then pip install torch==1.10.0; fi
+ if [[ ${{ matrix.pytorch-version }} = minimum ]]; then pip install torch==2.3.1; fi

Done 5709c05 πŸ™