sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

[MNT] handle `mps backend` for lower versions of pytorch and fix `mps` failure on `macOS-latest` runner #1648

Closed fnhirwa closed 2 months ago

fnhirwa commented 2 months ago

Description

This PR handles the error that can occur when setting the device to mps if the installed torch version doesn't support the mps backend.

Depends on https://github.com/jdb78/pytorch-forecasting/pull/1633

I used pytest.MonkeyPatch() to disable discovery of the mps accelerator, so the tests run on CPU for macOS-latest (see the sketch below).

fixes https://github.com/jdb78/pytorch-forecasting/issues/1596
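
A minimal sketch of the per-test patching approach described above, using pytest's built-in `monkeypatch` fixture; the test name and assertion are illustrative only, not code from this PR:

```python
import torch


def test_something_on_cpu(monkeypatch):
    # Patch the private torch hook that reports mps availability; code that
    # consults it afterwards (e.g. accelerator auto-detection) sees False.
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)
    assert torch._C._mps_is_available() is False
    # ... build the model / trainer here; it will fall back to cpu ...
```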

codecov-commenter commented 2 months ago


Codecov Report

Attention: Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.

Project coverage is 90.03%. Comparing base (e2bf3fd) to head (f3c64ab).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pytorch_forecasting/utils/_utils.py | 0.00% | 7 Missing :warning: |


Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1648      +/-   ##
==========================================
- Coverage   90.14%   90.03%   -0.12%
==========================================
  Files          32       32
  Lines        4780     4786       +6
==========================================
  Hits         4309     4309
- Misses        471      477       +6
```

| [Flag](https://app.codecov.io/gh/jdb78/pytorch-forecasting/pull/1648/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jan+Beitner) | Coverage Δ | |
|---|---|---|
| [cpu](https://app.codecov.io/gh/jdb78/pytorch-forecasting/pull/1648/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jan+Beitner) | `90.03% <0.00%> (-0.12%)` | :arrow_down: |
| [pytest](https://app.codecov.io/gh/jdb78/pytorch-forecasting/pull/1648/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jan+Beitner) | `90.03% <0.00%> (-0.12%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jan+Beitner#carryforward-flags-in-the-pull-request-comment) to find out more.


XinyuWuu commented 2 months ago

Great, I thought we would have to patch the binary file to overwrite torch._C._mps_is_available, but MonkeyPatch is a much better solution.

XinyuWuu commented 2 months ago

Can we put the patch in a global fixture? Then we don't have to request it manually for every test.

# put it in conftest.py
import pytest

@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
    """replace torch._C._mps_is_available for all tests."""
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)

XinyuWuu commented 2 months ago

ok, this seems to be working, but I do not understand entirely what you are doing here and why...

Can you kindly explain?

I think it's replacing the torch._C._mps_is_available function to disable MPS. torch._C is a binary library.

fnhirwa commented 2 months ago

ok, this seems to be working, but I do not understand entirely what you are doing here and why...

Can you kindly explain?

Well, the mps backend check uses the torch._C._mps_is_available function. Patching it lets us force torch.backends.mps.is_available to return False, since that function generally calls _mps_is_available from the binary torch._C.

For macOS, with these changes, we now run the tests on cpu only and the mps accelerator is disabled.
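
For illustration, a sketch of the mechanism described above, using `pytest.MonkeyPatch` directly; it assumes a torch version where `torch.backends.mps.is_available()` delegates to the private `torch._C._mps_is_available()` and has not already cached a result:

```python
import pytest
import torch

with pytest.MonkeyPatch.context() as mp:
    # Force the low-level check to report that mps is unavailable.
    mp.setattr("torch._C._mps_is_available", lambda: False)
    print(torch.backends.mps.is_available())  # expected: False
# Leaving the context restores the original binding, so nothing changes globally.
```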

fnhirwa commented 2 months ago

Can we put the patch in a global fixture? Then we don't have to request it manually for every test.

# put it in conftest.py
import pytest

@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
    """replace torch._C._mps_is_available for all tests."""
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)

This is a great solution; making it a global fixture makes more sense.

fkiraly commented 2 months ago

Question: does the user on macOS have the option to force the backend at runtime?

That is, do the tests faithfully reflect a situation that the user can establish on mac?

Because if the user cannot set the flag or something equivalent to it, then our tests will pass, but the user cannot make use of the package. This might sound "trivial" but I just want to emphasize this again: each test should "certify" a run case on the user's computer.

More specific question: let's take an actual mac similar to the setup of macos-latest. Can you give me a very concrete condition or preparation code that shows:

  • how the user would run the code on conditions prior to this PR and encounter the error
  • how the user would run the code on the conditions after this PR and not encounter the error

If this is something to configure for the user, furthermore, we need to explain it in the documentation as a workaround.

fnhirwa commented 2 months ago

Question: does the user on macOS have the option to force the backend at runtime?

That is, do the tests faithfully reflect a situation that the user can establish on mac?

Because if the user cannot set the flag or something equivalent to it, then our tests will pass, but the user cannot make use of the package. This might sound "trivial" but I just want to emphasize this again: each test should "certify" a run case on the user's computer.

More specific question: let's take an actual mac similar to the setup of macos-latest. Can you give me a very concrete condition or preparation code that shows:

  • how the user would run the code on conditions prior to this PR and encounter the error
  • how the user would run the code on the conditions after this PR and not encounter the error

If this is something to configure for the user, furthermore, we need to explain it in the documentation as a workaround.

Running the code doesn't fail locally; I am on macOS myself. The main issue is with the GitHub macOS runner.

The tests are failing upstream due to the lack of nested virtualization and Metal Performance Shaders (MPS) support, as noted here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners

I would say we can test with the MPS accelerator again once this is solved.

fkiraly commented 2 months ago

What I mean is: we are patching only the tests, so it looks to me like it would still fail for the user on such a system?

fnhirwa commented 2 months ago

What I mean is: we are patching only the tests, so it looks to me like it would still fail for the user on such a system?

Yeah, we are only patching the tests; this doesn't make any global changes. Any modifications last only for the lifetime of the test being run; once it's done, the system rolls back to the normal setup.

fkiraly commented 2 months ago

May I revert to my original question - could you kindly be precise about which usage condition we are now testing, and what user-side setup the following two correspond to:

  1. pre-PR condition under macos-latest
  2. post-PR condition under macos-13 or macos-latest

fnhirwa commented 2 months ago

May I revert to my original question - could you kindly be precise about which usage condition we are now testing, and what user-side setup the following two correspond to:

  1. pre-PR condition under macos-latest
  2. post-PR condition under macos-13 or macos-latest

On macOS:

fkiraly commented 2 months ago

What's the way a user would enable mps or disable mps? Is this hard-coded, or can this be set somewhere?

Depending on whether this is possible or not, we should make sure we give clear pointers in documentation, or set reasonable defaults, or raise informative exceptions.

fnhirwa commented 2 months ago

What's the way a user would enable mps or disable mps? Is this hard-coded, or can this be set somewhere?

Depending on whether this is possible or not, we should make sure we give clear pointers in documentation, or set reasonable defaults, or raise informative exceptions.

A user currently does not disable mps; it is enabled by default if it is present. What I am thinking of now is adding a guide to the documentation on using the mps accelerator and on the need for the PYTORCH_ENABLE_MPS_FALLBACK=1 env variable when using mps.

Giving users control over whether to enable or disable mps is possible: we could create a function that receives a parameter indicating whether to enable it and, when disabled, mocks torch._C._mps_is_available to return False. With this, we would need to document the usage of and need for that function.
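
A hypothetical sketch of such a toggle (the name `set_mps_enabled` is illustrative only; nothing like this exists in pytorch-forecasting today), assuming torch exposes `torch._C._mps_is_available` as discussed in this thread:

```python
import torch

# Keep a reference to the original (private) availability check.
_ORIGINAL_MPS_CHECK = torch._C._mps_is_available


def set_mps_enabled(enabled: bool) -> None:
    """Force torch to report mps as unavailable when called with False.

    This overrides a private torch attribute, so it is a non-public-API
    workaround rather than a supported setting.
    """
    torch._C._mps_is_available = _ORIGINAL_MPS_CHECK if enabled else (lambda: False)
```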

fkiraly commented 2 months ago

A user currently does not disable mps; it is enabled by default if it is present.

  • which layer does this - torch or pytorch-forecasting?
  • if there is defaulting depending on available or not, why is it failing on macos-latest

Giving users control over whether to enable or disable mps is possible

Is this only possible with the patch, or is there a setting in one of the layers? If not, it sounds risky, as it is not using a public API.

fnhirwa commented 2 months ago

A user currently does not disable mps; it is enabled by default if it is present.

  • which layer does this - torch or pytorch-forecasting?
  • if there is defaulting depending on available or not, why is it failing on macos-latest

This is on the torch layer. The reason it is failing on macos-latest is that installing torch on macOS includes the mps backend in the build (there is no cpu-only install), while, as specified in the GitHub Actions docs https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners, the runner itself does not support MPS.

Giving users control over whether to enable or disable mps is possible

Is this only possible with the patch, or is there a setting in one of the layers? If not, it sounds risky, as it is not using a public API.

Yeah, I agree with this; there is no setting in one of the layers for this. The better option is explaining the possible challenges that may arise when using mps, including failures for some ops that aren't yet implemented in torch for mps, and why using PYTORCH_ENABLE_MPS_FALLBACK=1 is advisable in such cases.
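
For the documentation point above, a minimal usage sketch; it assumes (as PyTorch's fallback warning suggests) that the variable needs to be set before torch is imported:

```python
import os

# Must be set before importing torch so that ops without an mps kernel
# fall back to cpu instead of raising.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
```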

fkiraly commented 2 months ago

Thanks for your explanations, I'm still not getting a good grip on the situation yet, unfortunately.

May I kindly ask you to take some time to explain:

XinyuWuu commented 2 months ago

I do not think a regular user will have the error.

It's caused by the lack of nested virtualization on Apple macOS. To my understanding, nested virtualization means having one layer of virtualization that provides virtual hardware, and then another layer of virtualization that provides hardware- and kernel-isolated software, such as a Hyper-V isolated container or VM. So if a user has a container on a local macOS machine, there should be no problem. If a user has a regular container with shared hardware and kernel inside a hypervisor-isolated VM, there should be no problem either. If a user has a hypervisor-isolated VM or container inside another hypervisor-isolated VM or container, there will be a problem. For GitHub runners, I guess Microsoft has a first layer of hypervisor to build the cloud (Linux KVM or Microsoft Hyper-V based VMs), and then another layer of hypervisor to provide hardware- and kernel-isolated macOS VMs.

Reference for Github runner problem: https://github.com/jdb78/pytorch-forecasting/issues/1596#issuecomment-2330768511

Some links about nested virtualization: https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-guide/nested-virtualization https://cloud.google.com/compute/docs/instances/nested-virtualization/overview

fnhirwa commented 2 months ago

Adding to what @XinyuWuu said above, there is a tracker issue for torch operations that aren't yet covered on the mps accelerator: https://github.com/pytorch/pytorch/issues/77764. This forces users to use the fallback env variable in case their torch code requires one of those ops.

fkiraly commented 2 months ago

Adding to what @XinyuWuu said above, there is a tracker issue for torch operations that aren't yet covered on the mps accelerator: pytorch/pytorch#77764

Issue with the most participants I've seen so far

fkiraly commented 2 months ago

I do not think a regular user will have the error.

If this is the case, then my questions are:

  • in what precise respect does the VM differ from the "regular user" condition
  • what is the "regular user condition", more generally
  • how and why can we be sure that the tests cover the regular user condition, post PR?

fnhirwa commented 2 months ago

I do not think a regular user will have the error.

If this is the case, then my questions are:

  • in what precise respect does the VM differ from the "regular user" condition

In general, the difference between the VM and a regular user is mps support, which isn't available on the VM, as mentioned in the limitations section of the GitHub Actions docs here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners.

This implies that anything running on these runners will run on cpu.

  • what is the "regular user condition", more generally

For general users, arm64 macOS supports mps, and they can use both cpu and mps as accelerators.

  • how and why can we be sure that the tests cover the regular user condition, post PR?

I believe that post-PR the tests will be forced to run on cpu. It is reasonable to document this as a limitation for arm64 macOS and to state that we don't thoroughly test on mps; with the current setup we don't generally test on any accelerator other than cpu on the other runners either.

fkiraly commented 2 months ago

Oh, I see. May I repeat back to check whether I understood, it would be appreciated if you could check my understanding, @fnhirwa:

  • regular user systems of Mac users will have "MPS" as their standard backend, used by default. They would have to manually deactivate or override to use cpu backend. They will also have no problems running torch on the default.
  • the GH VMs do not provide MPS support. Yet torch tries to use it as default on mac systems, and fails, as it is not available.
  • we can patch torch to force using cpu even in a case where it thinks mps should be present. That is what this PR does.
  • Therefore, we will be testing the condition where the OS is Mac, but no MPS backend is found available.

Is this correct?

If it is, I have further questions:

  • what exactly differs between macos-13 and macos-latest (= macos-14) runners here? Why does macos-13 not crash without the patch, while macos-14 does? Nothing in the runners page seems to indicate a clear difference to me.
  • is it correct that, whether we patch or not, whether we use macos-13 or macos-14, we will always be testing on cpu backend?

XinyuWuu commented 2 months ago

  • what exactly differs between macos-13 and macos-latest (= macos-14) runners here? Why does macos-13 not crash without the patch, while macos-14 does? Nothing in the runners page seems to indicate a clear difference to me.

macos-13 runners use Intel chips, so torch._C._mps_is_available will return False on macos-13 even without the patch.

  • is it correct that, whether we patch or not, whether we use macos-13 or macos-14, we will always be testing on cpu backend?

Yes.

fnhirwa commented 2 months ago

Oh, I see. May I repeat back to check whether I understood, it would be appreciated if you could check my understanding, @fnhirwa:

  • regular user systems of Mac users will have "MPS" as their standard backend, used by default. They would have to manually deactivate or override to use cpu backend. They will also have no problems running torch on the default.
  • the GH VMs do not provide MPS support. Yet torch tries to use it as default on mac systems, and fails, as it is not available.
  • we can patch torch to force using cpu even in a case where it thinks mps should be present. That is what this PR does.
  • Therefore, we will be testing the condition where the OS is Mac, but no MPS backend is found available.

Is this correct?

Yes, this is correct.

If it is, I have further questions:

  • what exactly differs between macos-13 and macos-latest (= macos-14) runners here? Why does macos-13 not crash without the patch, while macos-14 does? Nothing in the runners page seems to indicate a clear difference to me.
  • is it correct that, whether we patch or not, whether we use macos-13 or macos-14, we will always be testing on cpu backend?

As @XinyuWuu specified, we will always be testing on cpu.