Closed: fnhirwa closed this pull request 2 months ago.
:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.
Attention: Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.
Project coverage is 90.03%. Comparing base (e2bf3fd) to head (f3c64ab).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| pytorch_forecasting/utils/_utils.py | 0.00% | 7 Missing :warning: |
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Great, I thought we would have to patch the binary file to overwrite torch._C._mps_is_available, but MonkeyPatch is a much better solution.
Can we put the patch in a global fixture? Then we don't have to request it manually for every test.
# put it in conftest.py
import pytest
@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
"""replace torch._C._mps_is_available for all tests."""
monkeypatch.setattr("torch._C._mps_is_available", lambda: False)
ok, this seems to be working, but I do not understand entirely what you are doing here and why...
Can you kindly explain?
I think it's replacing the torch._C._mps_is_available function to disable MPS. torch._C is a binary library.
ok, this seems to be working, but I do not understand entirely what you are doing here and why...
Can you kindly explain?
Well, the mps backend uses the torch._C._mps_is_available function; patching this function gives us the ability to force torch.backends.mps.is_available to return False, as it generally calls _mps_is_available from the binary torch._C.
For macOS, with these changes we now run the tests on cpu only and the accelerator is disabled.
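For illustration, a minimal sketch of that relationship (the test name is just an example; depending on the torch version, torch.backends.mps.is_available may cache its result, so the patch should be in place before it is first called, which is what the autouse fixture guarantees):

import torch

def test_mps_is_hidden(monkeypatch):
    # replace the binding on the binary module torch._C
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)
    # torch.backends.mps.is_available() delegates to torch._C._mps_is_available()
    assert torch.backends.mps.is_available() is False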
Can we put the patch in a global fixture? Then we don't have to request it manually for every test.
# put it in conftest.py
import pytest

@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
    """replace torch._C._mps_is_available for all tests."""
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)
This is a great solution; making it a global fixture makes more sense.
Question: does the user on macOS have the option to force the backend at runtime?
That is, do the tests faithfully reflect a situation that the user can establish on a Mac?
Because if the user cannot set the flag or something equivalent to it, then our tests will pass, but the user cannot make use of the package. This might sound "trivial" but I just want to emphasize this again: each test should "certify" for a run case on the user's computer.
More specific question: let's take an actual Mac similar to the setup of macos-latest. Can you give me a very concrete condition or preparation code that shows:
- how the user would run the code under the conditions prior to this PR and encounter the error
- how the user would run the code under the conditions after this PR and not encounter the error
If this is something to configure for the user, furthermore, we need to explain it in the documentation as a workaround.
Question: does the user on macOS have the option to force the backend at runtime?
That is, do the tests faithfully reflect a situation that the user can establish on a Mac?
Because if the user cannot set the flag or something equivalent to it, then our tests will pass, but the user cannot make use of the package. This might sound "trivial" but I just want to emphasize this again: each test should "certify" for a run case on the user's computer.
More specific question: let's take an actual Mac similar to the setup of macos-latest. Can you give me a very concrete condition or preparation code that shows:
- how the user would run the code under the conditions prior to this PR and encounter the error
- how the user would run the code under the conditions after this PR and not encounter the error
If this is something to configure for the user, furthermore, we need to explain it in the documentation as a workaround.
Running the code doesn't fail locally, as I am using macOS; the main issue is with the GitHub macOS runner.
The tests are failing upstream due to the lack of nested virtualization and Metal Performance Shaders (MPS), as noted here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners
I would say we can test again with the MPS accelerator once this is solved.
What I mean is: we are patching only the tests, so it looks to me like it would still fail for the user on such a system?
What I mean is: we are patching only the tests, so it looks to me like it would still fail for the user on such a system?
Yeah, we are only patching the tests; this code doesn't make any global changes. Any modifications made have the lifetime of the test being run; once it is done, the system rolls back to the normal setup.
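For illustration, a minimal sketch of that scoping behaviour, using pytest.MonkeyPatch as a context manager outside of a fixture (the function name is just an example):

import pytest
import torch

def demo_patch_lifetime():
    original = torch._C._mps_is_available()
    with pytest.MonkeyPatch.context() as mp:
        # the patch only lives inside this context
        mp.setattr("torch._C._mps_is_available", lambda: False)
        assert torch._C._mps_is_available() is False
    # on leaving the context, the original binding is restored
    assert torch._C._mps_is_available() == original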
May I revert to my original question - could you kindly be precise about which usage condition we are now testing, and what user-sided setup the following two correspond to:
- pre-PR condition under macos-latest
- post-PR condition under macos-13 or macos-latest
May I revert to my original question - could you kindly be precise about which usage condition we are now testing, and what user-sided setup the following two correspond to:
- pre-PR condition under macos-latest
- post-PR condition under macos-13 or macos-latest
On macOS:
pre-PR: under macos-latest the tests were failing on CI due to the lack of nested virtualization and Metal Performance Shaders (MPS). Locally, the tests pass when we add the PYTORCH_ENABLE_MPS_FALLBACK=1 env variable, and the user can use both cpu and mps when that env variable is set.
post-PR: on both macos-13 and macos-latest the mps accelerator is disabled when testing, but the user can still use the mps accelerator when training. When mps-based testing is needed, we have to comment out the introduced fixture and add the PYTORCH_ENABLE_MPS_FALLBACK=1 env variable to enable the torch fallback from mps to cpu for functions not yet implemented on mps. With that setup, you can run the tests locally with mps as an accelerator.
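For illustration, a minimal sketch of the post-PR user setup described above; this script is an assumption about how a user might wire it up, and the env variable is commonly set before torch is imported so the fallback is picked up:

import os

# enable the CPU fallback for ops that are not yet implemented on mps;
# set this before importing torch (or export it in the shell)
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(8, 3, device=device)  # runs on mps when available, else on cpu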
What's the way a user would enable mps or disable mps? Is this hard-coded, or can this be set somewhere?
Depending on whether this is possible or not, we should make sure we give clear pointers in documentation, or set reasonable defaults, or raise informative exceptions.
What's the way a user would enable mps or disable mps? Is this hard-coded, or can this be set somewhere?
Depending on whether this is possible or not, we should make sure we give clear pointers in documentation, or set reasonable defaults, or raise informative exceptions.
A user currently is not disabling mps; it is enabled by default if it is present. What I think of now is adding to the documentation a guide for using the mps accelerator and the need for the PYTORCH_ENABLE_MPS_FALLBACK=1 env variable when using mps.
Giving users device control over whether to enable or disable mps is possible: we can create a mock function that receives a parameter controlling whether to enable it or not and then mocks the function torch._C._mps_is_available to return False. With this, we will need to document the usage of and need for that function.
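For illustration only, a hedged sketch of what such a helper could look like; mps_disabled is a hypothetical name, not an existing pytorch-forecasting or torch API, it uses unittest.mock rather than a test fixture so it could be called at runtime, and it relies on the private torch._C._mps_is_available discussed below:

import contextlib
from unittest import mock

import torch

@contextlib.contextmanager
def mps_disabled(disable: bool = True):
    """Hypothetical helper: hide the mps backend from torch while the context is active."""
    if disable:
        with mock.patch.object(torch._C, "_mps_is_available", lambda: False):
            yield
    else:
        yield

# usage sketch: while the context is active, torch._C._mps_is_available()
# (and hence MPS discovery) reports False
# with mps_disabled():
#     ...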
A user currently is not disabling mps; it is enabled by default if it is present.
- which layer does this - torch or pytorch-forecasting?
- if there is defaulting depending on availability, why is it failing on macos-latest?
Giving users device control over whether to enable or disable mps is possible
Is this only possible with the patch, is there not a setting in one of the layers? If not, it sounds risky as it is not using a public API.
A user currently is not disabling mps; it is enabled by default if it is present.
- which layer does this - torch or pytorch-forecasting?
- if there is defaulting depending on availability, why is it failing on macos-latest?
This is on the torch layer, and the reason why it is failing on macos-latest is that when installing torch on macOS the build has the mps backend built in, as it doesn't install a cpu-only build, while, as specified in the GitHub Actions docs https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners, the runner itself lacks MPS support.
Giving users device control over whether to enable or disable mps is possible
Is this only possible with the patch, is there not a setting in one of the layers? If not, it sounds risky as it is not using a public API.
Yeah, I agree with this; there is no setting in one of the layers for this. Better is explaining the possible challenges that may arise when using mps, including failures for some functions that aren't yet added to torch for mps, and the reason why using PYTORCH_ENABLE_MPS_FALLBACK=1 is advisable for such cases.
Thanks for your explanations, I'm still not getting a good grip of the situation yet, unfortunately.
May I kindly ask you to take some time to explain:
- what mps is exactly. What makes a machine or VM exactly "supporting mps", and how can I determine whether it does? Optimally, with a link to docs.
- how torch determines whether mps is used or not, optimally with a link to docs.
- whether the user can control this at the torch layer, and if yes, how.
- whether the user can control this at the pytorch-forecasting layer, and if yes, how.
I do not think a regular user will have the error.
It's caused by the lack of nested virtualization support in Apple macOS. To my understanding, nested virtualization means having one layer of virtualization to provide virtual hardware and then another layer of virtualization to provide hardware- and kernel-isolated software, such as a Hyper-V isolated container or VM. So if a user has a container on a local macOS machine, there should be no problem. If a user has a regular container with shared hardware and kernel inside a hypervisor-isolated VM, there should be no problem either. If a user has a hypervisor-isolated VM or container inside another hypervisor-isolated VM or container, there will be a problem. For GitHub runners, I guess Microsoft has a first layer of hypervisor to build the cloud (Linux KVM or Microsoft Hyper-V based VMs), then another layer of hypervisor to provide hardware- and kernel-isolated macOS VMs.
Reference for the GitHub runner problem: https://github.com/jdb78/pytorch-forecasting/issues/1596#issuecomment-2330768511
Some links about nested virtualization: https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-guide/nested-virtualization https://cloud.google.com/compute/docs/instances/nested-virtualization/overview
Adding to what @XinyuWuu said above, there is a tracker issue for torch-layer operations that aren't yet covered on the mps accelerator, https://github.com/pytorch/pytorch/issues/77764, which forces users to use the fallback env variable in case some of those ops are required in their torch implementation.
Adding to what @XinyuWuu said above, there is a tracker issue for torch-layer operations that aren't yet covered on the mps accelerator, pytorch/pytorch#77764
Issue with the most participants I've seen so far
I do not think a regular user will have the error.
If this is the case, then my questions are:
- in what precise respect does the VM differ from the "regular user" condition
- what is the "regular user condition", more generally
- how and why can we be sure that the tests cover the regular user condition, post PR?
I do not think a regular user will have the error.
If this is the case, then my questions are:
- in what precise respect does the VM differ from the "regular user" condition
In general, the difference between the VM and a regular user is MPS support, which isn't available on the VM, as mentioned in the limitations in the GitHub Actions docs here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners.
This implies that anything running on these runners will run on cpu.
- what is the "regular user condition", more generally
For general users, arm64 macOS has support for mps and they can use both cpu and mps as accelerators (a minimal check is sketched after this reply).
- how and why can we be sure that the tests cover the regular user condition, post PR?
I believe that post-PR the tests will be forced to run on cpu. This makes it reasonable to state this as a limitation for arm64 macOS and to specify that we don't thoroughly test on mps; also, I think with the current setup we don't generally test on any accelerators other than cpu on the other runners.
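As referenced above, a minimal sketch of how a user can check what their own machine offers, using public torch APIs (the printed values are examples, not guarantees):

import torch

# is_built: was this torch binary compiled with mps support?
# is_available: is a usable MPS device actually present on this machine?
print(torch.backends.mps.is_built())      # e.g. True on macOS arm64 wheels
print(torch.backends.mps.is_available())  # e.g. False on the GitHub macOS runners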
Oh, I see. May I repeat back to check whether I understood, it would be appreciated if you could check my understanding, @fnhirwa:
- regular user systems of Mac users will have "MPS" as their standard backend, used by default. They would have to manually deactivate or override to use the cpu backend. They will also have no problems running torch on the default.
- the GH VMs do not provide MPS support. Yet torch tries to use it as default on Mac systems, and fails, as it is not available.
- we can patch torch to force using cpu even in a case where it thinks mps should be present. That is what this PR does.
- therefore, we will be testing the condition where the OS is Mac, but no MPS backend is found available.
Is this correct?
If it is, I have further questions:
- what exactly differs between macos-13 and macos-latest (= macos-14) runners here? Why does macos-13 not crash without the patch, while macos-14 does? Nothing in the runners page seems to indicate a clear difference to me.
- is it correct that, whether we patch or not, whether we use macos-13 or macos-14, we will always be testing on the cpu backend?
- what exactly differs between macos-13 and macos-latest (= macos-14) runners here? Why does macos-13 not crash without the patch, while macos-14 does? Nothing in the runners page seems to indicate a clear difference to me.
macos-13 is designed for Intel chips. torch._C._mps_is_available will return False on macos-13 even without the patch.
- is it correct that, whether we patch or not, whether we use macos-13 or macos-14, we will always be testing on the cpu backend?
Yes.
Oh, I see. May I repeat back to check whether I understood, it would be appreciated if you could check my understanding, @fnhirwa:
- regular user systems of Mac users will have "MPS" as their standard backend, used by default. They would have to manually deactivate or override to use the cpu backend. They will also have no problems running torch on the default.
- the GH VMs do not provide MPS support. Yet torch tries to use it as default on Mac systems, and fails, as it is not available.
- we can patch torch to force using cpu even in a case where it thinks mps should be present. That is what this PR does.
- therefore, we will be testing the condition where the OS is Mac, but no MPS backend is found available.
Is this correct?
Yes, this is correct.
If it is, I have further questions:
- what exactly differs between macos-13 and macos-latest (= macos-14) runners here? Why does macos-13 not crash without the patch, while macos-14 does? Nothing in the runners page seems to indicate a clear difference to me.
- is it correct that, whether we patch or not, whether we use macos-13 or macos-14, we will always be testing on the cpu backend?
As @XinyuWuu specified, we will always be testing on cpu.
Description
This PR handles the issue that may result when setting the device to mps if the torch version doesn't support the mps backend.
Depends on https://github.com/jdb78/pytorch-forecasting/pull/1633
I used pytest.MonkeyPatch() to disable the discovery of the mps accelerator. The tests run on CPU for macos-latest.
fixes https://github.com/jdb78/pytorch-forecasting/issues/1596