kornia / kornia

Geometric Computer Vision Library for Spatial AI
https://kornia.readthedocs.io
Apache License 2.0

List of crashing tests in M1 GPU (mps) #1717

Open ducha-aiki opened 2 years ago

ducha-aiki commented 2 years ago

Describe the bug

This issue lists all the test cases that crash on the M1 GPU (torch.device('mps')), so that we can at least run all the remaining tests, whether they pass or fail.

Reproduction steps

skip

Expected behavior

skip

Environment

skip

Additional context

No response

edgarriba commented 2 years ago

@ducha-aiki I think it's a bit more work, but it would be great to know exactly what makes this crash so that we can share it with the pytorch-core team.

ducha-aiki commented 2 years ago

Yes, that is what I am figuring out now

ducha-aiki commented 2 years ago

It seems that what crashes is the function assert_close when one of the arguments is inf or nan.

Example to reproduce

import torch
from torch.testing import assert_close

a = torch.ones(1)
b = torch.zeros(1)
inf = a / b  # 1/0 gives inf
nan = b / b  # 0/0 gives nan

cpu = torch.device('cpu')
mps = torch.device('mps')
# Merely holding inf and nan values on the MPS device works.
print("mps is ok with having nan and inf", inf.to(mps), nan.to(mps))

# On CPU, assert_close reports the mismatch as a regular error.
print("assert_close on CPU")
try:
    assert_close(a.to(cpu), inf.to(cpu))
except Exception as er:
    print(er)

# On MPS, the same comparison ends in an unexpected exception.
print("assert_close on MPS")
try:
    assert_close(a.to(mps), inf.to(mps))
except Exception as er:
    print(er)

Output:

mps is ok with having nan and inf tensor([inf], device='mps:0') tensor([nan], device='mps:0')
assert_close on CPU
Tensor-likes are not close!

Mismatched elements: 1 / 1 (100.0%)
Greatest absolute difference: inf at index (0,) (up to 1e-05 allowed)
Greatest relative difference: nan at index (0,) (up to 1.3e-06 allowed)

assert_close on MPS
Comparing

TensorLikePair(
    id=(),
    actual=tensor([1.], device='mps:0'),
    expected=tensor([inf], device='mps:0'),
    rtol=1.3e-06,
    atol=1e-05,
    equal_nan=False,
    check_device=True,
    check_dtype=True,
    check_layout=True,
    check_stride=False,
    check_is_coalesced=True,
)

resulted in the unexpected exception above. If you are a user and see this message during normal operation please file an issue at https://github.com/pytorch/pytorch/issues. If you are a developer and working on the comparison functions, please except the previous error and raise an expressive `ErrorMeta` instead.

ducha-aiki commented 2 years ago

I have opened an issue here: https://github.com/pytorch/pytorch/issues/77957
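Until the upstream issue is resolved, one possible way to sidestep the assert_close crash described above is to move both tensors to the CPU before comparing them. The following is only a sketch under that assumption, not something proposed in this thread: the helper name assert_close_on_cpu and its compare parameter are hypothetical, and it has not been verified on an M1 machine.

```python
def assert_close_on_cpu(actual, expected, compare=None, **kwargs):
    """Compare two tensors after moving both of them to the CPU.

    Hypothetical helper: `compare` defaults to torch.testing.assert_close
    and only exists so that a different comparison function can be swapped in.
    """
    if compare is None:
        # Imported lazily so the helper can be defined without PyTorch installed.
        from torch.testing import assert_close as compare
    # .cpu() returns the same tensor if it already lives on the CPU.
    compare(actual.cpu(), expected.cpu(), **kwargs)
```

With the reproduction script above, assert_close_on_cpu(a.to(mps), inf.to(mps)) should then take the same code path as the CPU case, at the cost of a device-to-host copy per comparison.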

ducha-aiki commented 2 years ago

https://github.com/pytorch/pytorch/issues/77958

ducha-aiki commented 2 years ago

Here is an issue on pytorch github to track all the missing operations on M1 https://github.com/pytorch/pytorch/issues/77764

ducha-aiki commented 2 years ago

The situation is getting better with torch==1.13.0.dev20220521 when running with PYTORCH_ENABLE_MPS_FALLBACK=1, but there are still crashes.
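For reference, a minimal sketch of launching the test suite with the fallback variable set. The helper names mps_fallback_env and pytest_command, and the default test path "test/", are assumptions made for illustration; only the variable name PYTORCH_ENABLE_MPS_FALLBACK comes from this thread.

```python
import os
import sys


def mps_fallback_env(base_env=None):
    """Return a copy of the environment with the MPS CPU fallback enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    return env


def pytest_command(test_path="test/"):
    """Build the command that runs pytest with the current interpreter."""
    return [sys.executable, "-m", "pytest", test_path]
```

The tests could then be started with subprocess.run(pytest_command(), env=mps_fallback_env()). Setting the variable in the child process environment means it is in place before PyTorch is imported there.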

ducha-aiki commented 2 years ago

https://github.com/pytorch/pytorch/issues/78247

ducha-aiki commented 2 years ago

https://github.com/pytorch/pytorch/issues/85143 - this crash blocks a lot of kornia functionality.

Other than that, things that can be temporarily fixed on our side:

There are some other issues, but these are the main ones so far. I would say that the first goal is to pass the tests when running with PYTORCH_ENABLE_MPS_FALLBACK=1, and then to move more toward native MPS.

edgarriba commented 2 years ago

spatial_gradient should be refactored. @lferraz reported that some functions that rely on it, e.g. sobel, are extremely slow. My guess is that this is because of conv3d.

As for the rest, I'm fine with adapting it too.

ducha-aiki commented 2 years ago

@edgarriba I will do these things in separate PRs once the basics for testing (https://github.com/kornia/kornia/pull/1716) are merged.

ducha-aiki commented 2 years ago

@edgarriba spatial gradient is fixed here: https://github.com/kornia/kornia/pull/1898. There is also "spatial_gradient3d", which is not fixed, but there 3D convolution is actually the correct algorithm, and it is much more rarely used.

ducha-aiki commented 2 years ago

https://github.com/pytorch/pytorch/issues/86107

ducha-aiki commented 1 year ago

test/geometry/test_ransac.py::TestRANSACHomography::test_dirty_points[mps-float32] Fatal Python error: Aborted

ducha-aiki commented 1 year ago

test/feature/test_affine_shape_estimator.py::TestLAFAffineShapeEstimator::test_shape[mps] Fatal Python error: Aborted
test/feature/test_matching.py::TestAdalam::test_single_nocrash[mps-float32-adalam_idxs] Fatal Python error: Aborted

ducha-aiki commented 1 year ago

test/geometry/liegroup/test_so3.py::TestSo3::test_matrix[mps-float32-1] Fatal Python error: Aborted
test/geometry/liegroup/test_se2.py::TestSe2::test_cardinality[mps-float32-input_shape0] Fatal Python error: Aborted

gau-nernst commented 1 year ago

Hello, I saw the "Make kornia fully-runnable on Apple Silicon chips" project on the Google Summer of Code page and I'm interested in helping. May I know how I can help specifically? Some ideas off the top of my head:

Also, what is the target PyTorch version? I'm guessing it's PyTorch 2.0?

ducha-aiki commented 1 year ago

Hi @gau-nernst ,

Thank you for your interest! The options you mentioned are exactly what would be helpful :) A semi-ordered list:

Crashes: it crashes Python altogether.

1) Pick a crash from the list above, find a minimal reproducing example, and report it to PyTorch, similarly to https://github.com/pytorch/pytorch/issues/86107

2) If you are experienced with Apple Metal, you may try to fix it in the PyTorch code.

Failures: it doesn't crash, but throws an error or gives incorrect results (https://github.com/kornia/kornia/issues/2224). In this case the main options would be:

a) Pick a test that gives an incorrect result, provide a minimal example, and report it to the PyTorch core repo.

b) For tests that fail because of unsupported operations, first check whether the operation is in the list at https://github.com/pytorch/pytorch/issues/77764 and whether any work on it is in progress.

c) If it is unlikely to be fixed in PyTorch core, write custom PyTorch code. Instead of multiple dispatch, we can just check the device and, if it is mps, call the custom operation.

And yes, we are targeting PyTorch 2.0.
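Option c) above, checking the device instead of relying on dispatch, might look roughly like the sketch below. The function name run_with_mps_override and its parameters are hypothetical and not part of kornia. The only property it relies on is that a PyTorch tensor exposes its device as tensor.device.type.

```python
def run_with_mps_override(tensor, default_op, mps_op, *args, **kwargs):
    """Call mps_op for tensors on the MPS device and default_op otherwise.

    Hypothetical helper: both ops must accept the tensor as their first
    argument, followed by the same extra arguments.
    """
    # tensor.device.type is a string such as "cpu", "cuda" or "mps".
    op = mps_op if tensor.device.type == "mps" else default_op
    return op(tensor, *args, **kwargs)
```

A call such as run_with_mps_override(x, native_impl, custom_impl) keeps the native implementation on CPU and CUDA and only swaps in the custom code on Apple Silicon, so the override can be deleted once PyTorch core supports the operation.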

ducha-aiki commented 1 year ago

test/augmentation/test_augmentation_3d.py::TestRandomHorizontalFlip3D::test_random_hflip[mps] Fatal Python error: Aborted
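Since the purpose of this issue is to be able to run everything that does not crash, the crash lines collected in this thread can be turned into pytest --deselect arguments. This is a small sketch that assumes the log format shown above (node ID followed by "Fatal Python error"); the function names and the "test/" path are made up for illustration.

```python
import shlex

# Two crash lines copied from this thread, used as sample input.
SAMPLE_LOG = """\
test/geometry/test_ransac.py::TestRANSACHomography::test_dirty_points[mps-float32] Fatal Python error: Aborted
test/geometry/liegroup/test_so3.py::TestSo3::test_matrix[mps-float32-1] Fatal Python error: Aborted
"""


def crashing_test_ids(log_text):
    """Collect pytest node IDs from lines that contain a fatal-error marker."""
    ids = []
    for line in log_text.splitlines():
        node_id, marker, _ = line.strip().partition(" Fatal Python error")
        # Keep only lines that look like node IDs, and skip duplicates.
        if marker and "::" in node_id and node_id not in ids:
            ids.append(node_id)
    return ids


def deselect_command(test_ids, test_path="test/"):
    """Build a shell-quoted pytest command that skips the given node IDs."""
    parts = ["pytest", test_path]
    for test_id in test_ids:
        parts += ["--deselect", test_id]
    return shlex.join(parts)


if __name__ == "__main__":
    print(deselect_command(crashing_test_ids(SAMPLE_LOG)))
```

The quoting matters because the parametrized IDs contain square brackets, which most shells would otherwise try to expand.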

crazyfish2020 commented 1 year ago

Textual inversion embeddings loaded(0):
Model loaded in 8.3s (load weights from disk: 0.3s, create model: 0.8s, apply weights to model: 5.1s, move model to device: 2.0s).
./extensions/SadTalkerm1/checkpoints/auido2pose_00140-model.pth
./extensions/SadTalkerm1/checkpoints/shape_predictor_68_face_landmarks.dat
./extensions/SadTalkerm1/checkpoints/facevid2vid_00189-model.pth.tar
/var/folders/6d/69zsy_g132gcld2sxqz7ks6w0000gn/T/gradio/4b4ef30ff7fadbc27f3d443e73adccb08df1d27e/tmpjguvsh4p.png
landmark Det:: 100%|██████████████████████████████| 1/1 [00:03<00:00, 3.09s/it]
3DMM Extraction In Video:: 100%|██████████████████| 1/1 [00:00<00:00, 5.03it/s]
mel:: 100%|███████████████████████████████| 762/762 [00:00<00:00, 130191.03it/s]
audio2exp:: 100%|██████████████████████████████| 77/77 [00:00<00:00, 143.02it/s]
Traceback (most recent call last):
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 414, in run_predict
    output = await app.get_blocks().process_api(
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1051, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "/Users/richard/stable-diffusion-webui/modules/call_queue.py", line 15, in f
    res = func(*args, **kwargs)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/gradio_demo.py", line 124, in test
    return_path = self.animate_from_coeff.generate(data, save_dir, pic_path, crop_info, enhancer='gfpgan' if use_enhancer else None, preprocess=preprocess)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/animate.py", line 149, in generate
    predictions_video = make_animation(source_image, source_semantics, target_semantics,
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/modules/make_animation.py", line 109, in make_animation
    kp_canonical = kp_detector(source_image)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/modules/keypoint_detector.py", line 60, in forward
    feature_map = self.predictor(x)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/modules/util.py", line 365, in forward
    out = self.up_blocks(out)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/modules/util.py", line 189, in forward
    out = self.conv(out)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 613, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 608, in _conv_forward
    return F.conv3d(
RuntimeError: Conv3D is not supported on MPS

kyle-dorman commented 2 months ago

MPS doesn't support float64 😢

  File "/Users/kyledorman/Documents/kelp/.venv/lib/python3.11/site-packages/kornia/geometry/transform/imgwarp.py", line 355, in get_perspective_transform
    X: Tensor = _torch_solve_cast(A, b)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kyledorman/Documents/kelp/.venv/lib/python3.11/site-packages/kornia/utils/helpers.py", line 232, in _torch_solve_cast
    out = torch.linalg.solve(A.to(torch.float64), B.to(torch.float64))
                             ^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
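One conceivable way around the float64 cast in the traceback above is to choose the working precision from the device, as the error message itself suggests. The sketch below is an assumption for illustration, not kornia's actual fix: solve_dtype_name and solve_cast_mps_aware are hypothetical names, and whether torch.linalg.solve runs natively on MPS is not checked here.

```python
def solve_dtype_name(device_type, preferred="float64", fallback="float32"):
    """Pick the dtype name for the solve: float32 on MPS, float64 elsewhere."""
    return fallback if device_type == "mps" else preferred


def solve_cast_mps_aware(A, B):
    """Solve A @ X = B in the widest dtype the device supports.

    Hypothetical variant of the cast-solve-cast-back pattern seen in the
    traceback; the result is returned in the original dtype of A.
    """
    import torch  # imported lazily so the dtype helper works without PyTorch

    dtype = getattr(torch, solve_dtype_name(A.device.type))
    out = torch.linalg.solve(A.to(dtype), B.to(dtype))
    return out.to(A.dtype)
```

The trade-off is reduced precision for the solve on MPS. If float64 accuracy matters, moving A and B to the CPU for this one call would be the alternative.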