ducha-aiki opened this issue 2 years ago
@ducha-aiki I think it's a bit more work, but it would be great to know the exact thing that makes this crash, so that we can share it with the pytorch-core team?
Yes, that is what I am figuring out now.
It seems that what is crashing is the function assert_close when one of the arguments is inf or nan.
Example to reproduce:

import torch
from torch.testing import assert_close

a = torch.ones(1)
b = torch.zeros(1)
inf = a / b
nan = b / b
cpu = torch.device('cpu')
mps = torch.device('mps')

print("mps is ok with having nan and inf", inf.to(mps), nan.to(mps))

print("assert_close on CPU")
try:
    assert_close(a.to(cpu), inf.to(cpu))
except Exception as er:
    print(er)

print("assert_close on MPS")
try:
    assert_close(a.to(mps), inf.to(mps))
except Exception as er:
    print(er)
Output:
mps is ok with having nan and inf tensor([inf], device='mps:0') tensor([nan], device='mps:0')
assert_close on CPU
Tensor-likes are not close!
Mismatched elements: 1 / 1 (100.0%)
Greatest absolute difference: inf at index (0,) (up to 1e-05 allowed)
Greatest relative difference: nan at index (0,) (up to 1.3e-06 allowed)
assert_close on MPS
Comparing
TensorLikePair(
id=(),
actual=tensor([1.], device='mps:0'),
expected=tensor([inf], device='mps:0'),
rtol=1.3e-06,
atol=1e-05,
equal_nan=False,
check_device=True,
check_dtype=True,
check_layout=True,
check_stride=False,
check_is_coalesced=True,
)
resulted in the unexpected exception above. If you are a user and see this message during normal operation please file an issue at https://github.com/pytorch/pytorch/issues. If you are a developer and working on the comparison functions, please except the previous error and raise an expressive `ErrorMeta` instead.
Have opened issue here https://github.com/pytorch/pytorch/issues/77957
Here is an issue on pytorch github to track all the missing operations on M1 https://github.com/pytorch/pytorch/issues/77764
The situation is getting better with torch==1.13.0.dev20220521 when running with PYTORCH_ENABLE_MPS_FALLBACK=1, but there are still crashes:
https://github.com/pytorch/pytorch/issues/85143 - this blocks a lot of kornia functionality by crashing.
Other than that, the following can be temporarily worked around on our side:
- aten::_linalg_det.result is not supported yet (we can use a manual determinant calculation; see the sketch after this list)
- aten::remainder.Tensor_out operator is not supported - important for color and SIFT
- cdist is not supported, but we can work around it
- std_mean is not supported, but std and mean separately are supported
- aten::_index_put_impl_ - morphology and others
- avg_pool3d - in HyNet and SOSNet
- aten::linalg_cross.out - in geometry
There is some other stuff, but these are the main ones so far.
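As a sketch of the kind of workaround meant above (a hypothetical helper, not kornia's actual code), a manual 3x3 determinant that avoids the missing aten::_linalg_det kernel:

import torch

def det_3x3(m: torch.Tensor) -> torch.Tensor:
    # Cofactor expansion along the first row for (..., 3, 3) inputs; a
    # stand-in for torch.linalg.det while the op is missing on MPS.
    return (
        m[..., 0, 0] * (m[..., 1, 1] * m[..., 2, 2] - m[..., 1, 2] * m[..., 2, 1])
        - m[..., 0, 1] * (m[..., 1, 0] * m[..., 2, 2] - m[..., 1, 2] * m[..., 2, 0])
        + m[..., 0, 2] * (m[..., 1, 0] * m[..., 2, 1] - m[..., 1, 1] * m[..., 2, 0])
    )

Similarly, std_mean can be replaced by calling x.std(...) and x.mean(...) separately, since both are supported.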
I would say that the first goal would be to pass the tests in a PYTORCH_ENABLE_MPS_FALLBACK=1 run, and then move more to native MPS.
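For reference, a minimal sketch of enabling the fallback from Python (it can equally be exported in the shell before launching pytest):

import os

# Setting this before torch is imported is the safe way to make unsupported
# MPS ops fall back to the CPU instead of raising.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch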
spatial_gradient should be refactored -- @lferraz got reports that the performance of some functions that rely on it, e.g. sobel, is extremely slow. My guess is that this is because of conv3d (see the sketch below).
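To illustrate the direction (only a sketch of the idea, not kornia's actual spatial_gradient implementation), first-order Sobel gradients can be computed with a flattened 2D convolution instead of a 3D one:

import torch
import torch.nn.functional as F

def sobel_gradient(img: torch.Tensor) -> torch.Tensor:
    # img: (B, C, H, W) -> gradients: (B, C, 2, H, W), using conv2d only.
    kx = torch.tensor([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]], device=img.device, dtype=img.dtype)
    ky = kx.transpose(0, 1)
    kernels = torch.stack([kx, ky]).unsqueeze(1)       # (2, 1, 3, 3)
    b, c, h, w = img.shape
    flat = img.reshape(b * c, 1, h, w)                 # treat channels independently
    padded = F.pad(flat, (1, 1, 1, 1), mode="replicate")
    grads = F.conv2d(padded, kernels)                  # (B*C, 2, H, W)
    return grads.reshape(b, c, 2, h, w)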
As for the rest, I'm fine with adapting too.
@edgarriba I will do these things in separate PRs once the basics for testing in https://github.com/kornia/kornia/pull/1716 are merged.
@edgarriba spatial gradient is fixed here: https://github.com/kornia/kornia/pull/1898. There is also spatial_gradient3d, which is not fixed, but there the 3D convolution actually is the correct algorithm, and it is used much more rarely.
test/geometry/test_ransac.py::TestRANSACHomography::test_dirty_points[mps-float32] Fatal Python error: Aborted
test/feature/test_affine_shape_estimator.py::TestLAFAffineShapeEstimator::test_shape[mps] Fatal Python error: Aborted
test/feature/test_matching.py::TestAdalam::test_single_nocrash[mps-float32-adalam_idxs] Fatal Python error: Aborted
test/geometry/liegroup/test_so3.py::TestSo3::test_matrix[mps-float32-1] Fatal Python error: Aborted
test/geometry/liegroup/test_se2.py::TestSe2::test_cardinality[mps-float32-input_shape0] Fatal Python error: Aborted
Hello, I saw the "Make kornia fully-runnable on Apple Silicon chips" project on the Google Summer of Code page and I'm interested in helping. May I know how I can help specifically? Some ideas off the top of my head:
Also, what is the target PyTorch version? I'm guessing it's PyTorch 2.0?
Hi @gau-nernst ,
Thank you for your interest! The options you mentioned are exactly what would be helpful :) A semi-ordered list:
Crashes: these crash Python altogether.
1) Pick a crash from the list above, find out the minimal reproducing example, and report it to the PyTorch team, similarly to https://github.com/pytorch/pytorch/issues/86107
2) If you are experienced with Apple Metal, you may try to fix it in the PyTorch code.
Failures: it doesn't crash, but throws an error or gives incorrect results (https://github.com/kornia/kornia/issues/2224). In this case the main options would be:
a) Pick a test which gives an incorrect result, provide a minimal example, and report it to the PyTorch core repo.
b) For tests which fail because of unsupported operations, first check whether the op is in the list https://github.com/pytorch/pytorch/issues/77764 and whether any work on it is going on.
c) If it is unlikely to be fixed in PyTorch core, write custom PyTorch code. Instead of multiple dispatch we can just check the device, and if it is mps, call the custom operation (see the sketch below).
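A minimal sketch of the dispatch-by-device idea from (c), using a hypothetical helper around one of the unsupported ops (remainder) purely as an illustration:

import torch

def remainder_mps_safe(x: torch.Tensor, other: float) -> torch.Tensor:
    # aten::remainder.Tensor_out has no MPS kernel (see the list above), so on
    # MPS compute the remainder from supported primitives; elsewhere keep the
    # native op. For floats, remainder(a, b) == a - b * floor(a / b).
    if x.device.type == "mps":
        return x - other * torch.floor(x / other)
    return torch.remainder(x, other)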
And yes, we are targeting PyTorch 2.0.
test/augmentation/test_augmentation_3d.py::TestRandomHorizontalFlip3D::test_random_hflip[mps] Fatal Python error: Aborted
Textual inversion embeddings loaded(0):
Model loaded in 8.3s (load weights from disk: 0.3s, create model: 0.8s, apply weights to model: 5.1s, move model to device: 2.0s).
./extensions/SadTalkerm1/checkpoints/auido2pose_00140-model.pth
./extensions/SadTalkerm1/checkpoints/shape_predictor_68_face_landmarks.dat
./extensions/SadTalkerm1/checkpoints/facevid2vid_00189-model.pth.tar
/var/folders/6d/69zsy_g132gcld2sxqz7ks6w0000gn/T/gradio/4b4ef30ff7fadbc27f3d443e73adccb08df1d27e/tmpjguvsh4p.png
landmark Det:: 100%|██████████████████████████████| 1/1 [00:03<00:00, 3.09s/it]
3DMM Extraction In Video:: 100%|██████████████████| 1/1 [00:00<00:00, 5.03it/s]
mel:: 100%|███████████████████████████████| 762/762 [00:00<00:00, 130191.03it/s]
audio2exp:: 100%|██████████████████████████████| 77/77 [00:00<00:00, 143.02it/s]
Traceback (most recent call last):
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 414, in run_predict
    output = await app.get_blocks().process_api(
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1051, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "/Users/richard/stable-diffusion-webui/modules/call_queue.py", line 15, in f
    res = func(*args, **kwargs)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/gradio_demo.py", line 124, in test
    return_path = self.animate_from_coeff.generate(data, save_dir, pic_path, crop_info, enhancer='gfpgan' if use_enhancer else None, preprocess=preprocess)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/animate.py", line 149, in generate
    predictions_video = make_animation(source_image, source_semantics, target_semantics,
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/modules/make_animation.py", line 109, in make_animation
    kp_canonical = kp_detector(source_image)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/modules/keypoint_detector.py", line 60, in forward
    feature_map = self.predictor(x)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/modules/util.py", line 365, in forward
    out = self.up_blocks(out)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/stable-diffusion-webui/extensions/SadTalkerM1/src/facerender/modules/util.py", line 189, in forward
    out = self.conv(out)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 613, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Users/richard/miniconda3/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 608, in _conv_forward
    return F.conv3d(
RuntimeError: Conv3D is not supported on MPS
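Until a native kernel exists, a possible stopgap (a hedged sketch, not part of SadTalker or kornia) is to route conv3d through the CPU when the input lives on MPS:

import torch
import torch.nn.functional as F

def conv3d_mps_safe(x, weight, bias=None, **kwargs):
    # Conv3D has no MPS kernel, so compute on CPU and move the result back.
    # Correct but slow; PYTORCH_ENABLE_MPS_FALLBACK=1 does the same automatically.
    if x.device.type == "mps":
        cpu_bias = bias.cpu() if bias is not None else None
        return F.conv3d(x.cpu(), weight.cpu(), cpu_bias, **kwargs).to(x.device)
    return F.conv3d(x, weight, bias, **kwargs)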
MPS doesn't support float64 😢
File "/Users/kyledorman/Documents/kelp/.venv/lib/python3.11/site-packages/kornia/geometry/transform/imgwarp.py", line 355, in get_perspective_transform
X: Tensor = _torch_solve_cast(A, b)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kyledorman/Documents/kelp/.venv/lib/python3.11/site-packages/kornia/utils/helpers.py", line 232, in _torch_solve_cast
out = torch.linalg.solve(A.to(torch.float64), B.to(torch.float64))
^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
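A possible adaptation (just a sketch, not kornia's actual _torch_solve_cast, and assuming torch.linalg.solve itself works on MPS in the targeted build; otherwise a CPU round trip is needed) is to pick the cast dtype based on the device:

import torch

def solve_cast(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # MPS has no float64, so cast to float32 there (at some precision cost)
    # and keep the float64 cast everywhere else.
    dtype = torch.float32 if A.device.type == "mps" else torch.float64
    out = torch.linalg.solve(A.to(dtype), B.to(dtype))
    return out.to(A.dtype)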
Describe the bug
This issue is created to list all the test cases which crash on the M1 GPU (torch.device('mps')), so that we can skip them (see the sketch at the end of this issue) and at least run all the remaining tests, whether they pass or fail.
Reproduction steps
Expected behavior
skip
Environment
Additional context
No response
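Since the expected behavior is "skip", one way this could look (a hypothetical marker, not kornia's existing test infrastructure) is a device-conditional skip so the rest of the suite still runs:

import pytest
import torch

# Hypothetical marker: skip a known-crashing case on machines where the MPS
# backend is available, so the test process is not aborted.
requires_no_mps_crash = pytest.mark.skipif(
    torch.backends.mps.is_available(), reason="known to crash the process on MPS"
)

@requires_no_mps_crash
def test_known_mps_crash():
    ...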