pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

adjust_hue broken on ARM64 #8574

Open sclarkson opened 1 month ago

sclarkson commented 1 month ago

🐛 Describe the bug

https://github.com/pytorch/vision/pull/8463 introduced a cast whose result depends on the platform and compiler: a negative float is converted directly to uint8. See this comment for more details: https://github.com/numpy/numpy/issues/23481#issuecomment-1488011976

I ran this code snippet on x86_64 and ARM64 machines to demonstrate the problem.

import sys
import platform
import numpy as np

print("Architecture:", platform.machine())
print("Numpy version:", np.__version__)
print("Python version", sys.version)

print()

print("Current code:", np.array(-0.5*255).astype(np.uint8))
print("Proposed code:", np.int32(-0.5*255).astype(np.uint8))

x86_64:

Architecture: x86_64
Numpy version: 2.0.1
Python version 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]

Current code: 129
Proposed code: 129

ARM64:

Architecture: aarch64
Numpy version: 2.0.1
Python version 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]

/root/test.py:11: RuntimeWarning: invalid value encountered in cast
  print("Current code:", np.array(-0.5*255).astype(np.uint8))
Current code: 0
Proposed code: 129
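
For reference, here is a minimal sketch of why 129 is the expected value and why going through a signed integer first should give the same answer everywhere. The helper name `hue_shift_uint8` is made up for illustration and is not part of torchvision. It assumes `hue_factor` lies in [-0.5, 0.5], so the intermediate value always fits in an int32.

```python
import numpy as np


def hue_shift_uint8(hue_factor: float) -> np.uint8:
    """Hypothetical helper: turn a hue factor into a uint8 offset."""
    # float -> int32 truncates toward zero; -127.5 becomes -127.
    # The value is in range for int32, so this step is well defined.
    shift = np.int32(hue_factor * 255)
    # int32 -> uint8 is an integer-to-integer cast and wraps modulo 256,
    # so -127 becomes 256 - 127 = 129 regardless of architecture.
    return shift.astype(np.uint8)


print(hue_shift_uint8(-0.5))
print(hue_shift_uint8(0.5))
```

The direct float-to-uint8 cast skips the signed intermediate step, which is where the x86_64 and ARM64 results diverge.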

Versions

PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-35-generic-aarch64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:                       aarch64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
CPU(s):                             80
On-line CPU(s) list:                0-79
Vendor ID:                          ARM
Model name:                         Neoverse-N1
Model:                              1
Thread(s) per core:                 1
Core(s) per socket:                 80
Socket(s):                          1
Stepping:                           r3p1
Frequency boost:                    disabled
CPU max MHz:                        3000.0000
CPU min MHz:                        1000.0000
BogoMIPS:                           50.00
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
L1d cache:                          5 MiB (80 instances)
L1i cache:                          5 MiB (80 instances)
L2 cache:                           80 MiB (80 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-79
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; CSV2, BHB
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==2.0.1
[conda] Could not collect

sclarkson commented 3 weeks ago

@NicolasHug

This is still a problem. Here's a more direct example taken from a GH200.

$ pytest test_transforms.py -k test_adjust_hue
================================================= test session starts =================================================
platform linux -- Python 3.12.3, pytest-7.4.4, pluggy-1.4.0
rootdir: /home/ubuntu/vision
configfile: pytest.ini
plugins: anyio-4.4.0, hypothesis-6.98.15, xdist-3.4.0, rerunfailures-12.0, libtmux-0.30.2
collected 1316 items / 1315 deselected / 1 selected                                                                   

test_transforms.py F                                                                                            [100%]

====================================================== FAILURES =======================================================
___________________________________________________ test_adjust_hue ___________________________________________________
test_transforms.py:976: in test_adjust_hue
    torch.testing.assert_close(y_np, y_ans)
E   AssertionError: Tensor-likes are not equal!
E   
E   Mismatched elements: 9 / 12 (75.0%)
E   Greatest absolute difference: 226 at index (1, 0, 1)
E   Greatest relative difference: 5.5 at index (0, 0, 2)
================================================== warnings summary ===================================================
test/test_transforms.py::test_adjust_hue
  /home/ubuntu/.local/lib/python3.12/site-packages/torchvision/transforms/_functional_pil.py:113: RuntimeWarning: invalid value encountered in cast
    np_h += np.array(hue_factor * 255).astype(np.uint8)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=============================================== short test summary info ===============================================
FAILED test_transforms.py::test_adjust_hue - AssertionError: Tensor-likes are not equal!
==================================== 1 failed, 1315 deselected, 1 warning in 0.42s ====================================
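
A rough sketch of how the proposed cast could be applied to a hue channel is below. This is an illustration under assumptions, not the actual torchvision code: `shift_hue_channel` is a hypothetical function, and the input is assumed to be a uint8 array of hue values like the `np_h` in the warning above.

```python
import numpy as np


def shift_hue_channel(np_h: np.ndarray, hue_factor: float) -> np.ndarray:
    """Hypothetical sketch: rotate a uint8 hue channel by hue_factor."""
    # Go through int32 first so negative factors wrap modulo 256
    # instead of relying on a float -> uint8 cast.
    shift = np.int32(hue_factor * 255).astype(np.uint8)
    out = np_h.copy()
    # uint8 addition wraps around, which rotates hue across the boundary.
    with np.errstate(over="ignore"):
        out += shift
    return out


hue = np.array([0, 100, 200], dtype=np.uint8)
print(shift_hue_channel(hue, -0.5))
```

With a shift of 129, the three sample values become 129, 229, and 73 (329 modulo 256), and the result should not depend on the architecture.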