elkay opened 8 months ago
[pip3] torchvision==0.16.2+cu121
[conda] torchvision 0.16.2+cu121 pypi_0 pypi
Try uninstalling these versions first?
What would that accomplish? That's literally the package that I'm trying to use and that is throwing the error.
Built Torch 2.1.2 and TorchVision 2.1.2 from source
What version of torchvision are you building from source, exactly? There's no torchvision 2.x. The latest stable version is 0.17.
The fact that there already is a stable 0.16.2 version installed while you're trying to build from source is very likely to be causing some issues.
Updated original post, torchvision version was a typo.
I did finally get torchvision to build and be functional, but only by forcibly editing the build scripts to pull in my custom build of torch+cuda 2.1.2. The build scripts were importing a non-cuda build because there is no aarch64 torch+cuda out there for pip to pull down. So finally, after forcing my own torch+cuda 2.1.2 whl into the torchvision build, now my torchvision actually works.
I need to say - it's been PAINFUL dealing with building anything that relies on torch because all the build scripts pull down the non-cuda version and mess up the builds. Every time I want to build something relying on torch, now I need to hack in pulling my own torch whl instead for them to work (this also resolved issues I was having building a few other things).
I reaaaaaally hope official aarch64 torch+cuda builds start to be made available so I don't have to keep doing this hackjob.
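In case it helps the next person: by default pip builds packages in an isolated environment and reinstalls the `[build-system]` requirements from PyPI, which on aarch64 means a CPU-only torch, regardless of what is already installed. A sketch of a less invasive workaround, assuming you have a locally built CUDA wheel (the wheel path below is hypothetical):

```shell
# Install the custom CUDA-enabled torch wheel first (path is hypothetical):
pip install /path/to/torch-2.1.2+cu121-cp310-cp310-linux_aarch64.whl

# Then build torchvision against the torch already in the environment,
# instead of letting pip's isolated build env pull a CPU-only torch from PyPI:
pip install -v --no-build-isolation .
```

With `--no-build-isolation` you become responsible for having the other build requirements (e.g. `setuptools`, `wheel`) installed, but it avoids patching the build scripts at all.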
What build script are you referring to? Can you share the build command you used?
The box is shut down, but I believe it was `pyproject.toml` that I had to update to point directly at my torch whl, and the command I used was `python setup.py bdist_wheel`. I had the same outcome with `pip install -v .` to install directly from source, though.
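For reference, the relevant knob in `pyproject.toml` is the `[build-system]` `requires` list, which is exactly what pip's isolated build environment installs. A hedged sketch of the kind of edit described above, pinning the build-time torch to a local wheel via a PEP 508 direct reference (the wheel path is hypothetical, and torchvision's real `requires` list may contain more entries):

```toml
[build-system]
# Point the build-time torch requirement at the locally built CUDA wheel,
# so the isolated build env doesn't fetch the CPU-only aarch64 torch from PyPI.
requires = [
    "setuptools",
    "wheel",
    "torch @ file:///home/ec2-user/wheels/torch-2.1.2+cu121-cp310-cp310-linux_aarch64.whl",
]
```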
🐛 Describe the bug
Built Torch 2.1.2 and TorchVision 0.16.2 from source and running into the following problem:
/home/ec2-user/conda/envs/textgen/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/ec2-user/conda/envs/textgen/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZNK3c1017SymbolicShapeMeta18init_is_contiguousEv'. If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?

Previously the error was about missing libs rather than an undefined symbol, so I believe the libs are correctly installed now. Building says:
So I believe I do have things set up correctly to be able to do image calls (I don't care about video). Any idea why I would still be getting the undefined symbol warning? Thanks!
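For what it's worth, an undefined-symbol error like this generally means the extension was compiled against headers from a different torch build than the one loaded at runtime, rather than a libjpeg/libpng problem. A generic way to inspect it (the `image.so` path is the one from the warning above; `c++filt` and `ldd` are standard toolchain utilities):

```shell
# Demangle the missing symbol to see which C++ API the extension expects:
echo '_ZNK3c1017SymbolicShapeMeta18init_is_contiguousEv' | c++filt
# -> c10::SymbolicShapeMeta::init_is_contiguous() const

# Check which libtorch/libc10 the extension actually resolves at load time
# (skipped if the file isn't present on this machine):
SO=/home/ec2-user/conda/envs/textgen/lib/python3.10/site-packages/torchvision/image.so
if [ -f "$SO" ]; then ldd "$SO" | grep -E 'torch|c10'; fi
```

If the demangled symbol only exists in a newer (or older) torch than the libraries `ldd` resolves, the build picked up mismatched headers.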
Versions
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2023.3.20240304 (aarch64)
GCC version: (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.34

Python version: 3.10.9 (main, Mar 8 2023, 10:41:45) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.1.79-99.164.amzn2023.aarch64-aarch64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA T4G
Nvidia driver version: 550.54.14
cuDNN version: Probably one of the following:
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_adv_infer.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_adv_train.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_cnn_infer.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_cnn_train.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_ops_infer.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
L1d cache: 256 KiB (4 instances)
L1i cache: 256 KiB (4 instances)
L2 cache: 4 MiB (4 instances)
L3 cache: 32 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2+cu121
[pip3] torchaudio==2.1.2
[pip3] torchvision==0.16.2+cu121
[pip3] triton==2.1.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.1.2+cu121 pypi_0 pypi
[conda] torchaudio 2.1.2 pypi_0 pypi
[conda] torchvision 0.16.2+cu121 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi