lamikr / rocm_sdk_builder

gfx1102 : import torchaudio : Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41b40) #74

Closed jrl290 closed 5 days ago

jrl290 commented 1 week ago

Built rocm_sdk_builder on Ubuntu 22.04 with Linux Kernel 6.10-rc2

A simple import torchaudio yielded:

minipc@minipc:~/aipython$ python
Python 3.9.19 (tags/v3.9.19-dirty:882f62bd93, Jun 14 2024, 12:20:42)
[GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torchaudio
[minipc:2334463:0:2334463] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41b40)
Segmentation fault (core dumped)

Upgraded torchaudio with:

minipc@minipc:~/aipython$ pip install --upgrade torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
...
Successfully installed pytorch-triton-rocm-2.3.1 torch-2.3.1+rocm6.0 torchaudio-2.3.1+rocm6.0

Import afterward resulted in:

minipc@minipc:~/aipython$ python
Python 3.9.19 (tags/v3.9.19-dirty:882f62bd93, Jun 14 2024, 12:20:42)
[GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torchaudio
Segmentation fault (core dumped)
jeroen-mostert commented 1 week ago

Can confirm the same happens for me on a wildly different configuration: Manjaro, ROCm 6.1.2, Python 3.11.9, gfx1030 (RX 6800 XT), gcc 14.1.1.

lamikr commented 1 week ago

I need to check whether the latest changes or the torch update broke something. I'm away from home now, but I have a slightly older build on my laptop, and with that one torchaudio at least imports. I just tested the following:

import torch
import torchaudio
import torchaudio.functional as F
import torchaudio.transforms as T

print(torch.__version__)
print(torchaudio.__version__)

It prints:

2.3.0a0+git8ce7685
2.3.1+9b72aa0

I also tested that whisper was able to extract the lyrics from an mp3 song.

source /opt/rocm_sdk_612/bin/env_rocm.sh
pip3 install openai-whisper
whisper --model small testsong.mp3

jeroen-mostert commented 1 week ago

FWIW pip show torchaudio gives 2.3.0+fa58a23 on my 6.1.1 installation, and 2.3.1+4825a2e on my 6.1.2 installation. Both segfault on import.

torch is 2.3.0a0+git08c81de on my 6.1.1 installation and 2.3.0a0+git21912dc on my 6.1.2 installation. torch itself imports fine (and reports acceleration), as does torchvision.

A stack trace gives no interesting information, and while I've tried to build torchaudio in debug configuration (and seemingly succeeded) that yields no additional info either. The segfault happens on a clearly invalid IP that's either a jump to nowhere or the result of a return address scribbled over by stack corruption. That's as much troubleshooting as I have time for.

jrl290 commented 1 week ago

Can confirm whisper executes successfully while import torchaudio segfaults

lamikr commented 1 week ago

Just verified that a rebuilt, up-to-date torchaudio still works for me on a Mageia 9 / 6900HS laptop. Which Linux distro are you using?

jrl290 commented 1 week ago

> Just verified that a rebuilt, up-to-date torchaudio still works for me on a Mageia 9 / 6900HS laptop. Which Linux distro are you using?

Ubuntu 22.04 with Linux Kernel 6.10-rc2

lamikr commented 1 week ago

Thanks for confirming, I will try to reproduce this. In the meantime I re-tested some audio examples from

https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html

They still work for me on the latest build and give results similar to the tutorial.

jrl290 commented 1 week ago

Update: I tried import torchaudio immediately after ./babs.sh -i and it loaded without error

After I ran apt install for some dependencies for my use case, the error appeared again. Here is the list of dependencies I installed:

sudo apt install ffmpeg python3-pip python3-tk qtcreator qtbase5-dev qt5-qmake cmake libnuma-dev imagemagick libsndfile-dev libcairo2-dev pkg-config python3-dev libgirepository1.0-dev libjpeg-dev zlib1g-dev

I tried apt remove --purge to no avail. But I will start over and add them one by one to see which one is triggering it. Also many of these are redundant, so I will look to only add the packages that are not already installed

jeroen-mostert commented 1 week ago

Good find, ffmpeg seems to be the issue. The included libtorio_ffmpeg pulls in a lot of system libraries. If I specifically make libtorio_ffmpeg6 unavailable to dlopen, torchaudio imports without segfaulting and a simple speech rec tutorial works (and is using acceleration). I don't know what functionality is dependent on this library or how it should be built to avoid trouble.

jrl290 commented 1 week ago

Oh excellent! What did you do to block libtorio_ffmpeg6?

jeroen-mostert commented 1 week ago

mv /opt/rocm_sdk[...]/lib/python[...]/site-packages/torchaudio[...].egg/torio/lib/libtorio_ffmpeg6.so libtorio_ffmpeg6.so.bak. :P

Obviously, don't try this at home except for troubleshooting purposes, this is not a stable solution.

jeroen-mostert commented 1 week ago

Per this, there are knobs to twist to influence the ffmpeg dependency. Setting TORIO_USE_FFMPEG_VERSION=5 on my system also avoids the segfault -- notably, however, because I don't have version 5 installed using a class like StreamingMediaDecoder simply fails. Oddly enough, while TORIO_USE_FFMPEG=0 promises to turn off ffmpeg integration entirely, on my system that does not avoid the segfault.

Of course none of this is a solution to people who need the StreamingMediaDecoder (aka StreamReader)/StreamingMediaEncoder classes, which is what torio provides.
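
For anyone who wants to repeat this comparison on their own system, here is a minimal sketch that imports torchaudio in a child interpreter once per setting, so a segfault kills only the child. The two environment variable names are the ones mentioned above. The helper names (torio_env, describe_exit, sweep) are made up for illustration, and the sketch assumes it is run with the SDK's Python.

```python
import os
import signal
import subprocess
import sys


def torio_env(ffmpeg_version=None, use_ffmpeg=None, base=None):
    """Return a copy of the environment with the torio FFmpeg knobs set."""
    env = dict(os.environ if base is None else base)
    if ffmpeg_version is not None:
        env["TORIO_USE_FFMPEG_VERSION"] = str(ffmpeg_version)
    if use_ffmpeg is not None:
        env["TORIO_USE_FFMPEG"] = "1" if use_ffmpeg else "0"
    return env


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable verdict."""
    if returncode < 0:
        # On POSIX a negative return code means the child died from a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return "ok" if returncode == 0 else f"exit code {returncode}"


def import_exit_code(module, env):
    """Import `module` in a child interpreter and return the child's exit code."""
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        env=env,
        capture_output=True,
    )
    return proc.returncode


def sweep(module="torchaudio"):
    """Try the default environment plus each knob and collect the verdicts."""
    settings = {
        "default": torio_env(),
        "TORIO_USE_FFMPEG_VERSION=4": torio_env(ffmpeg_version=4),
        "TORIO_USE_FFMPEG_VERSION=5": torio_env(ffmpeg_version=5),
        "TORIO_USE_FFMPEG_VERSION=6": torio_env(ffmpeg_version=6),
        "TORIO_USE_FFMPEG=0": torio_env(use_ffmpeg=False),
    }
    return {
        label: describe_exit(import_exit_code(module, env))
        for label, env in settings.items()
    }
```

Calling sweep() returns one verdict per setting. A setting that reports "ok" only shows that the import survived, not that FFmpeg is usable.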

For completeness & comparison with other systems, the version information as provided by ffmpeg:

ffmpeg version n6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 13.2.1 (GCC) 20230801
  configuration: --prefix=/usr --disable-debug --disable-static --disable-stripping --enable-amf --enable-avisynth --enable-cuda-llvm --enable-lto --enable-fontconfig --enable-frei0r --enable-gmp --enable-gnutls --enable-gpl --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libdav1d --enable-libdrm --enable-libfreetype --enable-libfribidi --enable-libgsm --enable-libharfbuzz --enable-libiec61883 --enable-libjack --enable-libjxl --enable-libmodplug --enable-libmp3lame --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libplacebo --enable-libpulse --enable-librav1e --enable-librsvg --enable-librubberband --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libv4l2 --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpl --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxcb --enable-libxml2 --enable-libxvid --enable-libzimg --enable-nvdec --enable-nvenc --enable-opencl --enable-opengl --enable-shared --enable-vapoursynth --enable-version3 --enable-vulkan
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    60. 16.100 / 60. 16.100
  libavdevice    60.  3.100 / 60.  3.100
  libavfilter     9. 12.100 /  9. 12.100
  libswscale      7.  5.100 /  7.  5.100
  libswresample   4. 12.100 /  4. 12.100
  libpostproc    57.  3.100 / 57.  3.100

Also, installing vanilla torchaudio and numpy in a new venv (so not the custom ROCm version) succeeds and results in torchaudio.list_audio_backends() reporting ffmpeg.

lamikr commented 1 week ago

I am still not able to reproduce this on Ubuntu 22.04.4 or on Mageia 9. I did a clean build on rocm_sdk_builder 6.1.2 for gfx1035 on ubuntu 22.04.4 with all updates installed.

Then I followed this tutorial:

https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html

If I run the attached torch_audio_play.py, it works and shows the same diagrams for the audio as the tutorial.

source /opt/rocm_sdk_612/bin/env_rocm.sh
pip install pyqt5 librosa boto3
python -i torch_audio_play.py
Sample Rate: 16000
Shape: (1, 54400)
Dtype: torch.float32
 - Max:      0.668
 - Min:     -1.000
 - Mean:     0.000
 - Std Dev:  0.122

tensor([[0.0183, 0.0180, 0.0180,  ..., 0.0018, 0.0019, 0.0032]])

I have these libtorio libs installed under rocm_sdk_612

./lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torio/lib/libtorio_ffmpeg4.so
./lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torio/lib/libtorio_ffmpeg6.so
./lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torio/lib/libtorio_ffmpeg5.so

torch_audio_play.py.txt

lamikr commented 1 week ago

@jrl290 Would it help to rebuild pytorch, pytorch vision and pytorch audio in the SDK after you have installed the additional packages (ffmpeg, etc.)?

You can rebuild them by removing their directories from builddir.

jeroen-mostert commented 1 week ago

So, I think I've found the cause and a possible solution.

The main problem appears to be that torchaudio downloads statically built versions of FFmpeg to build and link against, but at runtime, the system's FFmpeg is dynloaded. As a result, any version discrepancy has the potential to cause things to blow up. The packaged FFmpeg links against libavutil.so.58.2.100, for example, but my system has libavutil.so.58.29.100. It's not hard to imagine this going wrong. (That said, I cannot explain from this alone why a vanilla pip install torchaudio in a venv doesn't cause any issues, since it should have the same problem -- maybe there's an include file/architecture/optimization mixup in our builds?)
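
To make the version comparison concrete, here is a small sketch that parses versioned shared object names and says how two of them differ. The function names are made up for illustration. The reading that an equal first number normally means a compatible ABI is the usual soname convention, not something verified for these particular builds.

```python
import re

# Matches names such as libavutil.so.58.29.100 or libfftw3.so.3
_SONAME = re.compile(r"^(lib[\w+-]+)\.so\.(\d+(?:\.\d+)*)$")


def parse_soname(filename):
    """Split 'libavutil.so.58.29.100' into ('libavutil', (58, 29, 100))."""
    match = _SONAME.match(filename)
    if match is None:
        raise ValueError(f"not a versioned shared object name: {filename!r}")
    name, version = match.groups()
    return name, tuple(int(part) for part in version.split("."))


def compare_sonames(built_against, found_at_runtime):
    """Describe how the runtime library differs from the one used at build time."""
    name_a, ver_a = parse_soname(built_against)
    name_b, ver_b = parse_soname(found_at_runtime)
    if name_a != name_b:
        return "different library"
    if ver_a == ver_b:
        return "identical"
    if ver_a[0] != ver_b[0]:
        return "major version differs"
    return "same major, minor/micro differs"
```

For the two libavutil files named above, compare_sonames("libavutil.so.58.2.100", "libavutil.so.58.29.100") reports the same major version with a different minor. Such a pair is normally expected to be compatible, so a mismatch of this kind alone may not explain a crash.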

As luck (?) would have it, as of today, Arch/Manjaro have upgraded to FFmpeg 7. As a result torchaudio no longer crashes, but that's because it can't load FFmpeg at all anymore. There appears to be no work ongoing in the repo to include FFmpeg 7 support yet.

I understand FFmpeg has some LGPL/GPL weirdness going on in the licensing that encourages this kind of stuff. It may be necessary or at least desirable to build our own FFmpeg specifically to avoid these issues, though I don't know if torchaudio should be leading on that.

I was able to make torchaudio work again by simply copying the FFmpeg libs it uses to the ROCm directory:

curl https://pytorch.s3.amazonaws.com/torchaudio/ffmpeg/2023-07-06/linux_x86_64/6.0.tar.gz | tar x
cp -a ffmpeg/lib/lib* /opt/rocm_sdk_612/lib

The URL is taken from pytorch_audio/third_party/ffmpeg/multi/CMakeLists.txt.

After this, torchaudio imports and confirms that FFmpeg is supported:

>>> import torchaudio
>>> torchaudio.list_audio_backends()
['ffmpeg', 'soundfile']

(If this code only outputs ['soundfile'], or [], FFmpeg wasn't loaded.)

We may be able to integrate this into the build, since CMake ends up downloading this file at some point, but I'm no CMake wizard so I don't know how exactly this should be done.

Note that I haven't tested if there's any potential conflicts with an existing FFmpeg yet, since I can't, since my system has no FFmpeg 6 anymore. :P I have also not tested if the library files torchaudio downloads are in any way optimized or usable across distros; if they're bare-bones unoptimized it again may be necessary to do a custom optimized build to include in the ROCm SDK itself (if possible), as at least Arch has seen fit to make the jump to FFmpeg 7 and other distros may follow.

jeroen-mostert commented 1 week ago

> Would it help to rebuild pytorch, pytorch vision and pytorch audio in the SDK after you have installed the additional packages (ffmpeg, etc.)?
>
> You can rebuild them by removing their directories from builddir.

Important note: simply removing the builddir is not sufficient to get these to rebuild cleanly. They each leave intermediates in their src_projects/pytorch*/build directory that have to be removed (by removing the whole build directory). This is another potential cause for trouble if you're not in the habit of rebuilding from scratch every time.

lamikr commented 1 week ago

To resolve the cleanup issue, I have patched the Python projects to have a preconfig_*.sh script. For example, if you remove the pytorch folder from builddir and then call ./babs.sh -b, the pytorch build will call "./preconfig_pytorch_rocm.sh ${INSTALL_DIR_PREFIX_SDK_ROOT}".

In the pytorch case, the preconfig script now calls python setup.py clean.

cat src_projects/pytorch/preconfig_pytorch_rocm.sh 
if [ -z "$1" ]; then
    install_dir_prefix_rocm=/opt/rocm
    echo "No rocm_root_directory_specified, using default: ${install_dir_prefix_rocm}"
else
    install_dir_prefix_rocm=${1}
    echo "using rocm_root_directory specified: ${install_dir_prefix_rocm}"
fi
unset LDFLAGS
unset CFLAGS
unset CPPFLAGS
unset PKG_CONFIG_PATH
if [ -e ./preconfig_pytorch_rocm.sh ]; then
    if [ -d ./build ]; then
        #rm -rf build
        #rm -rf torch (this is needed to really get all files regenerated for hip)
        #git status | xargs -- rm -rf
        #git reset --hard
        #git submodule update --init --recursive
        python setup.py clean
    fi
fi

Earlier I had "rm -rf build" there, but thought it was not needed, as pytorch's setup.py seemed to implement a clean method. Do you think that should be changed to just call rm -rf "build"?

jeroen-mostert commented 1 week ago

Oh, that may be my bad -- maybe I just remembered the problem from earlier builds and did not recheck. Indeed, if I try that now it appears to work correctly.

lamikr commented 1 week ago

I am now checking the CMake files for ffmpeg support and I can see at least 3 possible ways to solve the issue.

1) Configure option to allow user to select whether to use linux distro version of libraries or the one offered by us. (This option seems to work on some distros as I was not able to trigger the bug. By default this would need to be off to support all distros. Another problem is that the config-menu system I use does not support it at least yet.)

2) We build our own ffmpeg in the same way as we build cmake, zstd, python, boost, gtest and some other basic libraries, to guarantee compatibility across distros. That ffmpeg would then need to be a plain version without any of the tainted/bad/ugly features that some builds offer for code with possible patent problems in some countries. That means support for some codecs could be missing. At the moment I think we would build the latest ffmpeg 6.0 version.

This option is selectable in pytorch_audio's root CMakeLists.txt

  if (DEFINED ENV{FFMPEG_ROOT})
    add_subdirectory(third_party/ffmpeg/single)

3) We build the system just like now and modify src_projects/pytorch_audio/package_pytorch_audio_rocm_wheel.sh so that it copies the .so files from the build/temp.linux-x86_64-cpython-39/_deps/f6-src/lib/ folder to the ${install_dir_prefix_rocm}/lib64 or ${install_dir_prefix_rocm}/lib folder.

Option (3) is the easiest to implement, but it also has its own problems, because the .so files extracted from 6.0.tar.gz may not be compatible with all distributions. (They are in any case linked against other files that are expected to be on the distro.)

I think we should first add support for option 3 anyway, and then check how to build ffmpeg from source and use that version for all packages.

jeroen-mostert commented 1 week ago

Yeah, the codec support may be an issue for real use cases and may be motivation to find some way to make things work in a stable way with the system-provided FFmpeg in all cases, but that seems to be difficult without overhauling the way pytorch does things. It's not entirely clear to me why pytorch does things the way it does, with the weird mix of linking against its own copies but then hoping that the version on the system will be compatible with it while dynamically loading. Either go full static or full dynamic, not this weird mix with potential ABI problems. But then I have no experience developing against FFmpeg so maybe this is just my ignorance showing. :P

jeroen-mostert commented 1 week ago

I just checked with torio a bit more and it appears the included libs are indeed specifically built so things can link, with no codec support whatsoever, so practically speaking this way of supporting FFmpeg is useless in any case. In this case FFmpeg support should simply be disabled as including the statically linked files suggests support while there is no actual support. That means option 3 is a no-go.

In fact now that Arch/Manjaro have FFmpeg 7 I should probably do a full rebuild to see if things even work for other packages anymore; audio will get by with no support but I'm not sure about the rest. There is not yet a compatibility package for FFmpeg 6 in the Arch repos, though I suspect one will pop up in AUR before too long (as there are packages for FFmpeg 4 and 5).

lamikr commented 1 week ago

On Mageia 9, copying the .so files from the pytorch_audio dir actually breaks torchaudio.

cp -axf ./build/temp.linux-x86_64-cpython-39/_deps/f6-src/lib/* /opt/rocm_sdk_612/lib

causes


python torch_audio_import.py 
2.3.0a0+git94f83d9
2.3.1+9b72aa0
[lamikr@localhost torch_audio]$ python torch_audio_play.py 
Traceback (most recent call last):
  File "/home/lamikr/own/rocm/src/ml_models/examples/pytorch/torch_audio/torch_audio_play.py", line 369, in <module>
    waveform, sample_rate = torchaudio.load("./speech.wav")
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torchaudio/_backend/utils.py", line 205, in load
    return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torchaudio/_backend/ffmpeg.py", line 297, in load
    return load_audio(uri, frame_offset, num_frames, normalize, channels_first, format)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torchaudio/_backend/ffmpeg.py", line 88, in load_audio
    s = torchaudio.io.StreamReader(src, format, None, buffer_size)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torio/io/_streaming_media_decoder.py", line 526, in __init__
    self._be = ffmpeg_ext.StreamingMediaDecoder(os.path.normpath(src), format, option)
RuntimeError: Failed to open the input "speech.wav" (Protocol not found).
Exception raised from get_input_format_context at /home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/pytorch_audio/src/libtorio/ffmpeg/stream_reader/stream_reader.cpp:48 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xa9 (0x7f4b7699aff9 in /opt/rocm_sdk_612/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xc2 (0x7f4b7694a306 in /opt/rocm_sdk_612/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x319d1 (0x7f4a99dc49d1 in /opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torio/lib/libtorio_ffmpeg6.so)
frame #3: torio::io::StreamingMediaDecoder::StreamingMediaDecoder(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > const&) + 0x14 (0x7f4a99dc4a44 in /opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torio/lib/libtorio_ffmpeg6.so)
frame #4: <unknown function> + 0x390cd (0x7f4a997d80cd in /opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torio/lib/_torio_ffmpeg6.so)
frame #5: <unknown function> + 0x38fbe (0x7f4a997d7fbe in /opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torio/lib/_torio_ffmpeg6.so)
frame #6: <unknown function> + 0x2355d (0x7f4a997c255d in /opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torio/lib/_torio_ffmpeg6.so)
<omitting python frames>
frame #12: <unknown function> + 0xaa9b (0x7f4b34f2ca9b in /opt/rocm_sdk_612/lib/python3.9/site-packages/torchaudio-2.3.1+9b72aa0-py3.9-linux-x86_64.egg/torchaudio/lib/_torchaudio.so)
frame #41: <unknown function> + 0x23737 (0x7f4bb2a51737 in /lib64/libc.so.6)
frame #42: __libc_start_main + 0x85 (0x7f4bb2a517f5 in /lib64/libc.so.6)
jeroen-mostert commented 1 week ago

For completely up-to-date versions of Arch/Manjaro, where FFmpeg is now at version 7, a workaround for now appears to be to install the ffmpeg4.4 compatibility package from extra or, if you're feeling adventurous enough to use AUR, the ffmpeg5.1 package (the latter does not build on my system, however). Both of these saddle torchaudio with crusty old versions, but at least they work and provide codecs. This is not a solution for the problem of the OP, or any other distro where there may be an incompatibility between the minor versions of FFmpeg.

lamikr commented 1 week ago

So Arch and Manjaro do have ffmpeg 4, 5 and 7 but not 6?

There is also option (4) to test: install the distro-specific FFmpeg development package (on Mageia it was ffmpeg-devel). Then add the following line to src_projects/pytorch_audio/build_pytorch_audio_rocm.sh: export FFMPEG_ROOT=/usr

So the file needs to look like this:

if [ -z "$1" ]; then
    install_dir_prefix_rocm=/opt/rocm
    echo "No rocm_root_directory_specified, using default: ${install_dir_prefix_rocm}"
else
    install_dir_prefix_rocm=${1}
    echo "using rocm_root_directory specified: ${install_dir_prefix_rocm}"
fi
unset LDFLAGS
unset CFLAGS
unset CPPFLAGS
unset PKG_CONFIG_PATH
export CMAKE_C_COMPILER=${install_dir_prefix_rocm}/bin/hipcc
export CMAKE_CXX_COMPILER=${install_dir_prefix_rocm}/bin/hipcc
export FFMPEG_ROOT=/usr
ROCM_PATH=${install_dir_prefix_rocm} CMAKE_PREFIX_PATH="${install_dir_prefix_rocm};${install_dir_prefix_rocm}/lib64/cmake" USE_ROCM=1 USE_FFMPEG=1 USE_OPENMP=1 CC=${CMAKE_C_COMPILER} CXX=${CMAKE_CXX_COMPILER} python setup.py install

You need to remove builddir/039_04_pytorch_audio and any rocm_sdk_612/lib/libav* files you may have copied. This way pytorch_audio tries to find the distro-specific ffmpeg headers under /usr and builds against them.

If you want to debug this, you can temporarily add this line

message(FATAL_ERROR "FFMPEG_DIR: ${_root}")

to src_projects/pytorch_audio/third_party/ffmpeg/single/CMakeLists.txt

CMake will then stop there and print where it looked for the ffmpeg libraries. After that you can remove the debug line that caused cmake to stop and issue the build command again.

jeroen-mostert commented 1 week ago

Arch/Manjaro have a package for FFmpeg 4 for compatibility. There is an unofficial, user-maintained package for FFmpeg 5. There is not yet any package to provide FFmpeg 6, because, up until yesterday, that was the official version used before upgrading to FFmpeg 7. So if you're on Arch or Manjaro unstable and you're up to date, you currently have no way of building torch with FFmpeg support (unless you deliberately downgrade the package, which is not generally recommended as it can easily break things on a rolling distro).

Arch packages generally provide headers without the need for a separate development package and ffmpeg is no different, so yes you can build against the system FFmpeg this way. Unfortunately, this does not work as the source is not compatible with the changes in FFmpeg 7 (which is not unexpected as they explicitly required a <7 version, after all). Patching in support for that looks decidedly nontrivial.

This may be an option for Ubuntu or other systems if they offer devel packages for FFmpeg though, so that's something that could be explored for the OP. The issue might need to be split between "how to make things compatible for distros that offer FFmpeg 4-5-6 but might have a minor incompatibility" vs. "how to support FFmpeg 7", as the latter is a much bigger thing. Eventually upstream should get to that, though it might take a while. :P

jeroen-mostert commented 1 week ago

Bad news (possibly): after manually extracting the FFmpeg 6 libraries from the archived package into a separate directory and building torchaudio against it, the resulting package still fails with a segfault, even though the package now only contains a module for the system-specific FFmpeg (the FFmpeg 6 libraries have been copied to the ROCm dir). This means my original find may have been a red herring and I'm back to square one figuring out what's going wrong.

jeroen-mostert commented 1 week ago

Well, it took a while, but with gdb by my side I finally narrowed it down: the segfault happens when opt/rocm_sdk_.../lib64/libfftw3.so.3 needs to be loaded. When I mask this symlink and load the system's fftw, loading the ffmpeg library also succeeds. I don't yet know what's wrong there, that's chapter 2. A rebuild of amd-fftw did not fix things, in any case.

@jrl290: it would be nice if you could test if removing/renaming this symlink fixes things on your end too, to see if it's the same problem.
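
Once the import survives (for example with the symlink masked), one way to confirm which copy of fftw actually got loaded is to read the process's memory map on Linux. This is a minimal sketch, not the method used above (that was gdb). The function names are made up for illustration, and it assumes /proc/self/maps is available.

```python
import re


def mapped_libraries(maps_text, pattern):
    """Return sorted, de-duplicated file paths from /proc/<pid>/maps text
    whose path matches the regular expression `pattern`."""
    regex = re.compile(pattern)
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and regex.search(fields[5]):
            paths.add(fields[5])
    return sorted(paths)


def mapped_libraries_of_this_process(pattern):
    """Apply mapped_libraries() to the current process (Linux only)."""
    with open("/proc/self/maps") as handle:
        return mapped_libraries(handle.read(), pattern)
```

After a successful import torchaudio in the same interpreter, mapped_libraries_of_this_process(r"libfftw3") lists the fftw files that are mapped, which shows whether they come from the SDK directory or from the system.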

jeroen-mostert commented 6 days ago

The option that seems to cause offense (easy to find since it smelled suspiciously like something that could cause loading problems) is --enable-dynamic-dispatcher. When I remove this option from the FFTW double precision build, the library loads (and I've built the tests to verify it's working correctly). I don't know why the resulting library fails to work on my system when the option is set, or how much is lost without it (I have a Ryzen 5 7600, in case it matters). Experimentally I tried upgrading the repo to the latest tag (4.2) but this doesn't change anything; it still fails with dynamic dispatch enabled and succeeds without.

lamikr commented 6 days ago

@jeroen-mostert You beat me, nice catch!

I was finally able to reproduce @jrl290's exact segfault at address 0x41b40, and according to strace it happened on an mprotect call. I also tested building against Ubuntu's ffmpeg headers and libraries, and that did fix the problem. I was just doing a torchaudio debug build to trace the torio_ffmpeg library when I read your message.

As far as I understand, the --enable-dynamic-dispatcher option packages the code paths for all the different CPUs into the same library, and loading that now somehow fails on Ubuntu. I removed the --enable-dynamic-dispatcher option from all 4 amd-fftw builds and it fixed the issue on Ubuntu 24.4 for me as well.

Do you want to make a pull request with a patch that removes the "--enable-dynamic-dispatcher" option from all of these?

020_01_amd_fftw_single_precision.binfo
020_02_amd_fftw_double_precision.binfo
020_03_amd_fftw_long_double_precision.binfo
020_04_amd_fftw_quad_precision.binfo
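
The edit itself is just deleting one token from each file. Here is a minimal sketch of doing that in bulk. The helper names are made up for illustration, and it assumes the flag appears as a plain whitespace-separated word in the files. The paths to the four .binfo files are passed in by the caller because their location in the checkout is not stated here. Check the result with git diff before rebuilding.

```python
import re
from pathlib import Path

FLAG = "--enable-dynamic-dispatcher"


def strip_flag(text, flag=FLAG):
    """Remove every whole-word occurrence of `flag` and the spaces before it."""
    # (?![\w-]) keeps a longer option that merely starts with `flag` intact.
    return re.sub(r"[ \t]*" + re.escape(flag) + r"(?![\w-])", "", text)


def patch_files(paths, flag=FLAG):
    """Rewrite each file in place if it contains `flag`; return changed paths."""
    changed = []
    for path in map(Path, paths):
        original = path.read_text()
        patched = strip_flag(original, flag)
        if patched != original:
            path.write_text(patched)
            changed.append(path)
    return changed
```

patch_files() returns only the files it actually modified, so an empty list means the flag was not found in any of them.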

jrl290 commented 6 days ago

Finally got through a fresh install, removed --enable-dynamic-dispatcher, and rebuilt

And it seems this problem is solved. Thank you!

Unfortunately, the GPU is still unstable during processing. The original hope for trying a gfx1102 build instead of gfx1100 (which AMD provides directly) was to correct the random GPU hangs that occur. Specifically:

HW Exception by GPU node-1 (Agent handle: 0x62ad7f0d8420) reason :GPU Hang

This is no doubt a larger problem. Not at all related to rocm_sdk_builder. But be forewarned that it has been occurring for me on the Ryzen 7840U. And any debugging tips would be welcome. I'm not quite as advanced in this area as you guys

jeroen-mostert commented 5 days ago

I have no specific advice, but it sounds like that should probably be a separate issue, with steps to reproduce if possible. Even if there's no (easy) fix it will confirm for others with a 7840U that they're not alone.