avfilter/tonemapx: add simd optimized tonemapx

gnattu commented 3 months ago

This includes NEON for ARMv8, SSE for x86-64-v2 and AVX+FMA for x86-64-v3

Test result with 4K HEVC 10bit HLG input, encoding with libx264 veryfast using bt2390:

Intel Core i9-12900:

tonemapx.c: 57fps tonemapx.sse: 74fps tonemapx.avx: 77fps

Apple M1 Max:

tonemapx.c:43fps tonemapx.neon: 57fps

For comparison, original zscale+tonemap simd results:

Intel Core i9-12900:

tonemap.avx: 40fps tonemap.sse: 40fps tonemap.c: 32fps

Apple M1 Max:

tonemap.neon: 44fps tonemap.c: 35fps

The original implementation is too memory heavy that dual-channel desktop CPUs are easily memory bounded due to the intermediate RGBF32 framebuffer sharing with zscale. Tonemapx lowered the the bandwidth requirement which brings significant performance gain to bandwidth limited platforms. Even for bandwidth-rich M1 Max it still provides significant performance boost due to better cache hitrate.

Changes

Fixed an issue that the input parameter is not really initialized
Added neon, avx and sse code path
Use function pointer instead of branching in each thread for tonemap function selection

Issues

Replaces #401

nyanmisaka commented 3 months ago

Got 74fps with the tonemapx.avx on 5950x on Windows. I'm quite pleased with the performance.

Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input.

gnattu commented 3 months ago

Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input.

Probably not that much for bandwidth rich systems as this conversion is mainly memory operation not compute operation, could be useful for bandwidth constrained platforms if we process yuv420p directly as the frame copy is further reduced and our avx implementation could see more improvements.

nyanmisaka commented 3 months ago

Further adding yuv420p10 input and yuv420p output support should squeeze out more FPS, because IIRC yuv420p1x to p01x conversion in FFmpeg is written in C, and other encoders such as libx265 and svt-av1 do not support nv12 input.

Probably not that much for bandwidth rich systems as this conversion is mainly memory operation not compute operation, could be useful for bandwidth constrained platforms if we process yuv420p directly as the frame copy is further reduced and our avx implementation could see more improvements.

We really can't expect all our users to have i9/R9/Apple Silicon Max. FWIW in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in a real world testing because this version is already much better than the zscale+tonemap.orig.

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10 -vframes 1000 -f null -

frame= 1000 fps=251 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=4.19x

./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10,format=p010 -vframes 1000 -f null -

frame= 1000 fps=168 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=2.81x

gnattu commented 3 months ago

We really can't expect all our users to have i9/R9/Apple Silicon Max. FWIW in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in a real world testing because this version is already much better than the zscale+tonemap.orig.
./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10 -vframes 1000 -f null -
frame= 1000 fps=251 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=4.19x
./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10,format=p010 -vframes 1000 -f null -
frame= 1000 fps=168 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=2.81x

Do you have time for a reference c implementation of yuv420p? I can port SIMD version to it.

nyanmisaka commented 3 months ago

We really can't expect all our users to have i9/R9/Apple Silicon Max. FWIW in this simple test it already clearly hurts performance, although it wouldn't be so noticeable in a real world testing because this version is already much better than the zscale+tonemap.orig.
./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10 -vframes 1000 -f null -
frame= 1000 fps=251 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=4.19x
./ffmpeg -threads 16 -i 4K_HLG.MP4 -an -sn -vf format=yuv420p10,format=p010 -vframes 1000 -f null -
frame= 1000 fps=168 q=-0.0 Lsize=N/A time=00:00:16.71 bitrate=N/A speed=2.81x
Do you have time for a reference c implementation of yuv420p? I can port SIMD version to it.

Not yet. But here are some examples. Just convert it to use in a for loop. The difference between nv12/p01x and yuv420p/yuv420p1x is only whether U and V are interleaved.

https://github.com/jellyfin/jellyfin-ffmpeg/blob/44e1bf07446cb2a280ed55a08f303a1014ab5bfe/debian/patches/0005-add-cuda-tonemap-impl.patch#L745-L883

gnattu commented 3 months ago

gnu.org is down for 9 hours now and failing the pipeline

nyanmisaka commented 3 months ago

ffmpeg crashes after inserting a downscaling filter and running for a while. scale=s=1920x1080:flags=fast_bilinear,tonemapx=...

For example, these resolutions. 1280x720 1920x1080 1920x1088 2560x1440

It can't be reproduced in tonemapx.c. I guess it has something to do with the edges of the image that cannot be accelerated by SIMD.

- <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
- <System>
  <Provider Name="Application Error" /> 
  <EventID Qualifiers="0">1000</EventID> 
  <Version>0</Version> 
  <Level>2</Level> 
  <Task>100</Task> 
  <Opcode>0</Opcode> 
  <Keywords>0x80000000000000</Keywords> 
  <TimeCreated SystemTime="2024-06-26T02:48:40.9359314Z" /> 
  <EventRecordID>216441</EventRecordID> 
  <Correlation /> 
  <Execution ProcessID="0" ThreadID="0" /> 
  <Channel>Application</Channel> 
  <Computer>pc</Computer> 
  <Security /> 
  </System>
- <EventData>
  <Data>ffmpeg.exe</Data> 
  <Data>0.0.0.0</Data> 
  <Data>667ad11b</Data> 
  <Data>avfilter-9.dll</Data> 
  <Data>9.3.100.0</Data> 
  <Data>667ad11b</Data> 
  <Data>c0000005</Data> 
  <Data>000000000032dab4</Data> 
  <Data>3818</Data> 
  <Data>01dac77359173dbe</Data> 
  <Data>C:\Users\usr\Desktop\jellyfin-ffmpeg_6.0.1-7-portable_win64\ffmpeg.exe</Data> 
  <Data>C:\Users\usr\Desktop\jellyfin-ffmpeg_6.0.1-7-portable_win64\avfilter-9.dll</Data> 
  <Data>7b50b856-5989-4ef4-ad9f-03ded7842dfe</Data> 
  <Data /> 
  <Data /> 
  </EventData>
  </Event>

gnattu commented 3 months ago

How long did you let it run? I cannot reproduce on my machine.

gnattu commented 3 months ago

interesting, only occurs with windows builds

nyanmisaka commented 3 months ago

It happens randomly. Not sure if this has to do with gcc options and versions.

https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/Dockerfile.win64.in#L20 https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/builder/images/base-win64/Dockerfile#L43

Maybe try another builder? ./builder/makeimage.sh win64 gpl && ./builder/build.sh win64 gpl

gnattu commented 3 months ago

It happens randomly.

The issue occurs frequently on Windows but never on Linux. Windows reports an access violation, which doesn't make sense to me if all memory accesses are to legal locations on Linux.

nyanmisaka commented 3 months ago

It happens randomly.

The issue occurs frequently on Windows but never on Linux. Windows reports an access violation, which doesn't make sense to me if all memory accesses are to legal locations on Linux.

Does the compiler emit the same assembly code from the intrinsics?

gnattu commented 3 months ago

./builder/makeimage.sh win64 gpl gives me 18.92 ../src/meson.build:28:12: ERROR: Program 'llvm-dlltool dlltool' not found or not executable when building libplacebo...

nyanmisaka commented 3 months ago

./builder/makeimage.sh win64 gpl gives me 18.92 ../src/meson.build:28:12: ERROR: Program 'llvm-dlltool dlltool' not found or not executable when building libplacebo...

I'll take a look at it later. You can drop the vulkan related scripts first.

gnattu commented 3 months ago

The problem is now super stupid to me. I commented out both read and write to/from the framebuffer operation and it is still telling me access violation.

gnattu commented 3 months ago

Well I found the cause. GCC generated code sequences from _mm256_extract_epi32 and _mm_extract_epi32 intrinsics includes optimizations that do not play well with windows and will cause access violation. By storing the whole register to memory and read from there workarounds the access violation on windows.

nyanmisaka commented 3 months ago

Well I found the cause. GCC generated code sequences from _mm256_extract_epi32 and _mm_extract_epi32 intrinsics includes optimizations that do not play well with windows and will cause access violation. By storing the whole register to memory and read from there workarounds the access violation on windows.

I guess this uncertainty from the compiler is one of the reasons why upstream FFmpeg only accepts assembly code.

gnattu commented 3 months ago

Well it turns out that it is not that simple.

Now I really believe it is due to gcc+windows. I can workaround this issue by reducing the -filter_threads to a lower value like 8. When this value is set to something like 16 or higher, it is super easy to trigger the access violation again.

To make it even worse, it seems like only cap the concurrency in tonemapx is not enough and this has to be a global option which means all filters in the chain has to be concurrency capped.

With a global concurrency of 24 and we only spawn 1 job for tonemapx, the access violation still happens after a few moments.

nyanmisaka commented 3 months ago

Well it turns out that it is not that simple.

Now I really believe it is due to gcc+windows. I can workaround this issue by reducing the -filter_threads to a lower value like 8. When this value is set to something like 16 or higher, it is super easy to trigger the access violation again.

To make it even worse, it seems like only cap the concurrency in tonemapx is not enough and this has to be a global option which means all filters in the chain has to be concurrency capped.

With a global concurrency of 24 and we only spawn 1 job for tonemapx, the access violation still happens after a few moments.

Take it easy. We still have several months to investigate before JF 10.10.

Could it be related to LTO and GCC auto-vectorization? https://github.com/jellyfin/jellyfin-ffmpeg/blob/jellyfin/debian/patches/0035-enable-gcc-vectorization-and-lto-auto.patch

gnattu commented 3 months ago

__attribute__((optimize("-fno-tree-vectorize"))) on simd functions made little to no difference

What works is that by implementing the write whole block back to memory and read from there(no usage of _mm256_extract_epi32) plus a global threads cap of 8 and a normal video resolution (like 1920x1080), and an abnormal output of 1928x1080 requires even lower thread limit to 4.

I however observed that zscale can support a huge amount of concurrency (with 1920x1080 works with 24 and 1928x1080 works with 16) after the _mm256_extract_epi32 workarounds being implemented. Maybe ffmpeg's native scaling filter is having some problem?

Edit: it seems like zscale works even without modification? At least it works with -filter_threads 16 and all the resolution combination you mentioned.

nyanmisaka commented 3 months ago

Have you seen these? MingGW-W64 seems to be quite fragile in handling AVX.

https://stackoverflow.com/questions/71859992/what-is-causing-this-memory-access-violation-error-0xc0000005-when-using-eigen

https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

edit: If this is the reason, a workaround is to add the compiler option -Wa,-muse-unaligned-vector-move to mingw.

https://github.com/search?q=use-unaligned-vector-move&type=code

gnattu commented 3 months ago

Have you seen these? MingGW-W64 seems to be quite fragile in handling AVX.

https://stackoverflow.com/questions/71859992/what-is-causing-this-memory-access-violation-error-0xc0000005-when-using-eigen

https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

Well it is worse than I think. The problem now is that the sse version also has such problems and I have no idea why. It does support higher concurrency though, but when the threads is too much the access violation will eventually come even with only sse. Still, zscale works better.

nyanmisaka commented 3 months ago

Have you seen these? MingGW-W64 seems to be quite fragile in handling AVX. https://stackoverflow.com/questions/71859992/what-is-causing-this-memory-access-violation-error-0xc0000005-when-using-eigen https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

Well it is worse than I think. The problem now is that the sse version also has such problems and I have no idea why. It does support higher concurrency though, but when the threads is too much the access violation will eventually come even with only sse. Still, zscale works better.

Perhaps you should try compiling with MSVC to see if this is just another mingw gcc issue. https://github.com/ShiftMediaProject/FFVS-Project-Generator

gnattu commented 3 months ago

Perhaps you should try compiling with MSVC to see if this is just another mingw gcc issue. https://github.com/ShiftMediaProject/FFVS-Project-Generator

Will figure it out later. -Wa,-muse-unaligned-vector-move does not help so this is another issue i think. But I still removed avx on windows due to its known bug.

gnattu commented 3 months ago

The access violation with many threads even happens with msvc(btw our codebase is not really compatible with msvc and have to comment out a lot of things to make it compile). Perhaps I have to try that project generator to debug with visual studio to see what happens...

gnattu commented 3 months ago

I "fixed" the access violation on Windows by refactoring the memory store logic and using more stable range clipping.

The new memory store logic improved performance for all platforms:

i9-12900 with AVX2: 77fps->87fps (Linux) M1 Max with NEON: 57fps->60fps

Now AVX is usable on Windows with compiler flag -Wa,-muse-unaligned-vector-move. The performance impact of this flag is around 5%. i9-12900 got 82fps on Windows.

gnattu commented 3 months ago

A question about the yuv420p implementation: If most of the software decoder and encoders are not expecting p01x frames, isn't it safe to just drop the support for such frames and supports yuv420 exclusively? For what use case the p01x is preferred?

nyanmisaka commented 3 months ago

A question about the yuv420p implementation: If most of the software decoder and encoders are not expecting p01x frames, isn't it safe to just drop the support for such frames and supports yuv420 exclusively? For what use case the p01x is preferred?

It is safe. The original idea was to fall back to software filters for hybrid transcoding when the iGPU ran out of power but still had spare capacity for hardware decoding and encoding.

nyanmisaka commented 3 months ago

I "fixed" the access violation on Windows by refactoring the memory store logic and using more stable range clipping.

The new memory store logic improved performance for all platforms:

i9-12900 with AVX2: 77fps->87fps (Linux) M1 Max with NEON: 57fps->60fps

Now AVX is usable on Windows with compiler flag -Wa,-muse-unaligned-vector-move. The performance impact of this flag is around 5%. i9-12900 got 82fps on Windows.

That's incredible. Will applying this Wa globally affect the performance of other programs? It would be nice if we could add flags only for tonemapx.

gnattu commented 3 months ago

Will applying this Wa globally affect the performance of other programs?

This only affects ffmpeg itself and not other dependencies, and for ffmpeg itself, tonemapx is the only one that allows generating vectorized avx and most of the performance critical path is written with ASM in ffmpeg itself and using external libraries which does not have this applied. The only scenario that might be impacted by this is that the code is in ffmpeg and written in c, and gcc auto vectorized it using SSE2 and now the aligned load becomes unaligned. Even like this the perf impact should be negligible.

It would be nice if we could add flags only for tonemapx.

It seems like there is no trivial way to do this. Do you know how could I do that?

gnattu commented 3 months ago

Added direct support for yuv420p to eliminate the frame format conversion for software decoder/encoders. The performance of i9-12900 boosted to 95fps for 4K to 4K tonemapping on Linux. The M1 Max does not see a significant performance boost with this change (only around a 5% increase to 63fps), which is expected as it has three times the memory bandwidth of the i9. Therefore, reducing memory copy will not benefit it as much.

nyanmisaka commented 3 months ago

Added direct support for yuv420p to eliminate the frame format conversion for software decoder/encoders. The performance of i9-12900 boosted to 95fps for 4K to 4K tonemapping on Linux. The M1 Max does not see a significant performance boost with this change (only around a 5% increase to 63fps), which is expected as it has three times the memory bandwidth of the i9. Therefore, reducing memory copy will not benefit it as much.

The 12900 performs reasonably, while M1 Max seems to be bounded by its computing power/core counts.

Now LOC has expanded to 5 times of pure C impl. How about merging some duplicate lines? For example, use macro to merge 10-to-10 and 10-to-8 funcs.

gnattu commented 3 months ago

The 12900 performs reasonably, while M1 Max seems to be bounded by its computing power/core counts.

Yes, in a VM with only 8P, the 12900 performs at about the same speed as the M1 Max

How about merging some duplicate lines? For example, use macro to merge 10-to-10 and 10-to-8 funcs.

The problem with macros is that it hurts readability a lot and I really want to avoid using that, I know the upstream ffmpeg uses this a lot but their code is painful to read.

nyanmisaka commented 3 months ago

The 12900 performs reasonably, while M1 Max seems to be bounded by its computing power/core counts.

Yes, in a VM with only 8P, the 12900 performs at about the same speed as the M1 Max

How about merging some duplicate lines? For example, use macro to merge 10-to-10 and 10-to-8 funcs.

The problem with macros is that it hurts readability a lot and I really want to avoid using that, I know the upstream ffmpeg uses this a lot but their code is painful to read.

So how about separating the different variants of simd code into multiple files tonemapx_intrin_{sse,avx,neon}.c? It's also painful to read them without a code editor (folding unrelated funcs).

gnattu commented 3 months ago

So how about separating the different variants of simd code into multiple files tonemapxintrin{sse,avx,neon}.c? It's also painful to read them without a code editor (folding unrelated funcs).

This sounds better to me

nyanmisaka commented 3 months ago

There are also some warnings that can be eliminated before merging. There should be more in the GH actions log.

For example:

libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_p016_p010_2_nv12’:
libavfilter/vf_tonemapx.c:495:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  495 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_420p10_2_420p’:
libavfilter/vf_tonemapx.c:585:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  585 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_420p10_2_420p10’:
libavfilter/vf_tonemapx.c:675:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  675 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘tonemap_frame_p016_p010_2_p016_p010’:
libavfilter/vf_tonemapx.c:767:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  767 |             int r00 = r[0], g00 = g[0], b00 = b[0];
      |             ^~~
libavfilter/vf_tonemapx.c: In function ‘filter_slice_planar10’:
libavfilter/vf_tonemapx.c:851:43: warning: passing argument 1 of ‘s->tonemap_func_planar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  851 |     s->tonemap_func_planar10(out->data[0] + out->linesize[0] * slice_start,
      |                              ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                           |
      |                                           uint8_t * {aka unsigned char *}
libavfilter/vf_tonemapx.c:851:43: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c:852:43: warning: passing argument 2 of ‘s->tonemap_func_planar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  852 |                              out->data[1] + out->linesize[1] * AV_CEIL_RSHIFT(slice_start, desc->log2_chroma_h),
libavfilter/vf_tonemapx.c:852:43: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c:853:43: warning: passing argument 3 of ‘s->tonemap_func_planar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  853 |                              out->data[2] + out->linesize[2] * AV_CEIL_RSHIFT(slice_start, desc->log2_chroma_h),
libavfilter/vf_tonemapx.c:853:43: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c: In function ‘filter_slice_biplanar10’:
libavfilter/vf_tonemapx.c:870:45: warning: passing argument 1 of ‘s->tonemap_func_biplanar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  870 |     s->tonemap_func_biplanar10(out->data[0] + out->linesize[0] * slice_start,
      |                                ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                             |
      |                                             uint8_t * {aka unsigned char *}
libavfilter/vf_tonemapx.c:870:45: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}
libavfilter/vf_tonemapx.c:871:45: warning: passing argument 2 of ‘s->tonemap_func_biplanar10’ from incompatible pointer type [-Wincompatible-pointer-types]
  871 |                                out->data[1] + out->linesize[1] * AV_CEIL_RSHIFT(slice_start, desc->log2_chroma_h),
libavfilter/vf_tonemapx.c:871:45: note: expected ‘uint16_t *’ {aka ‘short unsigned int *’} but argument is of type ‘uint8_t *’ {aka ‘unsigned char *’}

nyanmisaka commented 3 months ago

It seems that ffmpeg has a HAVE_INTRINSICS_NEON generated from configure. Is it necessary to add one for SSE and AVX as well? Just for completeness, in case our downstream users (e.g. SynoCommunity) are using legacy compilers and resulting in breakages.

edit:

Disable corresponding SIMD code when header files or funcs are not available
When disabling SIMD using ffmpeg flag, the corresponding code should be excluded --disable-neon
Runtime SIMD check

gnattu commented 3 months ago

It seems that ffmpeg has a HAVE_INTRINSICS_NEON generated from configure. Is it necessary to add one for SSE and AVX as well? Just for completeness, in case our downstream users (e.g. SynoCommunity) are using legacy compilers and resulting in breakages.

edit:

Disable corresponding SIMD code when header files or funcs are not available

When disabling SIMD using ffmpeg flag, the corresponding code should be excluded --disable-neon

Runtime SIMD check

I couldn't care less about those ancient compilers. I even dropped debian buster support due to its ancient gcc. Extend the macro checking for configure flag and disable with --disable-neon is probably all I'm going to do. Check for header existence is out of scope at least to me, and checking for function availability is almost impossible unless you write a super long configure script to test compile the whole file and then define the disable intrinsics macro, but I'm not going to do that

nyanmisaka commented 3 months ago

It seems that ffmpeg has a HAVE_INTRINSICS_NEON generated from configure. Is it necessary to add one for SSE and AVX as well? Just for completeness, in case our downstream users (e.g. SynoCommunity) are using legacy compilers and resulting in breakages. edit:

Disable corresponding SIMD code when header files or funcs are not available

When disabling SIMD using ffmpeg flag, the corresponding code should be excluded --disable-neon

Runtime SIMD check

I couldn't care less about those ancient compilers. I even dropped debian buster support due to its ancient gcc. Extend the macro checking for configure flag and disable with --disable-neon is probably all I'm going to do. Check for header existence is out of scope at least to me, and checking for function availability is almost impossible unless you write a super long configure script to test compile the whole file and then define the disable intrinsics macro, but I'm not going to do that

Indeed, it is not practical to test all functions. How about checking with compiler version? When the gcc/clang version is lower than required, disable SIMD. Other compilers can be reasonably ignored.

gnattu commented 3 months ago

Indeed, it is not practical to test all functions. How about checking with compiler version? When the gcc/clang version is lower than required, disable SIMD. Other compilers can be reasonably ignored.

Done

gnattu commented 3 months ago

I made a windows clang build for testing: https://github.com/gnattu/jellyfin-ffmpeg/releases/tag/win64-clang

This performs faster than the gcc version(at least on my own machine), but more testing is needed.

nyanmisaka commented 3 months ago

0080-add-simd-optimized-tonemapx-filter.patch

I made some minor improvements:

Briefly test immtrin.h availability instead of relying entirely on compiler version
Comment out intrin code when disabling intrinsics, to prevent them from still being compiled and possibly throwing errors
Fix warning: unused variable 'cpu_flags' when intrinsics is completely disabled

Two Q:

Any chance to re-enable intrinsics for gcc 9? Our focal/20.04 build still uses it.
Seems that immintrin.h already includes emmintrin.h + smmintrin.h, maybe just including immintrin.h is enough?

gnattu commented 3 months ago

Briefly test immtrin.h availability instead of relying entirely on compiler version

This seems unnecessary because you have to test compiler version after all. GCC has immtrin.h since 4.x (forget which precisely), but what is available in that header file varies largely which means intrinsics will be unavailable even with the presence of this file and you cannot test all instructions. Unless there is a stupid compiler that lies about itself and pretend to be a modern gcc/clang and does not implement x86 intrinsics.

Any chance to re-enable intrinsics for gcc 9? Our focal/20.04 build still uses it.

It should compile, though I cannot guarantee if it really works.

Seems that immintrin.h already includes emmintrin.h + smmintrin.h, maybe just including immintrin.h is enough?

This should work on all modern compilers, not sure for ancient gcc

nyanmisaka commented 3 months ago

This seems unnecessary because you have to test compiler version after all. GCC has immtrin.h since 4.x (forget which precisely), but what is available in that header file varies largely which means intrinsics will be unavailable even with the presence of this file and you cannot test all instructions. Unless there is a stupid compiler that lies about itself and pretend to be a modern gcc/clang and does not implement x86 intrinsics.

At least this ensures that the current gcc/clang environment is capable of handling some kind of intrinsics before actually building ffmpeg, no?

If immintrin.h or one of its included headers is missing, or the package containing it is not installed or is corrupted, then simply checking the version of gcc/clang is not sufficient.

As for whether the contents of the headers are out of date, I think that is beyond our scope. If it's really necessary, we can add them manually, just like you did for _mm_storeu_si32() in gcc10-.

It should compile, though I cannot guarantee if it really works.

20.04 will reach EOL in April 2025. We can try enabling it and see.

This should work on all modern compilers, not sure for ancient gcc

I think we don't care about those ancient compilers.

jellyfin / jellyfin-ffmpeg

avfilter/tonemapx: add simd optimized tonemapx #407