ultravideo / kvazaar

An open-source HEVC encoder
BSD 3-Clause "New" or "Revised" License
826 stars 178 forks source link

test_weird_shapes.sh runs forever on i386 #412

Closed fancycode closed 3 weeks ago

fancycode commented 1 month ago

While packaging kvazaar 2.3.0 / 2.3.1 for Debian I noticed the test "test_weird_shapes.sh" takes forever (i.e. must be killed after taking 100% CPU for multiple minutes without visible progress) when running on i386.

This is the output of tests/test_weird_shapes.sh.log:

``` $ ffmpeg -f lavfi -i mandelbrot=size=16x16 -vframes 10 -pix_fmt yuv420p -f yuv4mpegpipe /tmp/tmp.OuHYUAnbqq ffmpeg version 7.0.2-1 Copyright (c) 2000-2024 the FFmpeg developers built with gcc 14 (Debian 14.2.0-1) configuration: --prefix=/usr --extra-version=1 --toolchain=hardened --libdir=/usr/lib/i386-linux-gnu --incdir=/usr/include/i386-linux-gnu --arch=i386 --enable-gpl --disable-stripping --disable-libmfx --disable-omx --enable-gnutls --enable-libaom --enable-libass --enable-libbs2b --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libharfbuzz --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-openal --enable-opencl --enable-opengl --disable-sndio --disable-libvpl --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-ladspa --enable-libbluray --enable-libcaca --enable-libdvdnav --enable-libdvdread --enable-libjack --enable-libpulse --enable-librabbitmq --enable-librist --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libx264 --enable-libzmq --enable-libzvbi --enable-lv2 --enable-sdl2 --enable-libplacebo --enable-librav1e --enable-pocketsphinx --enable-librsvg --enable-libjxl --ignore-tests='hapenc-hap-none,hapenc-hapa-none,hapenc-hapq-none' --enable-shared libavutil 59. 8.100 / 59. 8.100 libavcodec 61. 3.100 / 61. 3.100 libavformat 61. 1.100 / 61. 1.100 libavdevice 61. 1.100 / 61. 1.100 libavfilter 10. 1.100 / 10. 1.100 libswscale 8. 1.100 / 8. 1.100 libswresample 5. 1.100 / 5. 1.100 libpostproc 58. 1.100 / 58. 1.100 Input #0, lavfi, from 'mandelbrot=size=16x16': Duration: N/A, start: 0.000000, bitrate: N/A Stream #0:0: Video: wrapped_avframe, rgb0, 16x16 [SAR 1:1 DAR 1:1], 25 fps, 25 tbr, 25 tbn Stream mapping: Stream #0:0 -> #0:0 (wrapped_avframe (native) -> wrapped_avframe (native)) Press [q] to stop, [?] for help Output #0, yuv4mpegpipe, to '/tmp/tmp.OuHYUAnbqq': Metadata: encoder : Lavf61.1.100 Stream #0:0: Video: wrapped_avframe, yuv420p(progressive), 16x16 [SAR 1:1 DAR 1:1], q=2-31, 200 kb/s, 25 fps, 25 tbn Metadata: encoder : Lavc61.3.100 wrapped_avframe [out#0/yuv4mpegpipe @ 0x56dddc00] video:3KiB audio:0KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 28.441558% frame= 10 fps=0.0 q=-0.0 Lsize= 4KiB time=00:00:00.40 bitrate= 79.1kbits/s speed= 25x $ ../libtool execute ../src/kvazaar -i /tmp/tmp.OuHYUAnbqq --input-res=16x16 -o /tmp/tmp.7ag4ffNHao --threads=2 --owf=1 --preset=veryslow Using preset veryslow: --rd=3 --pu-depth-intra=1-4 --pu-depth-inter=0-3 --me=tz --gop=16 --ref=4 --bipred=1 --deblock=0:0 --signhide=1 --subme=4 --sao=full --rdoq=1 --rdoq-skip=0 --transform-skip=1 --mv-rdo=0 --full-intra-search=0 --smp=1 --amp=0 --cu-split-termination=zero --me-early-termination=off --intra-rdo-et=0 --early-skip=1 --fast-residual-cost=0 --max-merge=5 Compiled: INTEL, flags: Detected: INTEL, flags: MMX SSE SSE2 SSE3 SSSE3 SSE41 SSE42 AVX AVX2 Available: avx2(38) sse2(2) sse41(4) In use: avx2(38) sse2(1) Input: /tmp/tmp.OuHYUAnbqq, output: /tmp/tmp.7ag4ffNHao Video size: 16x16 (input=16x16) POC 0 QP 19 AVG QP 19.0 (I-frame) 4096 bits PSNR Y 44.1786 U 43.0120 V 42.4086 ```

The input and output files in '/tmp` are small:

-rw-r--r--  1 1234 1234  512 Aug  7 21:56 tmp.7ag4ffNHao
-rw-r--r--  1 1234 1234 3956 Aug  7 21:56 tmp.OuHYUAnbqq

Two stack traces while interrupted from running in gdb:

``` Thread 1 (Thread 0xf74bf540 (LWP 748141) "kvazaar"): #0 ver_sad_avx2 (pic_data=0xf5b02d47 "\240\2437hmk\235\244\251\252\247\231uf\\e\202:*+<\235\212\250\253\245\224up\202h\20232**17{\252\246\201`]\201\267\2227TH7/`\254~\252\240Ze\245\270\2001HOU`\263\245\214\177\251\252\235\204b\200\236+\300K_\213\251\267x{\245\252\251\241\215`W[\222Ry\230\231\257\245\202\242\250\252\252\244\226\207uZ\206\227\246\243\221\177\225\237\243\250\252\253\250\242\227\203c]_\231`\207\245\232\237\243\247\252\253\252\247\240\222q\237r\220\245\252\227\233\237\243\246\251\252\253\252\247\234\203\231\247\253\251\215\222\226\232\236\240\240\231\221\227\236\243\247\251\235\244\230\240", ref_data=0x565cd290 "\251\242\214`Z[\223Py\233\227\257\246\177\240\250\250\252\243\226\204uX\205\227\247\241\221\177\225\235\243\250\246\252\246\237\224\202e[c\231a\207\247\233\241\243\246\252\251\253\245\243\223r\236s\217\245\254\233\234\240\240\251\254\255\252\252\245\236\205\231\250\254\253\216\217\226\230\234\236\240\231\221\227\236\243\246\247\234\244\222\241\245\232\215\206w\221\240\247\214\216\201mp|\240~cFa]E\200\242\235p[O<(\227\227\242\241\233\202Bf\251\221\227\235\244\252\215\216\241\225\240", width=9, height=4, stride=16) at strategies/avx2/picture-avx2.c:1509 #1 0xf7f30742 in image_interpolated_sad (pic=0x565cebb0, ref=0x565cd0c0, pic_x=0, pic_y=0, ref_x=, ref_y=, block_width=16, block_height=, optimized_sad=0xf7f64090 ) at ./src/image.c:383 #2 kvz_image_calc_sad (pic=0x565cebb0, ref=0x565cd0c0, pic_x=0, pic_y=0, ref_x=-7, ref_y=4, block_width=16, block_height=16, optimized_sad=0xf7f64090 ) at ./src/image.c:439 --Type for more, q to quit, c to continue without paging-- #3 0xf7f49d12 in check_mv_cost (info=info@entry=0xfffd0c5c, x=-7, y=4, best_cost=0xfffd0b90, best_bits=0xfffd0b98, best_mv=0xfffd0ba0) at ./src/search_inter.c:212 #4 0xf7f4a156 in kvz_tz_pattern_search (info=0xfffd0c5c, pattern_type=0, iDist=, mv=..., best_dist=0xfffd0b88, best_cost=0xfffd0b90, best_bits=0xfffd0b98, best_mv=0xfffd0ba0) at ./src/search_inter.c:594 #5 0xf7f4b919 in tz_search (extra_mv=..., info=, best_cost=0xfffd0b90, best_bits=0xfffd0b98, best_mv=0xfffd0ba0) at ./src/search_inter.c:688 #6 search_pu_inter_ref (depth=, info=0xfffd0c5c, lcu=, cur_cu=, amvp=) at ./src/search_inter.c:1355 #7 search_pu_inter (state=state@entry=0x565ccb20, x_cu=x_cu@entry=0, y_cu=y_cu@entry=0, depth=2, part_mode=SIZE_2Nx2N, i_pu=0, lcu=0xfffe1180, amvp=0xfffd0f20, merge=0xfffd0ccc, info=0xfffd0c5c) at ./src/search_inter.c:1804 #8 0xf7f4da6f in kvz_search_cu_inter (state=0x565ccb20, x=0, y=0, depth=2, lcu=0xfffe1180, inter_cost=0xfffd1710, inter_bitcost=0xfffd1718) at ./src/search_inter.c:2188 #9 0xf7f4651f in search_cu (state=state@entry=0x565ccb20, x=x@entry=0, y=y@entry=0, depth=2, work_tree=0xfffd2110) at ./src/search.c:717 #10 0xf7f460eb in search_cu (state=state@entry=0x565ccb20, x=x@entry=0, y=y@entry=0, depth=, work_tree=) at ./src/search.c:963 #11 0xf7f460eb in search_cu (state=state@entry=0x565ccb20, x=x@entry=0, y=y@entry=0, depth=, work_tree=) at ./src/search.c:963 #12 0xf7f487dc in kvz_search_lcu (state=, x=, y=, hor_buf=, ver_buf=) at ./src/search.c:1214 #13 0xf7f253f0 in encoder_state_worker_encode_lcu (opaque=0x565ccf80) at ./src/encoderstate.c:659 #14 0xf7f26934 in encoder_state_encode_leaf (state=0x565ccb20) at ./src/encoderstate.c:800 #15 encoder_state_encode (main_state=main_state@entry=0x565ccb20) at ./src/encoderstate.c:1030 #16 0xf7f26f7b in encoder_state_worker_encode_children (opaque=0x565ccb20) at ./src/encoderstate.c:912 #17 0xf7f26889 in encoder_state_encode (main_state=main_state@entry=0x565cc260) at ./src/encoderstate.c:1022 #18 0xf7f26f7b in encoder_state_worker_encode_children (opaque=0x565cc260) at ./src/encoderstate.c:912 #19 0xf7f26889 in encoder_state_encode (main_state=main_state@entry=0x565c9be0) at ./src/encoderstate.c:1022 #20 0xf7f26ed4 in kvz_encode_one_frame (state=0x565c9be0, frame=0xf5b02ba0) at ./src/encoderstate.c:1664 #21 0xf7f367a8 in kvazaar_encode (enc=enc@entry=0x565660c0, pic_in=pic_in@entry=0x0, data_out=data_out@entry=0xffffcf14, len_out=0xffffcf34, pic_out=0xffffcf18, src_out=0xffffcf1c, info_out=0xffffd014) at ./src/kvazaar.c:275 #22 0xf7f36cda in kvazaar_field_encoding_adapter (enc=0x565660c0, pic_in=0x0, data_out=0xffffcf14, len_out=0xffffcf34, pic_out=0xffffcf18, src_out=0xffffcf1c, info_out=0xffffd014) at ./src/kvazaar.c:327 #23 0x565566e2 in main (argc=, argv=) at ./src/encmain.c:620 ```
``` #0 0xf7f306e9 in cor_sad ( ref_data=0x565cd290 "\251\242\214`Z[\223Py\233\227\257\246\177\240\250\250\252\243\226\204uX\205\227\247\241\221\177\225\235\243\250\246\252\246\237\224\202e[c\231a\207\247\233\241\243\246\252\251\253\245\243\223r\236s\217\245\254\233\234\240\240\251\254\255\252\252\245\236\205\231\250\254\253\216\217\226\230\234\236\240\231\221\227\236\243\246\247\234\244\222\241\245\232\215\206w\221\240\247\214\216\201mp|\240~cFa]E\200\242\235p[O<(\227\227\242\241\233\202Bf\251\221\227\235\244\252\215\216\241\225\240", pic_data=, block_width=15, block_height=8, pic_stride=) at ./src/image.c:258 #1 image_interpolated_sad (pic=0x565cebb0, ref=0x565cd0c0, pic_x=0, pic_y=0, ref_x=, ref_y=, block_width=16, block_height=, optimized_sad=0xf7f64090 ) at ./src/image.c:354 #2 kvz_image_calc_sad (pic=0x565cebb0, ref=0x565cd0c0, pic_x=0, pic_y=0, ref_x=-15, ref_y=8, block_width=16, block_height=16, optimized_sad=0xf7f64090 ) at ./src/image.c:439 #3 0xf7f49d12 in check_mv_cost (info=info@entry=0xfffd0c5c, x=-15, y=8, best_cost=0xfffd0b90, best_bits=0xfffd0b98, best_mv=0xfffd0ba0) at ./src/search_inter.c:212 #4 0xf7f4a156 in kvz_tz_pattern_search (info=0xfffd0c5c, pattern_type=0, iDist=, mv=..., best_dist=0xfffd0b88, best_cost=0xfffd0b90, best_bits=0xfffd0b98, best_mv=0xfffd0ba0) at ./src/search_inter.c:594 #5 0xf7f4b919 in tz_search (extra_mv=..., info=, best_cost=0xfffd0b90, best_bits=0xfffd0b98, best_mv=0xfffd0ba0) at ./src/search_inter.c:688 #6 search_pu_inter_ref (depth=, info=0xfffd0c5c, lcu=, cur_cu=, amvp=) at ./src/search_inter.c:1355 #7 search_pu_inter (state=state@entry=0x565ccb20, x_cu=x_cu@entry=0, y_cu=y_cu@entry=0, depth=2, part_mode=SIZE_2Nx2N, i_pu=0, lcu=0xfffe1180, amvp=0xfffd0f20, merge=0xfffd0ccc, info=0xfffd0c5c) at ./src/search_inter.c:1804 #8 0xf7f4da6f in kvz_search_cu_inter (state=0x565ccb20, x=0, y=0, depth=2, lcu=0xfffe1180, inter_cost=0xfffd1710, inter_bitcost=0xfffd1718) at ./src/search_inter.c:2188 #9 0xf7f4651f in search_cu (state=state@entry=0x565ccb20, x=x@entry=0, y=y@entry=0, depth=2, work_tree=0xfffd2110) at ./src/search.c:717 #10 0xf7f460eb in search_cu (state=state@entry=0x565ccb20, x=x@entry=0, y=y@entry=0, depth=, work_tree=) at ./src/search.c:963 #11 0xf7f460eb in search_cu (state=state@entry=0x565ccb20, x=x@entry=0, y=y@entry=0, depth=, work_tree=) at ./src/search.c:963 #12 0xf7f487dc in kvz_search_lcu (state=, x=, y=, hor_buf=, ver_buf=) at ./src/search.c:1214 #13 0xf7f253f0 in encoder_state_worker_encode_lcu (opaque=0x565ccf80) at ./src/encoderstate.c:659 #14 0xf7f26934 in encoder_state_encode_leaf (state=0x565ccb20) at ./src/encoderstate.c:800 --Type for more, q to quit, c to continue without paging-- #15 encoder_state_encode (main_state=main_state@entry=0x565ccb20) at ./src/encoderstate.c:1030 #16 0xf7f26f7b in encoder_state_worker_encode_children (opaque=0x565ccb20) at ./src/encoderstate.c:912 #17 0xf7f26889 in encoder_state_encode (main_state=main_state@entry=0x565cc260) at ./src/encoderstate.c:1022 #18 0xf7f26f7b in encoder_state_worker_encode_children (opaque=0x565cc260) at ./src/encoderstate.c:912 #19 0xf7f26889 in encoder_state_encode (main_state=main_state@entry=0x565c9be0) at ./src/encoderstate.c:1022 #20 0xf7f26ed4 in kvz_encode_one_frame (state=0x565c9be0, frame=0xf5b02ba0) at ./src/encoderstate.c:1664 #21 0xf7f367a8 in kvazaar_encode (enc=enc@entry=0x565660c0, pic_in=pic_in@entry=0x0, data_out=data_out@entry=0xffffcf14, len_out=0xffffcf34, pic_out=0xffffcf18, src_out=0xffffcf1c, info_out=0xffffd014) at ./src/kvazaar.c:275 #22 0xf7f36cda in kvazaar_field_encoding_adapter (enc=0x565660c0, pic_in=0x0, data_out=0xffffcf14, len_out=0xffffcf34, pic_out=0xffffcf18, src_out=0xffffcf1c, info_out=0xffffd014) at ./src/kvazaar.c:327 #23 0x565566e2 in main (argc=, argv=) at ./src/encmain.c:620 ```

The problem doesn't happen on amd64 or most other platforms (see https://buildd.debian.org/status/logs.php?pkg=kvazaar&ver=2.3.0-1&suite=sid).

Is this expected or do you have any suggestions on how to resolve this?

fador commented 1 month ago

I don't think that's expected but also I don't think it should use AVX2 for an i386 platform so something weird going on 😅

Jovasa commented 1 month ago

Are you using automake or cmake for the building?

fancycode commented 1 month ago

Are you using automake or cmake for the building?

I'm using automake, see here for the output from a run on the Debian buildd: https://buildd.debian.org/status/fetch.php?pkg=kvazaar&arch=i386&ver=2.3.0-1&stamp=1715809980&raw=0

I'll try to find steps how this can be reproduced locally easily, i.e. without building the Debian package.

fancycode commented 1 month ago

I don't think that's expected but also I don't think it should use AVX2 for an i386 platform so something weird going on 😅

The logs above were from my local (amd64) machine where I built and ran the 32bit package, maybe that's why it detected AVX2? Running the 32bit compiled code on amd64 should work fine, at least the other tests are passing.

fancycode commented 1 month ago

Here are the steps to reproduce this locally in a 32bit Docker container:

docker run --rm -ti i386/debian:sid /bin/bash

Then inside Docker:

apt-get update && apt-get install -y --no-install-recommends \
  autoconf \
  automake \
  build-essential \
  ca-certificates \
  ffmpeg \
  g++ \
  gcc \
  git \
  hm \
  libtool \
  make

git clone https://github.com/ultravideo/kvazaar.git
cd kvazaar
./autogen.sh

./configure --build=i686-linux-gnu
make -j$(nproc)
make check

While this hangs, you can connect to the running container from a second terminal and use gdb to inspect the process.

fancycode commented 1 month ago

Slightly related: is there a flag to show the output of the tests while they are running instead of just writing to the test log files?

fador commented 1 month ago

I think Automake uses only logs for the tests because of parallel test running. But you can always run cat tests/*.log afterwards =)

I pushed a fix 6bd23734070655610cd415ae3873a5b043198e33 that might solve the issue. It should not have been a problem to align buffers but for some reason it failed on the i386 in some cases..

fancycode commented 1 month ago

Thanks, unfortunately this didn't help and the test is still running with 100% CPU until cancelled. Do the steps described above work for you with latest master (i.e. running in 32bit Docker)?

fador commented 1 month ago

strange, I did test things with the Docker container but I was using --disable-shared and it was causing some other similar problem 😅 Now when testing again I did found another problem with comparing double values, I'm pushing a fix soon

fador commented 1 month ago

Now there's another fix 2f9a21423a1ef666c93f15e2068bcde979d20f7e

fancycode commented 1 month ago

Now there's another fix 2f9a214

Thanks! With only this applied on an otherwise unmodified 2.3.1, I can no longer reproduce the problem locally. I'll prepare an update for the Debian packaging to see if this also fixes the problem on the other architectures.

fancycode commented 3 weeks ago

I'll prepare an update for the Debian packaging to see if this also fixes the problem on the other architectures.

The update is building fine now on the Debian build infrastructure, is available in Debian unstable and should be available in testing in the next couple of days.

Thanks for the fix!