mapillary / mapillary_tools

Command line tools for processing and uploading Mapillary imagery
BSD 2-Clause "Simplified" License
256 stars 134 forks source link

generate select filter as binary search for less stack consumption during evaluation #673

Open rnissl opened 3 months ago

rnissl commented 3 months ago

The current approach of calculating the sum of all equality tests requires a lot of stack space as FFmpeg's evaluator uses recursion to handle the add. Hence, FFmpeg crashed with stack overflow when selecting a larger number of images.

Some numbers for selecting N images (executed per decoded image):

The patch implements a binary search algorithm to reduce stack consumption and also reduces the number of instruction to execute per decoded image:

ptpt commented 3 months ago

@rnissl Thanks a lot for the PR. I did notice some performance issues when sampling large videos but never thought it was the way we do filtering. I will take a look how ffmpeg filters work, and merge it ASAP.

rnissl commented 2 months ago

@ptpt Well, I don‘t think, that you‘ll find documentation about how FFmpeg executes expressions beside the code itself 😉

It‘s common sense how they do it, if one doesn‘t choose to evaluate while parsing the expression. Think of a good old calculator: when you type „1+2+“, the display already shows 3 before you continue typing, so it evaluates while parsing.

When one separates parsing and evaluating, an expression tree (Syntaxbaum) will be created once as result of parsing, and will be travelled any number of times for evaluation — in our case per decoded image.

Let‘s try a simple grammar for evaluating a sum, using the naming from the wiki (sorry, couldn‘t find an english version of it):

E :: E + T
     T

This means, an expression is either a expression plus a term or a term only.

If you feed „1+2+3+4“ into this parser, you‘ll get „((1+2)+3)+4“ as expression tree — I‘ve used brackets to illustrate the sub trees. During evaluation, FFmpeg starts with the rightmost +, picks the left operand, sees it‘s an expression, recurses and gets the +, picks the left operand, which is an expression too, recurses and gets a +, picks the left operand 1, picks the right operand 2, executes the add, returns the result 3, picks the right operand 3, executes the add, returns the result 6, picks the right operand 4, executes the add, and you get the final result 10 😀

As you can see from the brackets, the expression tree isn’t balanced, it actually degenerates to a list if one ignores the right operands. So for each element added, you’ll get an additional recursion and this will eat up the stack space of your program, hence limits the number of images you can successfully select without FFmpeg crashing 😕

I was first looking for a way to balance the tree like that „(1+2)+(3+4)“ — result is the same, but instead of N recursions there will only be ld(N). But FFmpegs expressions do not support brackets in parsing to structure the expression tree accordingly. Using an „if(1, …)“ could be used as a workaround for the missing brackets.

But when an „if“ has to be used, then a binary search can be implemented as well, avoiding to evaluate equality tests which will always be false.

rnissl commented 2 months ago

Visualization of mentioned expression tree Visualization of binary search based select filter

ptpt commented 2 months ago

@rnissl thanks for the insights, and it makes total sense.

Did you reproduce the stack overflow? and what ffmpeg setup/parameters (video size, # of samples, etc) did you start to notice the improvement?

rnissl commented 2 months ago

@ptpt This is the observed stack overflow in ffmpeg:

grafik

If you keep my simple expression tree in mind, it is currently walking down the left nodes (line 325). In line 326, it would evaluate the right node to finally perform the add in line 340.

The command line: C:\msys64\home\rnissl\FFmpeg\ffmpeg_g.exe -loglevel debug -c:v hevc_cuvid -crop 0x800x0x0 -hide_banner -nostdin -i ..\Input\2024-03-14\GX010001.MP4 -map 0:0 -filter_script:v E:\tmpxjsvt_3b -vsync 0 -frames:0 8676 -frame_pts 1 -qscale:0 2 E:\Output\x_%06d.jpg

The first time I've seen the stack overflow was for 8590 frames. At that time, I've hacked mapillary_tools and processed the file in two runs, processing only halve of the frames in question per run.

With my patch, 8676 frames will only cause 15 recursive calls -- far less than the 8677 which led to the crash. As roughly 4096 frames worked before, this means you'll now be able to select at least 2^4096 frames.

Regarding performance: My generation 1 core i7 965 from 2008 is more busy with demuxing the H.265 video (5312x2988) and jpeg compression, so I didn't notice a speed gain regarding select filter. As you can see from the command line, I even had to add nVidia hardware decoding to be able to decode in real time (actually 1.3x). Without, the job wasn't feasible (0.9x).