jiixyj / libebur128

A library implementing the EBU R128 loudness standard.

Maximum window size is 3 milliseconds #85

Closed · astoeckel closed this issue 3 years ago

astoeckel commented 6 years ago

Since commit fa5dcc82767650936ddcae36c0b4eb4d9d802968, ebur128_set_max_window always fails with EBUR128_ERROR_NOMEM and ebur128_loudness_window fails with EBUR128_ERROR_INVALID_MODE. This is caused by the compile-time constant VALIDATE_MAX_WINDOW evaluating to 3, i.e. the maximum allowed window size is a mere 3 ms.

It seems that VALIDATE_MAX_WINDOW is at least missing a factor of one thousand to convert from seconds to milliseconds. Even then, VALIDATE_MAX_WINDOW is relatively small -- rightly so, since at the maximum sample rate and channel count one second of audio requires almost a hundred megabytes of data.
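
For illustration, a minimal sketch that reproduces the failure (assuming a stereo 48 kHz state; any window above 3 ms triggers it):

```c
#include <stdio.h>

#include "ebur128.h"

int main(void) {
  ebur128_state* st = ebur128_init(2, 48000, EBUR128_MODE_M);
  if (!st) {
    return 1;
  }

  /* Request a modest 400 ms analysis window. Because VALIDATE_MAX_WINDOW
   * evaluates to 3 (milliseconds), this is rejected. */
  if (ebur128_set_max_window(st, 400) == EBUR128_ERROR_NOMEM) {
    printf("ebur128_set_max_window rejected a 400 ms window\n");
  }

  ebur128_destroy(&st);
  return 0;
}
```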

I propose one of the following solutions:

* Restore the missing factor of one thousand and keep VALIDATE_MAX_WINDOW as a fixed compile-time limit.
* Calculate the maximum allowed window size dynamically from the sample rate and channel count of the state.

I think the latter solution is the way to go. However, it may be beneficial to store the max window size in the state struct for performance reasons.

jiixyj commented 6 years ago

Good catch! What is a good limit for max_window? I'm unsure what people use in practice here. Is 30 seconds enough?

Let's say we have a 30 second window, 32 channels, and a sample rate of 384000. The audio_data array would have 30 * 32 * 384000 * sizeof(double) bytes of data, which is nearly 3 GB. This is huge, but won't overflow a 32-bit size_t.

I guess we could calculate all limits dynamically, but this would complicate the validation logic. I would like to avoid that if possible.
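
For illustration, a dynamic limit could be derived along these lines (max_window_ms is a hypothetical helper, not library code; it only bounds the allocation size, and a real implementation would likely cap it lower and also guard the final multiplication):

```c
#include <stddef.h> /* size_t */
#include <stdint.h> /* SIZE_MAX */

/* Longest window (in ms, truncated to whole seconds) whose audio_data
 * buffer -- frames * channels * sizeof(double) -- still fits in size_t. */
static unsigned long max_window_ms(unsigned long samplerate,
                                   unsigned int channels) {
  size_t samples_per_sec = (size_t) samplerate * channels;
  if (samples_per_sec == 0) {
    return 0;
  }
  size_t max_seconds = SIZE_MAX / sizeof(double) / samples_per_sec;
  return (unsigned long) max_seconds * 1000;
}
```

On a 32-bit system with 32 channels at 384000 Hz this comes out to 43 seconds, consistent with the 30-second window above not overflowing.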

bmatherly commented 6 years ago

What is a good limit for max_window?

In Shotcut, we allow a window range of 2-600 seconds. But I think we could reasonably limit that to 30 seconds and still cover most video editing use cases.

Skuldwyrm commented 6 years ago

I haven't looked at the code (just glanced at the header file in question), so if I'm speaking out of my ass, my apologies for that.

1000 ms (1 second) should be enough; that would be 98,304,000 bytes total for 32 channels of 64-bit floats at 384000 Hz.

This is loudness scanning, not audio processing (like reverb or echo), so there is no need to "remember" more than 1 second at a time. A video editor like Shotcut might use 32 channels of audio, but a DAW might use way more (hundreds). Sure, with a 64-bit OS and executables this is less of an issue, but it could easily exhaust memory for prosumers; there is no point using more memory than absolutely needed.

On a private project I calculate the RMS of audio over the buffer length (which could be 1 second or only 100 ms, for example); the result of that is summed with the previous one, and after the scanning is over the sum is divided by the number of buffers scanned. This means that during scanning, regardless of buffer size, only 8 bytes (a 64-bit float, aka double) are used per channel; it's hard to get more memory efficient than that.

Sure, there is some precision loss when you start summing thousands or millions of buffers (hours or days) of audio, but at that stage the RMS value (or, in the case of EBU R128, the integrated loudness) should be fairly stable and within the margin of error anyway. The longest audio would be continuous double albums (1-2 hours) or long films or documentaries (3-5 hours), but those are rare; things tend to be split into parts at that point. The true peak is even easier, as you only check whether the current (true) peak is higher than the stored one, and there is no precision loss since no summing is needed.

Another solution (though this is more lossy) is to resample to 48 kHz and then do the 4x true-peak oversampling on that. That would simplify code and use less memory, though I have no clue whether that breaks the EBU R128 specs. For use in histograms it may be acceptable; the user just needs a visual approximation, and if things are off by a pixel visually it does not matter as long as the numbers are correct or within the accepted margin of error.

Edit: Do note that the way I do the RMS (I'm describing from memory here) is that I skip over the value 0.0 (silence) to save some CPU cycles; silence is meaningless in the numbers as well. I fetch the sample point and check the peak; if the current value is higher, I update the max peak for the current buffer. I then square the value and add it to the buffer sum. Once the buffer loop is done, I calculate the average to get the mean and take the root of that (resulting in an RMS value), which is returned to the calling procedure, where the RMS values are summed; when all audio buffers are done, the average is calculated. The max peak is just carried over from buffer to buffer. This keeps all variables in CPU registers when possible (read into registers just before the buffer loop) and makes a very tight loop. I also "cheat" by simply processing sample by sample with channels interleaved (which is very common), so the code is essentially channel agnostic: it could be 1 channel or 1024 channels, the loop is the same.
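
A sketch of that per-buffer pass, under the same assumptions (interleaved doubles, peak carried across buffers, caller averages the returned RMS values; none of this is libebur128 code):

```c
#include <math.h>   /* fabs, sqrt */
#include <stddef.h> /* size_t */

/* One pass over an interleaved buffer: track the max peak and return the
 * buffer's RMS. The caller sums the returned values and divides by the
 * number of buffers once scanning is done; `peak` simply carries over. */
static double buffer_rms(const double* samples, size_t count, double* peak) {
  double sum = 0.0;
  double max_peak = *peak;
  for (size_t i = 0; i < count; ++i) {
    const double s = samples[i];
    if (s == 0.0) {
      continue; /* skip silence; 0.0 adds nothing to the sum anyway */
    }
    const double a = fabs(s);
    if (a > max_peak) {
      max_peak = a;
    }
    sum += s * s; /* square and accumulate */
  }
  *peak = max_peak;
  /* mean of the squares, then the root */
  return count > 0 ? sqrt(sum / (double) count) : 0.0;
}
```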

bmatherly commented 6 years ago

1000 ms (1 second) should be enough; that would be 98,304,000 bytes total for 32 channels of 64-bit floats at 384000 Hz.

In MLT (and therefore Shotcut), the windowed loudness measurement is an input into an Automatic Gain Control algorithm which uses that measurement to correct the audio loudness. The speed of the gain control is user configurable by setting the size of the window. 1 second is too short for most applications like this. I think that 30 seconds (which I offered previously) would be the smallest duration that I would want to offer as the "maximum window size".
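
For illustration, the correction step might look like this sketch (agc_gain and the -23 LUFS target are illustrative, not MLT's actual code; it assumes the state has been fed frames and its max window was raised accordingly via ebur128_set_max_window):

```c
#include <math.h> /* isfinite, pow */

#include "ebur128.h"

/* Measure loudness over a user-configurable window and derive a linear
 * gain toward a target loudness (EBU R128 programme target is -23 LUFS). */
static double agc_gain(ebur128_state* st, unsigned long window_ms,
                       double target_lufs) {
  double measured;
  if (ebur128_loudness_window(st, window_ms, &measured) != EBUR128_SUCCESS ||
      !isfinite(measured)) {
    return 1.0; /* on error or silence, leave the audio untouched */
  }
  /* dB offset between target and measurement, converted to linear gain */
  return pow(10.0, (target_lufs - measured) / 20.0);
}
```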

On a private project I calculate the RMS of audio over the buffer length (which could be 1 second or only 100 ms, for example); the result of that is summed with the previous one, and after the scanning is over the sum is divided by the number of buffers scanned.

I suppose I could implement this in MLT. The disadvantage is that it is less convenient and less accurate than the current method, so it would be extra work with no benefit. The practical applications for this feature are: <= 6 channels, <= 48 kHz, <= 30 seconds.

My suggestion would be that we not get too worried about the worst case scenarios (32 channels, 384000Hz) because there aren't real use cases that would require that worst case configuration.

Skuldwyrm commented 6 years ago

In MLT (and therefore Shotcut), the windowed loudness measurement is an input into an Automatic Gain Control algorithm

But isn't this out of scope for libebur128, though? Couldn't the results from the lib just be "concatenated" in the app (Shotcut or MLT)? Think of it this way: if a different lib were to be used, you'd have to rely on that lib to have a big enough window. libebur128 scans loudness and returns a few values; I'm not sure if it should also have "extra" code.

For example, a program I'm working on will have an RMS lib (that I've made myself) and will also use libebur128. The user can choose either (a choice between EBU R128 and RMS Z-weighted, basically). I'll be displaying a visualization of the values for the entire audio track's loudness as well as handling crossfades based on these values. It would make little sense to use the windowing in libebur128 and roll my own just for the RMS lib. I'd either have to add the same to the RMS lib or (which is what I plan to do) add the windowing/waveform/loudness adjustment/gain stuff in the app itself. This is also more future proof, as I can just "slot in" new or alternate loudness algorithms, or use a different loudness lib that may or may not have windowing/history implemented.

My suggestion would be that we not get too worried about the worst case scenarios (32 channels, 384000Hz) because there aren't real use cases that would require that worst case configuration.

Probably true. But does MLT (or Shotcut) do a memory check so there isn't an unhandled out-of-memory situation? You just know there'll be some insane guy out there. laughs So do a memory usage calculation and fail with a "nope, can't do that much" memory error. Unless libebur128 should do the memory check.

BTW! The following is pretty much off-topic (in regards to libebur128, mostly), so ignore it if not interested.

Regarding MLT/Shotcut..

Why use AGC? If you do EBU R128 on the entire clip, you just need to adjust the gain for the entire clip. Sure, if the user splits a video into multiple clips for editing this is no longer correct, but then you just recalculate EBU R128 for each clip. Using multithreading (and assuming the user has a 6+ core CPU), this should be relatively quick. If you are relying on the loudness scanner lib to handle your AGC window, you might be attacking the problem the wrong way. Wouldn't an AGC library that uses libebur128 be better (presuming it doesn't have similar limitations)? The thing with AGC is that it's endless; it was originally meant for live audio broadcasts. I do believe the EBU R128 specs mention how you should do things for live audio (they are more lenient on the deviation but stricter on the true peak, -2 dBTP I believe).

Sorry for rambling on. I just find it kinda ludicrous to use several gigs of memory if one happens to use many channels of audio, memory that would be better used (in this case) for, I don't know, caching video frames and images/overlays etc. If using AGC gobbles up half a gig to a gig on someone's rig and they are a DIY home video creator, that could cause issues (the OS would dive into the pagefile, system slowdown, etc.). So for the long term, maybe find a more memory-efficient way to do this in Shotcut and MLT perhaps(?).

You say "<= 6 channels, <= 48 kHz", but is this true if they import multiple audio tracks for layered sound? 6 channels seems "ok" as far as rendering out video, but for editing you might have 20 audio tracks, possibly with different bit depths and frequencies (they might load a 24-bit 96 kHz FLAC for background music, for example). Or do you just run the AGC on the final rendering output? (Which is odd, as you'd probably want it on dialog etc.) Then again, I certainly never do AGC etc. in the video editor; audio is prepped in Audacity first (which sadly does not have EBU R128 built in yet) and imported into Shotcut.

Sorry for droning on, I'm sorta the guy at the back of the room that always goes "um, hang on a sec" and questions everything.

bmatherly commented 6 years ago

But isn't this out of scope for libebur128, though? Couldn't the results from the lib just be "concatenated" in the app (Shotcut or MLT)? Think of it this way: if a different lib were to be used, you'd have to rely on that lib to have a big enough window. libebur128 scans loudness and returns a few values; I'm not sure if it should also have "extra" code.

I would suggest that it is in scope. R128 itself never specifies the 0.4 s and 3 s time constants; it is just a recommendation that references other technical documents. One of those documents, EBU Tech Doc 3341, does specify the 0.4 s and 3 s time constants. That same document also says:

There may be cases where it is relevant to use other window lengths or time constants than those specified above. This is allowed in a loudness meter offering ‘EBU Mode’, but it should be clearly indicated on the meter whether or not the set of EBU parameters are in effect (‘EBU Mode’).

Another way to look at it: One could consider the time constant specific functions (ebur128_loudness_momentary & ebur128_loudness_shortterm) to be "extra code". The application could just call ebur128_loudness_window() with the appropriate time constants.
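
A sketch of that, assuming the state was created with EBUR128_MODE_S (which implies EBUR128_MODE_M) and has already been fed frames:

```c
#include <stdio.h>

#include "ebur128.h"

/* Replicate the momentary (0.4 s) and short-term (3 s) meters through the
 * generic window call instead of the dedicated functions. */
static void print_standard_windows(ebur128_state* st) {
  double momentary, shortterm;
  if (ebur128_loudness_window(st, 400, &momentary) == EBUR128_SUCCESS) {
    printf("momentary:  %.1f LUFS\n", momentary);
  }
  if (ebur128_loudness_window(st, 3000, &shortterm) == EBUR128_SUCCESS) {
    printf("short-term: %.1f LUFS\n", shortterm);
  }
}
```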

I'll be displaying a visualization of the values for the entire audio track's loudness as well as handling crossfades based on these values. It would make little sense to use the windowing in libebur128 and roll my own just for the RMS lib.

I find your application interesting, but I'm not sure I understand how it will work. Shortterm and momentary values are both windowed. They just use different time constants. Which do you plan to use in your application? Momentary (0.4s) or Shortterm (3s)? And you plan to take those windowed values and then feed them into another windowing function?

Why use AGC? If you do EBU R128 on the entire clip you just need to adjust the gain for the entire clip.

Your question presumes that the clips are pre-mastered and just need a fixed gain offset. But many clips are not pre-mastered. Imagine a video recording that a parent makes of their child's band concert. While the band is playing, the loudness may be fine. But during applause it could be too high, and while the conductor is announcing the next song, it could be too low. A fixed gain offset would not give satisfactory results. The user could chop up the clip and apply a fixed gain adjustment to each piece, but that is not convenient. The AGC provides satisfactory results with very high convenience. But the user needs to be able to tune the window duration for their specific situation to get the best results.

jiixyj commented 3 years ago

I just merged a fix for the original problem to the master branch. Please (re-)open if you find any related issues.