slhck / ffmpeg-normalize

Audio Normalization for Python/ffmpeg

Wiki request, more information about what these loudnorm details mean exactly? #214

Closed CraigWatt closed 1 year ago

CraigWatt commented 1 year ago

Here is an example:

"stream_id": 1,
        "ebu": {
            "input_i": -15.58,
            "input_tp": 6.14,
            "input_lra": 23.3,
            "input_thresh": -27.82,
            "output_i": -26.71,
            "output_tp": -4.48,
            "output_lra": 21.6,
            "output_thresh": -38.56,
            "normalization_type": "dynamic",
            "target_offset": -0.29
        },
        "mean": null,
        "max": null

What do each of these variables mean exactly?

This is the closest thing to documentation that I could find: http://k.ylo.ph/2016/04/04/loudnorm.html, but I'm still not 100% sure what these variables mean exactly, and what they mean in context (input vs. output).

Input Integrated:    -27.5 LUFS
Input True Peak:      -4.5 dBTP
Input LRA:            18.1 LU
Input Threshold:     -39.2 LUFS

Output Integrated:   -16.0 LUFS
Output True Peak:     -1.5 dBTP
Output LRA:           14.6 LU
Output Threshold:    -27.2 LUFS

Normalization Type:   Dynamic
Target Offset:        +0.0 LU

For example, is an output_thresh of -38.56 considered 'not ideal'? I start to get a little lost at this point.

slhck commented 1 year ago

The detailed description of the algorithm is in Recommendation ITU-R BS.1770-4. If I read this correctly, the threshold you see is determined dynamically (after a first fixed threshold at -70 LKFS):

[screenshot of the gated loudness measurement and relative threshold definition from ITU-R BS.1770-4]
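
In condensed form (my own paraphrase of the Recommendation, so please check it against the actual text), the two-stage gating looks like this:

$$\Gamma_a = -70\ \text{LKFS}$$

$$\Gamma_r = \left(\text{loudness of all blocks } j \text{ with } l_j > \Gamma_a\right) - 10\ \text{LU}$$

$$L_{KG} = \text{loudness of all blocks } j \text{ with } l_j > \Gamma_a \text{ and } l_j > \Gamma_r$$

Here $l_j$ is the loudness of the $j$-th 400 ms gating block. As far as I can tell, the input_thresh and output_thresh values reported by loudnorm are this relative threshold $\Gamma_r$, measured on the input and the output signal respectively.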

The threshold is used for a gate that is applied to the input signal. It's dependent on the signal and its loudness range, and it's not inherently good or bad — it's just used for the algorithm to determine what constitutes silence and what should be factored into the loudness computation.

If you have a rather low threshold, this indicates a signal with a larger loudness range, where there are many regions with low volume that are still factored into the loudness computation.

The output threshold is, as far as I can tell, the input threshold ± the loudness offset applied by the algorithm. For instance, in your example, the input integrated loudness is -27.5 LUFS; the output is -16.0 LUFS. The difference of 11.5 LUFS is more or less the difference between -39.2 LUFS input threshold and -27.2 LUFS output threshold, which is 12 LUFS.

Basically, when you shift the signal loudness upwards or downwards, the threshold should change accordingly, plus or minus a slight difference (since it's a two-stage process and the first, fixed threshold may remove some parts before).
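
As a quick sanity check, here is that arithmetic with the example values from above (purely illustrative):

# Example values quoted from the k.ylo.ph post above (all in LUFS)
input_i, output_i = -27.5, -16.0
input_thresh, output_thresh = -39.2, -27.2

loudness_shift = output_i - input_i               # +11.5 LU applied by the filter
threshold_shift = output_thresh - input_thresh    # +12.0 LU observed on the gate

print(f"loudness shift:  {loudness_shift:+.1f} LU")
print(f"threshold shift: {threshold_shift:+.1f} LU")
# The two shifts are close but not identical, because the gate is re-evaluated
# on the output signal rather than simply shifted along with it.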

The respective code in FFmpeg applies the same computations to the input and output of the EBU R128 algorithm:

https://github.com/FFmpeg/FFmpeg/blob/ab8cde6efa84e547ea07a0c47146091a0066c73c/libavfilter/af_loudnorm.c#L826-L851

CraigWatt commented 1 year ago

Brilliant, thanks @slhck.

Final question, what would your initial reaction be of this?:

ffmpeg-normalize -nt ebu -t -27 -lrt 18.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 18.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 18.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 18.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output

($vfo_output is just a reference to how ffmpeg-normalize is invoked within my batch video processing program, vfo) 🎉

You might be able to tell that I'm trying to use ffmpeg-normalize's EBU mode to mimic what 'Netflix does', i.e.: https://partnerhelp.netflixstudios.com/hc/en-us/articles/360001794307-Netflix-Sound-Mix-Specifications-Best-Practices-v1-4. Note that I'm only concerning myself with stereo at the moment.

So I'm aiming for -27 LUFS integrated, an 18 LU loudness range (according to Netflix, 14-18 LU is perhaps more appropriate for a living-room setting/TV speakers than, say, 20), and a true peak of -2.3 dBTP (it was -2, but I think Netflix tweaked it to -2.3).

In this example I'm going for stereo AC-3 at 640 kbps (it could easily be E-AC-3, but I'm facing hardware compatibility constraints at the moment!).

Anyway, the point is that this is a scenario where I want ffmpeg-normalize to use dynamic mode when appropriate (loudness range above 18 LU), so as to gently squeeze things up/down while hoping that dialogue remains (or becomes more) audible.

In such a scenario, is running ffmpeg-normalize FOUR times appropriate or inappropriate? I've noticed that ffmpeg-normalize is gradually able to bring things into the LU range (maybe after 2-3 runs).

I'm about to run some listening tests of this shortly. I had some success with running it twice; I'm now about to see whether running it 4 times makes things better or worse.

CraigWatt commented 1 year ago

Findings:

Example 1 (starts above LU limiter):

"stream_id": 1,
        "ebu": {
            "input_i": -21.06,
            "input_tp": 0.99,
            "input_lra": 21.8,
            "input_thresh": -33.23,
            "output_i": -26.12,
            "output_tp": -2.3,
            "output_lra": 20.5,
            "output_thresh": -37.97,
            "normalization_type": "dynamic",
            "target_offset": -0.88
        },
        "mean": null,
        "max": null

"stream_id": 1,
        "ebu": {
            "input_i": -27.03,
            "input_tp": -2.65,
            "input_lra": 20.5,
            "input_thresh": -38.9,
            "output_i": -27.88,
            "output_tp": -2.3,
            "output_lra": 19.1,
            "output_thresh": -39.42,
            "normalization_type": "dynamic",
            "target_offset": 0.88
        },
        "mean": null,
        "max": null

"stream_id": 1,
        "ebu": {
            "input_i": -27.07,
            "input_tp": -2.3,
            "input_lra": 19.1,
            "input_thresh": -38.6,
            "output_i": -28.12,
            "output_tp": -2.75,
            "output_lra": 17.4,
            "output_thresh": -39.38,
            "normalization_type": "dynamic",
            "target_offset": 1.12
        },
        "mean": null,
        "max": null

"stream_id": 1,
        "ebu": {
            "input_i": -27.05,
            "input_tp": -2.29,
            "input_lra": 17.3,
            "input_thresh": -38.31,
            "output_i": -27.56,
            "output_tp": -2.3,
            "output_lra": 16.3,
            "output_thresh": -38.67,
            "normalization_type": "dynamic",
            "target_offset": 0.56
        },
        "mean": null,
        "max": null

^^dynamic mode kicks in on first 3 runs only.

Example 2 (already within LU limiter):

"stream_id": 1,
        "ebu": {
            "input_i": -23.44,
            "input_tp": 1.77,
            "input_lra": 17.1,
            "input_thresh": -34.99,
            "output_i": -26.33,
            "output_tp": -2.3,
            "output_lra": 16.4,
            "output_thresh": -37.8,
            "normalization_type": "dynamic",
            "target_offset": -0.67
        },
        "mean": null,
        "max": null

"stream_id": 1,
        "ebu": {
            "input_i": -27.0,
            "input_tp": -2.3,
            "input_lra": 16.4,
            "input_thresh": -38.47,
            "output_i": -27.03,
            "output_tp": -2.3,
            "output_lra": 15.7,
            "output_thresh": -38.39,
            "normalization_type": "dynamic",
            "target_offset": 0.03
        },
        "mean": null,
        "max": null

"stream_id": 1,
        "ebu": {
            "input_i": -27.0,
            "input_tp": -2.3,
            "input_lra": 16.4,
            "input_thresh": -38.47,
            "output_i": -27.03,
            "output_tp": -2.3,
            "output_lra": 15.7,
            "output_thresh": -38.39,
            "normalization_type": "dynamic",
            "target_offset": 0.03
        },
        "mean": null,
        "max": null

"stream_id": 1,
        "ebu": {
            "input_i": -27.0,
            "input_tp": -2.3,
            "input_lra": 16.4,
            "input_thresh": -38.47,
            "output_i": -27.03,
            "output_tp": -2.3,
            "output_lra": 15.7,
            "output_thresh": -38.39,
            "normalization_type": "dynamic",
            "target_offset": 0.03
        },
        "mean": null,
        "max": null

^^dynamic mode NEVER kicks in.

Very interesting! It looks like re-runs can eventually settle into a 'final' output?

The jury is still out though on what this does to perceived audio quality. I will continue to test.

Further Testing Strategy:

Let's maybe go a little crazy and raise the repeats from FOUR to EIGHT, set the LU limiter to 20, and see if this strategy allows audio sources that are above 20 LU to 'settle around the 18-19 output_lra mark'.

Audio sources already within 20 LU will just have to deal with being re-run EIGHT times. I'm sure this will initially bring the LU down, but only slightly, on the first couple of runs:

ffmpeg-normalize -nt ebu -t -27 -lrt 20.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 20.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 20.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 20.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 20.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 20.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 20.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output ;
ffmpeg-normalize -nt ebu -t -27 -lrt 20.0 -tp -2.3 -p -v -f $vfo_output -c:a ac3 -b:a 640k -o $vfo_output
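
Instead of hard-coding eight chained runs, an untested Python sketch like the following could re-run until the reported output_lra stops moving. It assumes -p prints a JSON array of per-stream stats (like the snippets above) to stdout, and the file name is just a placeholder for $vfo_output; adjust it to whatever your version actually emits:

import json
import subprocess

# Untested sketch: re-run ffmpeg-normalize on the same file until the reported
# output_lra stops changing, instead of chaining a fixed number of runs.
vfo_output = "input.mkv"  # placeholder for $vfo_output
cmd = [
    "ffmpeg-normalize", "-nt", "ebu", "-t", "-27", "-lrt", "20.0",
    "-tp", "-2.3", "-p", "-f", vfo_output,
    "-c:a", "ac3", "-b:a", "640k", "-o", vfo_output,
]

previous_lra = None
for run in range(8):  # same upper bound as the chained version above
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    stats = json.loads(result.stdout)        # assumes -p emits JSON on stdout
    lra = stats[0]["ebu"]["output_lra"]      # stats of the first audio stream
    print(f"run {run + 1}: output_lra = {lra} LU")
    if previous_lra is not None and abs(lra - previous_lra) < 0.5:
        break  # converged: another run barely changes the loudness range
    previous_lra = lra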

slhck commented 1 year ago

To be honest, I have never heard of anyone running this more than once. Since the link you shared specifies a loudness target with an allowed deviation of ± 2 LU, I don't believe that striving for an exact target is necessary. I don't have the data to back this up, but I think the just-noticeable difference in human hearing is around 1 dB — it depends on the context anyway, and you'll likely not compare two files directly.

slhck commented 1 year ago

See also this very interesting paper called “Loudness normalisation: paradigm shift or placebo for the use of hyper-compression in pop music?”, which references a few studies:

Another experiment conducted by Benjamin in a domestic listening environment sought to gauge the range of levels around the preferred listening level that are accepted as matching the preferred listening level. These domestic listening environments had little background noise and Benjamin found that a +2.91/-3.78 dB level change was enough to prompt listeners to describe the level as noticeably louder/quieter while +6.22/-9.22 dB results in the level being perceived as too loud or too quiet. However, Norcross et al. found that listeners were much more sensitive to level changes with subjects on average detecting JNDs of 1.24 dB between different programs and JNDs of 0.5 dB in the same program. Given that loudness normalisation in broadcast is allowed to deviate +/- 2 LU in America and +/- 1 LU in Europe, it can be surmised that under a loudness normalisation paradigm, listeners will be using their level control primarily to set their preferred listening level.

CraigWatt commented 1 year ago

See also this very interesting paper called “Loudness normalisation: paradigm shift or placebo for the use of hyper-compression in pop music?”, which references a few studies:

This is a fascinating read for sure. Honestly, given the subtle differences in sound I've been hearing, I think DRC (Dynamic Range Control) might be just as important for the end user's circumstances as volume, audio bitrate, audio codec, etc.

I believe Dolby has had some form of this in the past, with selections like RF and Line compression.

BUT I think industry leaders are moving in this direction anyway (where hardware allows), such as Netflix with MPEG-D DRC and xHE-AAC: https://netflixtechblog.com/optimizing-the-aural-experience-on-android-devices-with-xhe-aac-c27714292a33


I don't believe that striving for an exact target is necessary.

I agree, this is very likely overkill/diminishing-returns territory. To actually get some form of answer, I would probably need to post each run's audio and run tests to see whether I (or anyone) can perceive a 'preferred' dynamic range, and that would all depend on the source material and the sound hardware/environment.

slhck commented 1 year ago

BUT I think industry leaders are moving in this direction anyway (where hardware allows), such as Netflix with MPEG-D DRC and xHE-AAC: https://netflixtechblog.com/optimizing-the-aural-experience-on-android-devices-with-xhe-aac-c27714292a33

Interesting article — I didn't know that this was all now done on a decoding level. Thanks for sharing!

CraigWatt commented 1 year ago

BUT I think industry leaders are moving in this direction anyway (where hardware allows), such as Netflix with MPEG-D DRC and xHE-AAC: https://netflixtechblog.com/optimizing-the-aural-experience-on-android-devices-with-xhe-aac-c27714292a33

Interesting article — I didn't know that this was all now done on a decoding level. Thanks for sharing!

For sure!

we use anchor-based (dialogue) measurement, as recommended in A/85. The measured dialog level is delivered in MPEG-D DRC metadata in the xHE-AAC bitstream, using the anchorLoudness metadata set.

For me, I've been looking for a way to do dialogue-anchored loudness (anchorLoudness), but I still need to keep digging and researching. Perhaps it's currently proprietary.

CraigWatt commented 1 year ago

Perhaps final point on this:

I'm looking into 'Loudness Range of Dialogue'.

https://www.pro-tools-expert.com/home-page/2018/8/9/loudness-and-dialog-intelligibility-in-tv-mixes-are-tv-mixes-becoming-to-cinematic?utm_campaign=postfity&utm_content=postfity57fa4&utm_medium=social&utm_source=twitter

https://auphonic.com/blog/2020/10/08/dialog-loudness-normalization/

R 128, standardizing loudness normalization to -23 LUFS program loudness, helped in making programs more evened out overall. However, a combination of loud music or sound effects and quiet speech can still lead to a production that conforms with the standard.

Perhaps what I was trying to do with multiple runs was to brute-force R 128 to see if it can achieve a dialogue boost with its dynamic mode? Maybe I don't understand the difference between dynamic and non-dynamic mode as far as what ffmpeg-normalize is doing exactly.

'Loudness Range of Dialogue' seems very important moving forward (from a Netflix POV; I don't want to sideline R 128!). I will continue to research whether this is something achievable.

See also: https://www.pro-tools-expert.com/home-page/2018/8/23/has-netflix-turned-the-clock-back-10-years-or-is-their-new-loudness-delivery-spec-a-stroke-of-genius

See also: https://www.pro-tools-expert.com/production-expert-1/how-to-optimise-an-audio-mix-for-delivery-to-netflix

Maybe speechnorm can play a role here? I wonder what is being done under the hood in meters such as 'Youlean Loudness Meter' for speech monitoring, etc.

It looks like it all really comes back to "Dolby Dialog Intelligence" being the algorithm that can 'detect speech'.

CraigWatt commented 1 year ago

I think there might be a solution! I'll stop rambling now. Thanks for your help!

slhck commented 1 year ago

Sure! You're welcome. If you have anything interesting to add to our wiki, please do so!