slhck / ffmpeg-normalize

Audio Normalization for Python/ffmpeg
MIT License

Wrong volume after silence at start of track? #146

Open · JoselleAstrid opened 3 years ago

JoselleAstrid commented 3 years ago

I've been using ffmpeg-normalize (EBU R128 method) to normalize the audio of gameplay recordings. Typically the recordings have a peak and LUFS significantly lower than the target volume, and I use ffmpeg-normalize to boost the volume. Sometimes there's silence in the audio, like when the game is loading or paused.

When there are at least 2-3 seconds of silence at the beginning of the audio track, the result I get with ffmpeg-normalize has a lower-than-expected volume right after the silence, and then the volume gradually climbs toward the expected volume over a period of time.

Here's an example. Waveform of original recording:

original

Zooming in on the original recording, to confirm that the volume is reasonably steady:

original_zoomed-in

Normalization result, using ffmpeg-normalize.exe original.aac -nt ebu -t -14 -c:a aac -o normalized.aac. It takes roughly 90 seconds to climb to the volume I'd expect from normalization:

normalized

If I trim most of the silence off the start, and then normalize, the volume seems to be fine throughout the track. Using ffmpeg -ss 11 -i original.aac -copyts trim_11.aac and ffmpeg-normalize.exe trim_11.aac -nt ebu -t -14 -c:a aac -o trim_11_normalized.aac:

trimmed_normalized

Windows 10, Python 3.8, ffmpeg 4.3.2. I'm happy to provide audio uploads, stats, more details/examples, etc. but I thought I'd check first - am I missing something obvious? Is this expected behavior, or am I missing a tuning parameter that would help?

slhck commented 3 years ago

Based on the waveform screenshot it seems that the file has a very low volume to begin with. Is that correct? Maybe this has something to do with the volume detection not working properly. I am not sure what the y axis in Audacity indicates. According to that, normalization results in ~50x amplification of the level.

As a quick check, does the same issue occur if you first peak-normalize the file, via -nt peak, and then run it through -nt ebu?
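
That is, something along these lines, reusing your file name from above (the intermediate and output names are just placeholders for this sketch):

```
# Step 1: peak-normalize, e.g. to -2 dBTP
ffmpeg-normalize original.aac -nt peak -t -2 -c:a aac -o peak.aac

# Step 2: EBU R128-normalize the peak-normalized file
ffmpeg-normalize peak.aac -nt ebu -t -14 -c:a aac -o peak_ebu.aac
```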

slhck commented 3 years ago

PS: It may be a bug or special case not handled by the loudnorm ffmpeg filter. This Python project is a somewhat advanced wrapper around the filter. So there isn't much I can do immediately.

JoselleAstrid commented 3 years ago

Based on the waveform screenshot it seems that the file has a very low volume to begin with. Is that correct?

Yes, ffmpeg-normalize with -n -p gives me input_i of -44.39, and input_tp of -29.11.

Maybe this has something to do with the volume detection not working properly. I am not sure what the y axis in Audacity indicates. According to that, normalization results in ~50x amplification of the level.

Ah, there's a way to display the Audacity waveform on a dB scale. The default view seems to be a linear amplitude scale. Here's my original example with the dB view (and also left channel only to save some screen space):

original_db-view

EBU normalized:

normalized_db-view

"The dark blue part of the waveform displays the tallest peak and the light blue part of the waveform displays the average RMS (Root Mean Square) value of the audio" (Source)

The volume doesn't have to be that low for this behavior to happen, though it is more noticeable the lower the volume is.

As a quick check, does the same issue occur if you first peak-normalize the file, via -nt peak, and then run it through -nt ebu?

Peak normalized (-nt peak -t -2):

peak

Peak normalized and then EBU normalized (-nt ebu -t -14); it seems like there's still a steady volume increase until about 25 seconds:

peak-then-ebu

PS: It may be a bug or special case not handled by the loudnorm ffmpeg filter. This Python project is a somewhat advanced wrapper around the filter. So there isn't much I can do immediately.

Understood, and that's fine! It's good to just get a bit of insight on what might be happening. I already had a workaround in mind - detect how much silence there is at the start (with ffmpeg's silencedetect filter), trim off most of that silence, normalize, and then add the silence back.
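
Roughly, that workaround would look something like this; the noise threshold, minimum silence duration, and the 10-second trim point are only example values for this sketch:

```
# 1) Detect how much leading silence there is
ffmpeg -i original.aac -af silencedetect=noise=-50dB:d=2 -f null - 2>&1 | grep silence_

# 2) Trim off most of that silence (here: the first 10 seconds)
ffmpeg -ss 10 -i original.aac -c:a aac trimmed.aac

# 3) Normalize the trimmed file
ffmpeg-normalize trimmed.aac -nt ebu -t -14 -c:a aac -o trimmed_normalized.aac

# 4) Re-add the trimmed silence at the start (10 s = 10000 ms per channel)
ffmpeg -i trimmed_normalized.aac -af "adelay=10000|10000" -c:a aac final.aac
```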

slhck commented 3 years ago

Thanks for the quick feedback. Interesting that this still happens for a peak-normalized input. So it's rather the property of that particular file with the combination of silence in the beginning that causes the error. Could the case be made that the faint increase in RMS of the first few seconds of signal gets amplified by loudnorm?

Anyway, good that truncating the silence works.

One could think about adding an option that performs auto-truncating of silence at the beginning, basically doing what you're doing, just automatically, but it's a bit of a convoluted solution for an edge case, and it might mess with music files that contain a bit of noise in the beginning, where re-adding pure silence is not an option, etc.

JoselleAstrid commented 3 years ago

Thanks for the quick feedback. Interesting that this still happens for a peak-normalized input. So it's rather the property of that particular file with the combination of silence in the beginning that causes the error. Could the case be made that the faint increase in RMS of the first few seconds of signal gets amplified by loudnorm?

Maybe, although here's an example where RMS doesn't particularly increase during the beginning of the signal:

ex-3_original

Peak normalized (-nt peak -t -2):

ex-3_peak

Peak normalized and then EBU normalized (-nt ebu -t -14); the part from 11-13 seconds doesn't seem as loud as it should be compared to the part after 48 seconds:

ex-3_peak-then-ebu

For the record, it can be easier to see the differences using the linear scale:

ex-3_original_linear

ex-3_peak_linear

ex-3_peak-then-ebu_linear

Anyway, good that truncating the silence works.

One could think about adding an option that performs auto-truncating of silence at the beginning, basically doing what you're doing, just automatically, but it's a bit of a convoluted solution for an edge case, and it might mess with music files that contain a bit of noise in the beginning, where re-adding pure silence is not an option, etc.

I agree, hopefully this just points to some part of the loudnorm implementation that can be fixed so that such a workaround isn't necessary.

ndmgrphc commented 2 years ago

I can confirm this. It's time for the world to accept that loudnorm's implementation of EBU R128 limiting is dangerously opinionated or just plain broken. I wouldn't let it near music, that's for sure. Checking files in Logic Pro with Fabfilter Pro-L2, it's two very different worlds. I'm really not sure what this slow ramp-up is, but it's happening on almost everything. Trying to figure out how to run Fab's VST2 in DPlug now. Again, this is no fault of ffmpeg-normalize and most certainly an upstream failure.

slhck commented 2 years ago

Thanks for your comment. I guess that it's the silence that's throwing it off, no? I see that there is a limiter gain being calculated on a running window of samples, which might explain why the limiter gain increases frame by frame as that window fills with samples of a certain volume. I think one might be able to debug the issue by printing some intermediate values in ffmpeg, but … I would have to find the time to do that. I've also tried contacting the original author of the filter, but I was unsuccessful.

PS: There is a silenceremove filter that we could leverage for this, but since it makes audio and video go out of sync, it'd only be useful for audio-only tracks.
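
For an audio-only file it could be as simple as something like this before normalization (the threshold is just an example value):

```
# Drop leading silence below -50 dB, then normalize the result
ffmpeg -i input.aac -af "silenceremove=start_periods=1:start_threshold=-50dB" -c:a aac trimmed.aac
ffmpeg-normalize trimmed.aac -nt ebu -t -14 -c:a aac -o normalized.aac
```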

ndmgrphc commented 2 years ago

I don't think it's the silence. As long as we can agree Fabfilter Pro-L2 is the gold standard for mastering and loudness, at least. With loudnorm I tried almost every variation with target level and LRA target (which seemed to have no effect). Fab was effortless as always. I noticed pumping on a lot of loudnorm files and decided to investigate. Here's Kenny Rogers' The Gambler:

1) Source:

image

2) Loudnorm via ffmpeg-normalize (--loudness-range-target=12.0 -ar 96000 --target-level=-18), output is indeed technically -18 LUFS even according to analysis by Pro-L2.

image

3) Fab Pro-L2, default setting, simple +3 gain to get us to -18 LUFS integrated (no detectable artifacts, no weird ramp at the beginning)

image

It's so strange. I can't repeat the loudnorm behavior in Fab.

To be fair, Fabfilter is amazing. I wasn't expecting that level of result but I also wasn't expecting this weird behavior.

slhck commented 2 years ago

You're right, this looks odd. Sorry there isn't more that I can do …

ndmgrphc commented 2 years ago

No worries and of course thanks again for your work with this. I might conduct some experiments with dynaudnorm as well just out of curiosity.

Edit: Do not use dynaudnorm for music.

JoselleAstrid commented 2 years ago

Interesting to see that it doesn't need silence at the start to happen.

ndmgrphc commented 2 years ago

I had pretty good results with alimiter just now, and I'm not sure how flawed this approach might be. I'm still using the parsed results of ffmpeg-normalize. Maybe this is worth exploring further. I need to run this test on a dozen more tracks with varying dynamics. Given the dBFS scale, though, getting into the LUFS ballpark and then applying a conventional limiter isn't going to be too far off, right? The only limitation I can see with alimiter is its attack maximum of 80 ms.

Original analysis of master file:

Input Integrated:    -20.4 LUFS
Input True Peak:      -1.1 dBTP
Input LRA:             9.8 LU
Input Threshold:     -31.0 LUFS

I calculated my desired increase in gain which was almost identical to what Fabfilter Pro-L2 required from the same input file. Edit: leave true peak out of this. Ignore this: (I added my true peak and integrated to get the difference to my desired LUFS (-18). -21.5 - -18 = -3.5?)

ffmpeg -i gambler_master.wav -ar 96000 -c:a pcm_s32le -filter_complex "aresample=192000,alimiter=level=false:level_in=2.45dB:limit=-0.1dB:attack=80:release=400" gambler_alimiter3.5-0.wav

Fabfilter Pro-L2 reported LUFS:

image

The difference between Fab and alimiter is inaudible on this ESS 9038 DAC and some decent headphones. I synced the two files down to the sample level and used the alt+solo method in Logic Pro to A/B them. I even reversed phase on A and summed with B. It wasn't silence, but there wasn't much left.

"alimiter" filter command from above applying gain to come close to desired LUFS but with no weird dynamics adjustments

image

I'm sure that if pushed harder, alimiter would show its limitations, but for making minor shifts to perceived loudness it seems far better (and more musical) than loudnorm. I'm sure my logic is flawed, but I'd love to know your thoughts if you have any.

Edit:

I've tested some more files: Etta James and Wilco. Etta James' "Wallflower" was an interesting test. Ripped from a recent vinyl re-pressing, this track is a perfect case for LUFS analysis as the frequencies pile up in sensitive areas; the resulting input LUFS-I was -15.06 (2.94 above my -18 target). In this case, even simply reducing with the volume filter like so got Fabfilter Pro-L2 to register -17.9 LUFS.

ffmpeg -i etta_master.wav -ar 96000 -c:a pcm_s32le -filter_complex "volume=-2.94dB"  etta_master.volume.wav

The Wilco track was track 1 on Yankee Hotel Foxtrot. This came in at -18.54 LUFS-I. The following command again resulted in Fabfilter Pro-L2 reporting a LUFS-I of -17.9:

ffmpeg -i wilco_master.wav -ar 96000 -c:a pcm_s32le -filter_complex "aresample=192000,alimiter=level=false:level_in=.54dB:limit=-0.1dB:attack=80:release=400" wilco_master.alimiter.wav

I'm wondering if some sort of --musical flag/mode would be useful.

Using my simple alimiter (or volume filter) technique above, compare the two resulting waveforms if you ever needed a reason to stop listening with your eyes. LUFS is real. These sound identical in perceived volume, at -17.9 LUFS-I and -18 LUFS-I respectively:

image

JoselleAstrid commented 2 years ago

I don't know much about the audio theory, but I guess a normalizer's job is to get to the desired peak and LUFS without undesired distortions, so if alimiter gets you there and then some, that should be ideal? As long as you can consistently reach the desired peak and LUFS ranges without trial and error, that is, since (IMO) a big idea behind tools like ffmpeg-normalize is to automatically process files, perhaps in batch, without having to double-check and parameter-tweak each file individually.

If I'm reading correctly, the volume filter does RMS normalization, so it's 'simpler' than loudnorm - although that doesn't necessarily mean worse: https://trac.ffmpeg.org/wiki/AudioVolume (Again, if you're reaching your target loudness range, then it seems like you're good to go.)

I guess it kind of puts into perspective that there are a lot of ways to approach audio mastering - and loudnorm / ffmpeg-normalize is just one thing in the toolkit.

slhck commented 2 years ago

One could definitely also just do a first pass to get the statistics, then do a second pass with RMS normalization and an added limiter to prevent clipping. This might be another mode that could be implemented in ffmpeg-normalize. RMS normalization, however, won't affect the loudness range or dynamics of the file, so there's still a use case for proper normalization with loudness range targets.
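
As a rough sketch of that idea with plain ffmpeg (the +10 dB gain is a made-up value that would come out of the first pass, and the limiter ceiling is just an example):

```
# Pass 1: measure mean (RMS) and max volume
ffmpeg -i input.wav -af volumedetect -f null - 2>&1 | grep -E "mean_volume|max_volume"

# Pass 2: apply the gain needed to reach the RMS target, with alimiter catching any peaks
# (example: mean_volume of -30 dB, target of -20 dB -> +10 dB gain; limit 0.89 is roughly -1 dBFS)
ffmpeg -i input.wav -af "volume=10dB,alimiter=limit=0.89:level=false" output.wav
```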

I think the best way forward would actually be debugging the existing ffmpeg filter, logging the intermediate values of the function that gets called on every frame (perhaps adding a metadata injection), plotting them to see what causes the error.

richardpl commented 2 years ago

Where are the samples, so this can be fixed?

slhck commented 2 years ago

In https://github.com/slhck/ffmpeg-normalize/issues/146#issuecomment-974418292 the author mentioned a track that could possibly be obtained from, uhm, other sources. There are some waveforms included in the comment that are useful for diagnosis.

richardpl commented 2 years ago

I can just confirm that the current loudnorm implementation is not correct at all. The scanner part works well, but the limiter/compressor/expander are buggy and in the worst cases can produce clipped output. This is because it does not take into account new peaks in the attack & release stages of the limiter.

slhck commented 2 years ago

Thanks for looking into this! Do you think it would be a lot of effort to fix this? I'm afraid I don't know enough about the underlying processing to help with that. As far as I know the original author is no longer actively maintaining the code.

richardpl commented 2 years ago

There is also an issue with timestamp rewriting, which could cause problems for online processing when gaps are present in the timestamps, and with video too, causing loss of A/V sync.

It is not a lot of effort; it should just be a matter of rewriting some chunks of code. I'm currently looking at how best to do it.

richardpl commented 2 years ago

Well, for now you should use 2-pass loudnorm only. 1-pass messes with the dynamics too much, imho.

slhck commented 2 years ago

Were you able to improve the code with respect to some of the issues? That'd be great!

This tool actually only uses two-pass loudnorm, so that is not an issue.

richardpl commented 2 years ago

Two-pass loudnorm, if properly used (the report at the end of the summary is still linear and not dynamic), is just volume amplification with a single constant gain value for the whole audio. I got that Gambler audio file, the master mix in Ogg format, and I see nothing wrong with how 2-pass loudnorm processes the file. Perhaps the reporter really needs heavy dynamics processing, which may affect the loudness range (LRA), but that is something 2-pass loudnorm cannot and should not do at all.
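
For reference, a minimal sketch of two-pass loudnorm with plain ffmpeg; the measured_* values in the second pass come from the first pass's JSON output (the numbers below just reuse the stats posted earlier in this thread):

```
# Pass 1: analysis only, print the measured loudness values as JSON
ffmpeg -i input.wav -af loudnorm=I=-18:TP=-2:LRA=11:print_format=json -f null -

# Pass 2: feed the measured values back in so loudnorm can apply a single linear gain
ffmpeg -i input.wav -af loudnorm=I=-18:TP=-2:LRA=11:measured_I=-20.4:measured_TP=-1.1:measured_LRA=9.8:measured_thresh=-31.0:linear=true output.wav
```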