tyiannak / pyAudioAnalysis

Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications
Apache License 2.0

How does one go about normalizing '.m4a' audio files? 16-bit PCM '.wav' files are normalized by dividing by 2^16. #307

Open abishek1062 opened 3 years ago

abishek1062 commented 3 years ago

I am sorry if this is the wrong place to address this kind of question. However, can someone please help me out with this issue?

Tronic commented 3 years ago

The title is wrong: you need to divide 16-bit PCM by 32767, not by 2^16. M4A audio is AAC-compressed, so you'll need to decode it first, and then, depending on the decoder, it may need to be normalised (some implementations might already give you floats in the -1.0 to 1.0 range). I don't know if pyAudioAnalysis offers any such decoding and, if so, how the result would need to be normalized.

abishek1062 commented 3 years ago

@Tronic thank you for taking the time to answer my question! :+1: I meant to write (2^16-1) instead of 2^16. My bad!

According to https://medium.com/behavioral-signals-ai/basic-audio-handling-d4cc9c70d64d, normalizing a 16-bit PCM file is done by simply dividing by 2¹⁵. This is because we know that the sample resolution is 16 bits per sample.

I decoded the .m4a file to .wav with this command: avconv -i input.m4a output.wav

The resulting .wav was 16-bit PCM encoded.
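One way to confirm that a decoded file really is 16-bit PCM is to inspect its header. The following is a minimal sketch using only Python's standard `wave` module; the helper names `describe_wav` and `is_16bit_pcm` are made up for illustration and are not part of pyAudioAnalysis.

```python
import wave


def describe_wav(path):
    """Return (channels, bytes_per_sample, sample_rate, frame_count) of a WAV file."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getnchannels(),
            wf.getsampwidth(),
            wf.getframerate(),
            wf.getnframes(),
        )


def is_16bit_pcm(path):
    """True if the file stores 2 bytes (16 bits) per sample."""
    # The stdlib wave module is meant for uncompressed PCM WAV data,
    # so a sample width of 2 bytes corresponds to 16-bit PCM.
    return describe_wav(path)[1] == 2
```

Calling `is_16bit_pcm("output.wav")` on the file produced by the command above should then return `True` if the decoder wrote 16-bit samples.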

Anybody, please correct me if I am wrong.

Cheers!

Tronic commented 3 years ago

> I meant to write (2^16-1) instead of 2^16

You want 2**15 - 1 (not 16) because a signed 16-bit integer ranges from -32768 to 32767, and the most negative value is usually not used in audio.

If your M4A is correctly decoded into a 16-bit PCM WAV, you need to read that into Python as a NumPy array, and then do something like w.astype(np.float32) / 32767.0, i.e. first convert to float, then divide.
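The read-then-normalize step could be sketched as follows, assuming NumPy is installed and using the standard `wave` module to read the file. The function names `pcm16_to_float` and `read_wav_as_float` and the `scale` parameter are illustrative choices, not an existing API.

```python
import wave

import numpy as np


def pcm16_to_float(samples, scale=32767.0):
    """Convert int16 samples to float32: cast to float first, then divide."""
    return np.asarray(samples, dtype=np.int16).astype(np.float32) / scale


def read_wav_as_float(path, scale=32767.0):
    """Read a 16-bit PCM WAV file into a float32 array of shape (frames, channels)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    # 16-bit WAV samples are little-endian signed integers, interleaved by channel.
    ints = np.frombuffer(raw, dtype="<i2").reshape(-1, channels)
    return pcm16_to_float(ints, scale)
```

With the default `scale` of 32767.0, the sample value 32767 maps to exactly 1.0, while -32768 (if present) maps to slightly below -1.0.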

abishek1062 commented 3 years ago

What you are saying is correct. However, if one wants to normalize a series of values to lie between -1 and 1, dividing the samples by the maximum possible value is one way to ensure this. Again, there are 16 bits per sample, i.e. we have 16 bits' worth of resolution to represent the acoustic signal. Hence, the maximum possible value would be 2^(16 - 1). I think this is different from a signed 16-bit integer.

The logic to normalize the samples is something I agree with :+1:

I am human and I could be wrong :)

slowglow commented 3 years ago

Actually, both normalizations 2**15 (0x8000) and 2**15-1 (0x7fff) are commonly used. See this very appropriately titled blog post:
Int->Float->Int: It's a jungle out there!
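To make the difference between the two conventions concrete, here is a small sketch in plain Python that computes the float range each divisor produces for the full int16 range. The helper name `float_range` is hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def float_range(divisor):
    """Return (lowest, highest) float obtained by dividing the int16 extremes."""
    return INT16_MIN / divisor, INT16_MAX / divisor


# Dividing by 2**15 (0x8000): -32768 maps to exactly -1.0,
# but +32767 stays just below +1.0.
lo_8000, hi_8000 = float_range(32768.0)

# Dividing by 2**15 - 1 (0x7fff): +32767 maps to exactly +1.0,
# but -32768 lands slightly below -1.0.
lo_7fff, hi_7fff = float_range(32767.0)
```

Neither divisor maps both extremes onto exactly -1.0 and +1.0, which is one reason both conventions are in use.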

Tronic commented 3 years ago

Adding to that mess, the values should be either mathematically rounded or floored towards negative infinity. Integer arithmetic truncates towards zero, which is incorrect because then integer value 0 covers twice as much range as any other sample value (ADCs assign an equal range to each value). Furthermore, high-quality audio processors tend to add dithering noise right before the conversion (random numbers just shy of one sample value step).
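The point about truncation giving the value 0 a double-width range can be illustrated with a small sketch in plain Python. It feeds evenly spaced inputs through three quantizers and counts how many land on each integer; the names `quantize` and `bin_counts` and the chosen grid are assumptions made for this example.

```python
import math


def quantize(x, mode):
    """Map a float to an integer using the given rounding mode."""
    if mode == "truncate":
        return int(x)                 # truncates towards zero
    if mode == "floor":
        return math.floor(x)          # towards negative infinity
    if mode == "round":
        return math.floor(x + 0.5)    # round half up
    raise ValueError(f"unknown mode: {mode}")


def bin_counts(mode, steps_per_unit=4, span=3):
    """Count how many evenly spaced inputs in (-span, span) map to each integer."""
    counts = {}
    n = span * steps_per_unit
    for i in range(-n, n):
        x = (i + 0.5) / steps_per_unit  # offset by half a step to avoid bin edges
        q = quantize(x, mode)
        counts[q] = counts.get(q, 0) + 1
    return counts


truncated = bin_counts("truncate")
floored = bin_counts("floor")
rounded = bin_counts("round")
```

With truncation, twice as many inputs map to 0 as to any neighbouring integer, whereas flooring and rounding give 0 the same share as its neighbours (the outermost bins of the rounded case are only partly covered by this finite grid).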

Conversions both ways need to be designed so that the int->float->int chain keeps the exact original sample values even with such noise being added in between. Unfortunately there is no universal agreement on how precisely this should be done. Fortunately, any differences are very much inaudible and only concern the strict requirement of lossless transfer.
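As a rough check of the int->float->int requirement, the sketch below converts every int16 value to float and back using round-half-up and verifies that the original value is recovered. This is only one possible design: the function names are hypothetical, dithering is deliberately left out, and the clipping step is an added assumption to keep results inside the int16 range.

```python
import math


def int16_to_float(v, scale=32768.0):
    """Convert one int16 sample to a float by dividing by `scale`."""
    return v / scale


def float_to_int16(x, scale=32768.0):
    """Convert a float back to int16 using round-half-up, then clip."""
    q = math.floor(x * scale + 0.5)
    return max(-32768, min(32767, q))


def roundtrip_is_lossless(scale):
    """True if every int16 value survives int -> float -> int unchanged."""
    return all(
        float_to_int16(int16_to_float(v, scale), scale) == v
        for v in range(-32768, 32768)
    )
```

Both `roundtrip_is_lossless(32768.0)` and `roundtrip_is_lossless(32767.0)` can be evaluated to compare the two scaling conventions, provided the same scale is used in both directions.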

Tronic commented 3 years ago

For reference, https://colab.research.google.com/drive/1ux8HvZrH0KJX5j4H5s_QSPvQl0jk7x1p