32-bit float WAV file is silently converted to 32-bit signed integer.

slowglow commented 3 years ago

I am not sure whether this is an issue; I guess it depends on the internal workings of the library, so I'm asking. Is this intentional?

Basically, DSP on a PC is more conveniently done in floating point, because one need not worry about integers overflowing here and there. So I had my intermediate files (from some other analysis) prepared in Audacity's 32-bit floating-point format with a range of (-1.0, +1.0). After importing them with audioBasicIO.read_audio_file, I end up with an array of int32.

The actual reading of the audio file happens in read_audio_generic, in these lines:

audiofile = AudioSegment.from_file(input_file)
data = np.array([])
if audiofile.sample_width == 2:
    data = numpy.fromstring(audiofile._data, numpy.int16)
elif audiofile.sample_width == 4:
    data = numpy.fromstring(audiofile._data, numpy.int32)

I notice that the code never queries the actual format of the data in the audio file; it only looks at the sample width. Am I missing something?
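
To illustrate what such a format query could look like, here is a minimal sketch; it is not the library's code. It assumes a plain RIFF/WAVE file and reads the format tag from the 'fmt ' chunk using only the standard library. The helper names read_wav_format and dtype_name are hypothetical. The point is that sample width alone cannot tell float32 from int32, since both are 4 bytes wide.

```python
import io
import struct

WAVE_FORMAT_PCM = 0x0001         # integer samples
WAVE_FORMAT_IEEE_FLOAT = 0x0003  # floating-point samples
WAVE_FORMAT_EXTENSIBLE = 0xFFFE  # real format tag lives in the SubFormat GUID


def read_wav_format(stream):
    """Return (format_tag, bits_per_sample) from a binary RIFF/WAVE stream."""
    riff, _size, wave = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no 'fmt ' chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = stream.read(chunk_size)
            tag, _channels, _rate, _byte_rate, _align, bits = struct.unpack(
                "<HHIIHH", fmt[:16]
            )
            if tag == WAVE_FORMAT_EXTENSIBLE and len(fmt) >= 26:
                # The first two bytes of the SubFormat GUID hold the real tag.
                tag = struct.unpack("<H", fmt[24:26])[0]
            return tag, bits
        # Skip other chunks; chunks are padded to an even number of bytes.
        stream.seek(chunk_size + (chunk_size & 1), io.SEEK_CUR)


def dtype_name(tag, bits):
    """Map a format tag and bit depth to the matching NumPy dtype name."""
    if tag == WAVE_FORMAT_IEEE_FLOAT and bits in (32, 64):
        return "float%d" % bits
    if tag == WAVE_FORMAT_PCM and bits in (16, 32):
        return "int%d" % bits
    raise ValueError("unsupported format: tag=%#x, bits=%d" % (tag, bits))
```

With something like this, a call such as `with open(path, "rb") as f: name = dtype_name(*read_wav_format(f))` would let the reader pick float32 for a float file instead of assuming int32 for every 4-byte sample.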

Tronic commented 3 years ago

Despite following this library out of interest for a while, I have to say I have no idea of its design principles. That being said, I agree that all audio processing should be done exclusively in 32-bit float, nominally in the -1.0 to 1.0 range. Reasoning:

- No overflows (you can clip or compress to range right before outputting, e.g. to int16 or to an audio API that expects a strict -1.0 to 1.0 range)
- No error-prone integer division and multiplication required everywhere
- Levels do not depend on input signal precision
  - A lot of code is written for the 16-bit range and fails miserably when given 24-bit audio because the values are way out of range, e.g. in FFT spectrum plotting
- Actually very fast to compute; float arithmetic is well supported on all CPUs now (including MCUs and GPUs)
- Audio APIs on all OSes already do internal processing (such as mixing sound from multiple applications) in float32
- Completely lossless conversion of all 16-bit and 24-bit signed integer values

Only legacy audio APIs and PCM audio files still use integer formats. If you are designing anything, even low-level audio stuff like a kernel driver, do not support multiple sample formats like all the old ones do; just use floats and convert integers to/from float32 as close to the hardware as possible. The system load comes from frequent polling, not from the number format used, nor from floats using twice the memory/cache/bandwidth.

slowglow commented 3 years ago

Very good points! Thank you!

Now, about the design principles of the library, I don't know either, and the documentation is scarce. In addition, the recent refactoring of the code broke (at least for me) some old programs using the internals of the library. More importantly, the recent code changes are not reflected in the documentation (the wiki).

Fortunately, it is an open source project (many thanks to you, Theodoros!) and in an open discussion a lot of issues can be ironed out. (By the way, where would be the appropriate place for having a discussion?)

Now, I don't know if these classify as design principles, because they haven't been spelt out explicitly, but some of the points that I really like about the library are:

- It is an easy-to-use end-to-end library that gives even an inexperienced user the ability to accomplish key machine learning tasks. I would like it to be kept that way. True, this goes against the mantra of 'do one thing and do it right', but it saves a lot of time otherwise spent finding and mastering "the best tools". Oftentimes "good enough" is just good enough.
- I really like the task-modular approach. Keep the Feature extraction, Classification, Regression, Segmentation and Visualization separate. That way, if the users want, they can plug in their "cutting edge" technology module without much hassle and easily make their own flavor of the library. (That's what I am actually trying to do with the feature extraction module. The extracted features are not great for my application.)

What I would like to see:

User-transparent and "truthful" audio file handling, including keeping all DSP in floating point. My use case is amplitude-calibrated sound signal analysis, where the difference of sound pressure levels is a significant "feature". Careless "normalization" and casual "boosting" of the signal is a big "No-No". And so is the conversion of data types in the middle of the pipeline. So transparency is important. To amplify this point: librosa does some horrible things when importing audio. By default it resamples everything to 22050 Hz for the convenience of the developers, but it is very open about what it does and why, and it gives the users the option to opt out and import their data as they want. If they choose to do so, it is the users' responsibility to track the data format down the pipeline.

Indeed, if Theodoros can jump in and set out some design principles and contribution guidelines, it would be easier to grow a small base of regular contributors, I guess.

After getting side-tracked, I'm getting back to the original issue: if I change the import portion to import as float, can I expect the library to work? Or will it break because all subsequent handling expects integers?

tyiannak / pyAudioAnalysis

32-bit float WAV file is silently converted to 32-bit signed integer. #328