ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Frame based audio features #983

Closed nieag closed 3 weeks ago

nieag commented 4 years ago

Is your feature request related to a problem? Please describe. Currently the audio pre-processing feature is limited to fixed-length audio examples, which are either padded or cut to match the audio_file_length_limit_in_s argument. An alternative (or extension) would be to allow variable-length audio files, which at a later stage are cut into fixed-length "frames" that may or may not overlap one another. An example of this audio strategy can be found in the VGGish and YAMNet code bases.

Describe the use case Frame-based audio pre-processing would allow training and predicting with variable-length audio clips, making the resulting model more flexible in its use cases.

Describe the solution you'd like In addition to the audio_file_length_limit_in_s argument, add two arguments, frame_length and frame_overlap, which could be used together with or in place of audio_file_length_limit_in_s to train with variable-length input files.
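To make the proposal concrete, a Ludwig config with the proposed arguments might look like the sketch below. This is purely illustrative: frame_length and frame_overlap are the names suggested above, not parameters that exist in Ludwig today, and the surrounding keys are only assumed to resemble the current audio preprocessing config.

```python
# Hypothetical config sketch for the proposed framing arguments.
# frame_length / frame_overlap do NOT exist in Ludwig; they are the
# names proposed in this issue. Other keys are assumptions about the
# current audio preprocessing layout.
config = {
    "input_features": [
        {
            "name": "audio_path",
            "type": "audio",
            "preprocessing": {
                # proposed: cut each clip into overlapping fixed-length frames
                "frame_length": 0.96,   # seconds per frame
                "frame_overlap": 0.48,  # seconds of overlap between frames
            },
        }
    ],
    "output_features": [{"name": "label", "type": "category"}],
}
```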

I think it would make sense to bake this into the current audio_feature_mixin class as part of _get_2D_feature, after computation of any of the 2D feature representations. The output would then be a 3D array of shape [#frames, feature_h, frame_length]. Since this is already implemented as part of the VGGish feature extraction, it should be straightforward to lift over and adapt to the Ludwig code base.
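The framing step itself is small. A NumPy-only sketch of what the post-_get_2D_feature cut could look like (the function name and the interpretation of frame_length / frame_overlap as time-step counts are assumptions, not existing Ludwig code):

```python
import numpy as np

def frame_2d_feature(feature, frame_length, frame_overlap):
    """Cut a 2D feature of shape [feature_h, T] into overlapping frames.

    Returns a 3D array of shape [n_frames, feature_h, frame_length].
    frame_length and frame_overlap are measured in time steps (columns);
    trailing columns that do not fill a whole frame are dropped.
    """
    hop = frame_length - frame_overlap
    if hop <= 0:
        raise ValueError("frame_overlap must be smaller than frame_length")
    n_steps = feature.shape[1]
    n_frames = 1 + max(0, (n_steps - frame_length) // hop)
    return np.stack(
        [feature[:, i * hop : i * hop + frame_length] for i in range(n_frames)]
    )
```

Dropping the incomplete trailing frame mirrors the VGGish patch extraction; padding the last frame instead would be an equally valid choice.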

Some additional changes might be necessary to the prediction and validation steps (and possibly training) if one wants to compute metrics and losses for the entire 2D feature as a whole, rather than for each individual frame.
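For the clip-level metrics mentioned above, one common aggregation (similar in spirit to what the VGGish / YAMNet pipelines do at inference time, though the exact strategy here is an assumption) is to average per-frame logits before taking the clip prediction:

```python
import numpy as np

def clip_level_prediction(frame_logits):
    """Aggregate per-frame logits into one clip-level prediction.

    frame_logits: array of shape [n_frames, n_classes].
    Averages logits over the frame axis, then applies a softmax.
    Returns (predicted_class, clip_probs).
    """
    clip_logits = frame_logits.mean(axis=0)
    exp = np.exp(clip_logits - clip_logits.max())  # numerically stable softmax
    clip_probs = exp / exp.sum()
    return int(clip_probs.argmax()), clip_probs
```

Losses could likewise be computed against the averaged logits rather than per frame, which is the change to the training step hinted at above.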

Describe alternatives you've considered Currently, for this use case, I would skip the audio pre-processing in Ludwig entirely, compute the dataset features externally beforehand, and then approach the problem as a normal image classification problem.

Additional context

Would be happy to help out with adding this as a feature to the current audio preprocessing.

w4nderlust commented 4 years ago

@nieag I believe I understand the request. It's interesting and worth investigating. Do you have a reference for this that we can study? From the links you added, it looks like waveform_to_log_mel_spectrogram_patches is the function we would need to replicate, but I'd like to look at some papers / systems that use this approach as a reference, if you happen to know any. Thank you

nieag commented 4 years ago

Of course. The best examples of working systems are probably the research repos from Google's AudioSet; the papers hold some info regarding the audio framing, but I think the READMEs of the two repos are more informative about the audio pre-processing.

Here's another very recent paper that uses the same approach to handle varying-length input in the context of classifying coughs, a situation where a fixed-length input is not useful given that the length of a cough can vary quite a bit: Whosecough paper.

A similar framing function is also available in the popular audio library librosa: audio-framing.
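librosa.util.frame provides this windowing for waveforms (with its own axis conventions); the same effect can be sketched with plain NumPy, writing the hop as frame_length minus the proposed frame_overlap:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# A toy mono "waveform" of 10 samples.
y = np.arange(10.0)

frame_length = 4
frame_overlap = 2
hop_length = frame_length - frame_overlap

# All windows of length frame_length, then keep every hop_length-th one.
frames = sliding_window_view(y, frame_length)[::hop_length]
# Rows start at samples 0, 2, 4, 6.
```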

Happy to discuss further.