Handle audio FileObjects/FileSets in Croissant

marcenacp commented 1 year ago

Proposal:

We propose to handle audio features using https://schema.org/AudioObject.

Technical strategy:

Done: Check that https://schema.org/AudioObject has all the needed attributes. Decision: use additionalProperties if missing (see the discussion https://github.com/mlcommons/croissant/discussions/242).
Add an example toy dataset (used for fixtures in the integration tests). Make the mp3 file as small as possible so that it can be committed. The audio can be generated; for example, we generated 1-pixel images in pass-mini.
Add a schema.org constant for sc:AudioObject in _src/core/constants.py
Add a case to handle audio in Python in _src/operation_graph/operations/field.py. We have to choose a library to handle audio. We recommend choosing between librosa, sounddevice, and pydub. Before choosing, list the pros and cons of each library and post them here so that a maintainer can validate the choice.
Add unit tests when needed.
Update the Croissant standard in the paragraph Known supported data types:.

This can be split into several PRs.
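
For the toy fixture, a minimal sketch of generating a very small audio file with only the Python standard library could look like the following. This is an illustration, not part of the Croissant codebase: the function name make_tiny_wav and its default parameters are hypothetical, and it writes WAV rather than mp3 because the standard library has no mp3 encoder, so producing an mp3 would still need an external encoder.

```python
import io
import math
import struct
import wave


def make_tiny_wav(duration_s=0.01, sample_rate=8000, freq_hz=440.0):
    """Return the bytes of a very short mono 16-bit PCM WAV file.

    Hypothetical helper for illustration only.
    """
    n_frames = round(duration_s * sample_rate)
    # 16-bit signed samples of a sine tone at half of full scale.
    samples = [
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_frames)
    ]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{n_frames}h", *samples))
    return buffer.getvalue()
```

With the defaults above, the file holds 80 frames of audio, so it stays in the range of a few hundred bytes and is easy to commit.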

monke6942021 commented 1 year ago

I think we should look into adding support for https://schema.org/VideoObject and plain binary files at some point too.

fineguy commented 1 year ago

I had a look at some audio libraries, here are my thoughts. In short: I'm in favor of using librosa.

Libraries overview

Things in common:

pros:
- All use a permissive license (MIT, ISC, or BSD 3-Clause).
cons:
- All are still at a 0.* version.

librosa:

Uses soundfile or audioread.
pros:
- Has rich documentation with lots of examples.
- The underlying libraries (soundfile, audioread) support many audio formats.
cons:
- Doesn't support integer-valued samples. The rationale is that downstream analyses would implicitly convert to floating point anyway.

sounddevice:

Provides bindings for PortAudio. It's mainly focused on playing and recording audio.

pydub:

Uses ffmpeg or libav (an abandoned project) for file reading/writing.
pros:
- Supports practically all audio formats.
cons:
- Returns integer-valued samples, which might require an additional conversion to floating point.
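
To give an idea of what that additional conversion involves, here is a minimal sketch using only the Python standard library. It assumes the decoder hands back raw little-endian 16-bit PCM bytes; the function name pcm16_to_float is hypothetical, and other sample widths would need a different scale factor.

```python
import struct


def pcm16_to_float(raw_bytes):
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0).

    Hypothetical helper for illustration; assumes 16-bit signed samples.
    """
    n_samples = len(raw_bytes) // 2
    # "<" = little-endian, "h" = signed 16-bit integer.
    ints = struct.unpack(f"<{n_samples}h", raw_bytes[: 2 * n_samples])
    # Dividing by 2**15 maps the int16 range onto [-1.0, 1.0).
    return [value / 32768.0 for value in ints]
```

The conversion itself is small, so this con mostly comes down to an extra step and a choice of scaling convention rather than a real limitation.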

soundfile:

Uses libsndfile for file reading/writing.
pros:
- Supports many audio formats.
cons:
- The online documentation is slightly behind the actual repository. E.g. it lacks information about MP3 support.

audioread:

pros:
- Supports many backends for file reading.
cons:
- Doesn't support file writing.

Conclusion

It looks to me that librosa and pydub are the two most used Python libraries for audio processing. pydub was last released in 2021, while librosa has been steadily updated. Given that librosa also has better documentation, I'd recommend using it.

fineguy commented 1 year ago

I also had a look at the most popular audio datasets from Hugging Face and Papers With Code. They all use either FLAC or WAV audio formats. The only exception is Common Voice which uses MP3.

monke6942021 commented 11 months ago

Hey, I noticed that in #242, one of the attributes we look into is the bitrate. What do we do if a dataset has multiple mp3 files with different bitrates?

mlcommons / croissant

Handle audio FileObjects/FileSets in Croissant #240