lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Add dataset for audio tagging #1241

Closed marcoyang1998 closed 3 months ago

marcoyang1998 commented 7 months ago

This PR adds a dataset for audio tagging. It can be used to train an audio tagging model that predicts the sound events in an audio clip.

It requires a new custom field named audio_event in the supervision of each cut. An example cut looks like this:

{"id": "balanced/-1TLtjPtnms_10.000.wav", "start": 0.0, "duration": 10.0, "channel": 0, "supervisions": [{"id": "balanced/-1TLtjPtnms_10.000.wav", "recording_id": "balanced/-1TLtjPtnms_10.000.wav", "start": 0.0, "duration": 10.0, "channel": 0, "custom": {"audio_event": "220;137;519"}}], "features": {"type": "kaldi-fbank", "num_frames": 1000, "num_features": 80, "frame_shift": 0.01, "sampling_rate": 16000, "start": 0.0, "duration": 10.0, "storage_type": "lilcom_chunky", "storage_path": "data/fbank_audioset/balanced_balanced_feats/feats-0.lca", "storage_key": "77756,38006,38749", "channels": 0}, "recording": {"id": "balanced/-1TLtjPtnms_10.000.wav", "sources": [{"type": "file", "channels": [0], "source": "downloads/audioset/balanced/-1TLtjPtnms_10.000.wav"}], "sampling_rate": 16000, "num_samples": 160000, "duration": 10.0, "channel_ids": [0]}, "type": "MonoCut"}
marcoyang1998 commented 4 months ago

Added the unit test.