Add dataset for audio tagging

This PR adds a dataset for audio tagging. It can be used to train an audio tagging model that predicts the sound events present in an audio clip.

It requires a new custom field named audio_event in the supervision of each cut. An example cut looks like this:

{"id": "balanced/-1TLtjPtnms_10.000.wav", "start": 0.0, "duration": 10.0, "channel": 0, "supervisions": [{"id": "balanced/-1TLtjPtnms_10.000.wav", "recording_id": "balanced/-1TLtjPtnms_10.000.wav", "start": 0.0, "duration": 10.0, "channel": 0, "custom": {"audio_event": "220;137;519"}}], "features": {"type": "kaldi-fbank", "num_frames": 1000, "num_features": 80, "frame_shift": 0.01, "sampling_rate": 16000, "start": 0.0, "duration": 10.0, "storage_type": "lilcom_chunky", "storage_path": "data/fbank_audioset/balanced_balanced_feats/feats-0.lca", "storage_key": "77756,38006,38749", "channels": 0}, "recording": {"id": "balanced/-1TLtjPtnms_10.000.wav", "sources": [{"type": "file", "channels": [0], "source": "downloads/audioset/balanced/-1TLtjPtnms_10.000.wav"}], "sampling_rate": 16000, "num_samples": 160000, "duration": 10.0, "channel_ids": [0]}, "type": "MonoCut"}

lhotse-speech / lhotse

Add dataset for audio tagging #1241