Open scarecrow1123 opened 5 years ago
Good point, I guess data sets are not as well handled so far as they could be, see also https://github.com/pytorch/audio/issues/116.
Just one comment on "standardized distributions": I would not include it in a way that only works with a special format and not also with something simple like a list of WAV files. It might be that other people want to use torchaudio with their own data sets.
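For example, a loader that works with nothing more than a list of WAV files could be as simple as the following sketch (`WavListDataset` is a made-up name for illustration, not an existing torchaudio class):

```python
import torchaudio
from torch.utils.data import Dataset

class WavListDataset(Dataset):
    """Minimal dataset over an arbitrary list of WAV paths (illustrative only)."""

    def __init__(self, wav_paths, transform=None):
        self.wav_paths = list(wav_paths)
        self.transform = transform  # e.g. a torchaudio transform

    def __len__(self):
        return len(self.wav_paths)

    def __getitem__(self, idx):
        # torchaudio.load returns (waveform, sample_rate)
        waveform, sample_rate = torchaudio.load(self.wav_paths[idx])
        if self.transform is not None:
            waveform = self.transform(waveform)
        return waveform, sample_rate
```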
Do you know of any tests proving the advantages of HDF5? I have been testing different ways to load big audio datasets (>3 TB of WAV files), including:

- `npy` files (after offline preprocessing, converting WAV --> numpy).
- `torch.tensor` files (after offline preprocessing, converting WAV --> `torch.tensor`).

So far, loading individual `npy` files is the fastest overall, especially when using mmaps and reading only a portion of the file. However, this behavior might depend heavily on the file system.
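The partial reads I mean look roughly like this (a sketch only; the `(path, start_frame)` item layout and `crop_len` are assumptions about how the index is organized):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MmapNpyDataset(Dataset):
    """Reads fixed-length crops from per-utterance .npy files via memory mapping.

    Assumes each item is (path, start_frame) and that every file has at least
    `crop_len` samples after the start frame.
    """

    def __init__(self, items, crop_len=16000):
        self.items = items
        self.crop_len = crop_len

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, start = self.items[idx]
        # mmap_mode='r' avoids reading the whole file; only the slice is touched.
        data = np.load(path, mmap_mode="r")
        crop = np.array(data[start:start + self.crop_len])  # copy out of the mmap
        return torch.from_numpy(crop)
```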
In torchaudio, the VCTK dataset reads the audio files and saves them as one big `torch.tensor`, but as far as I know this is not ideal when the files have different lengths. I have only tried saving individual files, and the performance is similar to loading `npy`.
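One way to keep individual variable-length files and still batch them, instead of storing one big tensor, is to pad at collate time. A rough sketch, assuming each item is `(waveform, sample_rate)` with the waveform shaped `(channels, time)`:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    """Pad a batch of variable-length waveforms to the longest one."""
    waveforms, sample_rates = zip(*batch)
    lengths = torch.tensor([w.shape[-1] for w in waveforms])
    # pad_sequence pads along the first dimension, so put time first, then restore.
    padded = pad_sequence([w.transpose(0, 1) for w in waveforms], batch_first=True)
    padded = padded.transpose(1, 2)  # (batch, channels, max_time)
    return padded, lengths, torch.tensor(sample_rates)

# usage: DataLoader(dataset, batch_size=8, collate_fn=pad_collate)
```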
Rather than a single standard way for datasets, I'd prefer to see some recommendations for different scenarios (e.g., small datasets, big datasets, short files, long files, different audio formats, etc.).
> So far, loading individual `npy` files is the fastest overall,

+1, also for me keeping individual files is the simplest.
Just a note in general: maintaining a dataset loader itself can require quite a lot of work - for reference, see the PRs of https://github.com/mir-dataset-loaders/mirdata. How about, instead, collecting implementations of customized dataset loaders that inherit from `Dataset` and `DataLoader`, maybe in an examples folder? It is a very tedious job to do.
Hello! I'm relatively new to audio processing and I recently started working on speech recognition. I was wondering if torchaudio or -contrib would be a good place to provide standardized distributions of the different publicly available audio/speech datasets. This proposal comes as a result of dealing with the many different formats/directory structures in the existing datasets and subsequently having to write a different data reader for each.
The first step in this standardization would be to define a format for the dataset distribution. After doing a bit of research, I reckon HDF5 would be a very good fit, as it can accommodate different data types, including metadata, in a single file. It also supports parallel writes, mmaps, etc., which only makes it a better choice. Defining an HDF5 hierarchy, writing converters, converting the existing audio datasets into this single standard format, and redistributing them would be a great step forward.
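As a rough illustration of what such a converter could look like (the `utt_*` group layout and attribute names here are just placeholders, not a proposed standard):

```python
import h5py
import torchaudio

def pack_wavs_to_hdf5(wav_paths, out_path="dataset.h5"):
    """Pack a list of WAV files into one HDF5 file: one group per utterance,
    waveform stored as a dataset, metadata stored as HDF5 attributes."""
    with h5py.File(out_path, "w") as f:
        for i, path in enumerate(wav_paths):
            waveform, sample_rate = torchaudio.load(path)  # (channels, time), int
            grp = f.create_group(f"utt_{i:08d}")
            grp.create_dataset("waveform", data=waveform.numpy(), compression="gzip")
            grp.attrs["sample_rate"] = sample_rate
            grp.attrs["source_file"] = path
```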
Next would be to provide torch dataloaders for these datasets. I reckon the current torchaudio release includes the VCTK and YESNO datasets, which could be extended further. I'm not entirely sure whether something like this would be a good fit for torchaudio, but it would be great to get some comments.
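For concreteness, a dataset reading back the hypothetical HDF5 layout sketched above might look like this (again just an illustration, not a proposed torchaudio API):

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class Hdf5AudioDataset(Dataset):
    """Reads utterances from the hypothetical one-group-per-utterance layout."""

    def __init__(self, h5_path):
        self.h5_path = h5_path
        with h5py.File(h5_path, "r") as f:
            self.keys = sorted(f.keys())
        self._file = None  # opened lazily so each DataLoader worker gets its own handle

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        grp = self._file[self.keys[idx]]
        waveform = torch.from_numpy(grp["waveform"][()])
        return waveform, int(grp.attrs["sample_rate"])

# loader = DataLoader(Hdf5AudioDataset("dataset.h5"), batch_size=1)
```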