mlcommons / peoples-speech

The People’s Speech Dataset
https://mlcommons.org/en/peoples-speech/
Apache License 2.0

Run AudioSet pre-trained YAMNet model on the entire dataset #28

Open galv opened 3 years ago

galv commented 3 years ago

I would like to run that model on all of our audio files.

Greg made an initial attempt here: https://github.com/greg-landing/yamnet

I was told that there were a few problems:

1) Data loading. The data is currently loaded via a TensorFlow "py_func": https://github.com/greg-landing/yamnet/blob/fea79dbdbabf9aca6e1084a3f2c4d3077407f41e/inference.py#L71 This means that we cannot parallelize decoding of mp3 files across multiple threads. The straightforward fix is to add a tfio dependency and use https://www.tensorflow.org/io/api_docs/python/tfio/audio/decode_mp3

2) The model itself may have been limited to a batch size of 1, which makes increasing throughput hard, though I'm not sure I have understood this part correctly from what others have said.

If we can get high enough throughput on a single node, that is preferable (IIRC, the YAMNet architecture is specifically designed for efficient inference on resource-constrained hardware). However, if I take this on, I may go ahead and run it myself via Spark. That way I can learn how to load binary data like MP3 files directly from Google Cloud Storage without depending on gcsfuse, which proved buggy and prone to stalling the first time we built the dataset (among other undesirable issues). I can also learn how to use GPUs effectively with Spark, if it turns out that T4 GPUs are necessary to achieve reasonable throughput (ideally I would like to analyze the full dataset in a few hours).
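To make the "few hours" target concrete, a back-of-the-envelope helper like the one below could be used to work out how much faster than real time each worker has to run. The function name and the numbers in the usage note are illustrative assumptions, not measurements from this project.

```python
def required_realtime_factor(total_audio_hours, target_wall_hours, num_workers=1):
    """Return how many times faster than real time each worker must run.

    All arguments are inputs the caller has to supply; nothing here is
    measured from the actual dataset or from YAMNet itself.
    """
    if total_audio_hours < 0:
        raise ValueError("total_audio_hours must be non-negative")
    if target_wall_hours <= 0:
        raise ValueError("target_wall_hours must be positive")
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Total work is split evenly across workers; each worker then has
    # target_wall_hours of wall-clock time to process its share.
    return total_audio_hours / (target_wall_hours * num_workers)
```

For example, if we assume 30,000 hours of audio, a 3 hour budget, and 8 workers, `required_realtime_factor(30000, 3, 8)` gives 1250.0, meaning each worker would need to process audio 1250 times faster than real time. Comparing that figure against a measured single-node rate would show whether one node is enough or whether the Spark and T4 route is needed.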