pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

LightningDataModules instead of yaml files #452

Closed by mogwai 3 years ago

mogwai commented 4 years ago

Currently the setup is very powerful for quickly configuring datasets, modules and pipelines for speaker diarization.

During the refactor to pytorch-lightning #227 #407, it might be a good idea to take full advantage of the library by using LightningDataModules for datasets that can be used with pyannote.

Following from the points made in #425, it might be better to promote a Python-based approach to these things:
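For illustration, a Python-based setup could follow the standard LightningDataModule contract (`prepare_data()` / `setup()` / `train_dataloader()`). The sketch below is a hypothetical, stdlib-only mock of that interface so it stays self-contained; the class name and data directory are assumptions, and a real version would subclass `pytorch_lightning.LightningDataModule` and return torch `DataLoader`s:

```python
from pathlib import Path


class AMIDataModule:
    """Hypothetical sketch following the LightningDataModule contract."""

    def __init__(self, data_dir: str = "~/.pyannote/data"):
        self.data_dir = Path(data_dir).expanduser()

    def prepare_data(self):
        # Download-once logic lives here (Lightning calls this on a
        # single process). Here we only create the target directory.
        self.data_dir.mkdir(parents=True, exist_ok=True)

    def setup(self, stage=None):
        # Build file lists / splits. A real version would also define
        # chunking and targets here.
        self.train_files = sorted(self.data_dir.glob("*.wav"))

    def train_dataloader(self):
        # A real version would return a torch.utils.data.DataLoader.
        return iter(self.train_files)
```

The point is the separation of concerns: downloading in `prepare_data()`, per-run setup in `setup()`, iteration in `train_dataloader()`.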

hbredin commented 4 years ago

I just read the LightningDataModule documentation.

I understand you are suggesting that we replace the pyannote.database YAML configuration file (~/.pyannote/database.yml) with LightningDataModules.

This raises a bunch of questions:

```python
dm = AMIDataModule()
dm.prepare_data()  # downloads audio files from the official AMI website. Might be huge. Where to?
dm.setup()  # chunk duration and target definition depend on the task (VAD? SCD? EMB?),
            # but `dm` is not aware of the task. How should we handle that?
```

Also, I don't want to get rid of the pyannote.database configuration file completely, as some users may prefer this way of defining their datasets. So we should also have a wrapper for pyannote.database protocols:

```python
protocol = get_protocol('MyDataset.SpeakerDiarization.MyDatabase')
dm = ProtocolDataModule(protocol)
```

But most of all, task definition, model architecture, and dataset are quite intertwined, so it is difficult to completely separate data preparation from the rest:

mogwai commented 4 years ago

> dm.prepare_data() # download audio files from AMI official website. might be huge. where?

In fastai, a directory in the home folder (.fastai/data/) is created to hold downloaded data. We could do something similar and allow it to be configured; having a sensible default makes it faster to get started.
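A fastai-style default could be a small helper along these lines. This is only a sketch: the environment variable name and the ~/.pyannote/data path are assumptions, not existing pyannote conventions:

```python
import os
from pathlib import Path


def default_data_dir() -> Path:
    """Return the data cache directory, creating it if needed.

    Defaults to ~/.pyannote/data (an assumed location, mirroring
    fastai's .fastai/data/), overridable via the hypothetical
    PYANNOTE_DATA_DIR environment variable.
    """
    root = os.environ.get("PYANNOTE_DATA_DIR", "~/.pyannote/data")
    path = Path(root).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return path
```

A DataModule's `prepare_data()` could then download into `default_data_dir()` unless the user passes an explicit directory.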

> But most of all, task definition, model architecture, and dataset are quite intertwined so it is difficult to completely separate data preparation from the rest:

Yeah, that is a good point. I'm going to look at other libraries that use PyTorch and see whether they have this problem and how they solve it. My first thought, though, is that you could somehow associate the DataModules (e.g. the AMI Headset corpus) with the models (e.g. speech activity detection).
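One hedged way to sketch that association: let a small task object carry the task-dependent parameters (chunk duration, target definition) and hand it to `setup()`, so the DataModule itself stays task-agnostic. All names below are hypothetical, not existing pyannote or Lightning API:

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor (VAD, SCD, embedding, ...)."""
    name: str             # e.g. "vad", "scd", "emb"
    chunk_duration: float  # training chunk duration, in seconds


class TaskAwareDataModule:
    """Sketch of a DataModule that derives chunking from the task."""

    def setup(self, task: Task):
        # Instead of hard-coding chunk duration and targets, read them
        # from the task passed in by the training script.
        self.task_name = task.name
        self.chunk_duration = task.chunk_duration


dm = TaskAwareDataModule()
dm.setup(Task(name="vad", chunk_duration=2.0))
```

The same DataModule instance could then be reused across tasks by calling `setup()` with a different `Task`.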

```python
protocol = get_protocol('MyDataset.SpeakerDiarization.MyDatabase')
dm = ProtocolDataModule(protocol)
```

I definitely agree with keeping backward compatibility for pyannote.database; it is a good feature. We should create an API like the one you've suggested to convert a protocol into a DataModule.
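The conversion could be sketched as follows. `ProtocolDataModule` is the name from the discussion above, but its internals here are assumptions; the protocol is duck-typed (anything with a `train()` iterator, like pyannote.database protocols) so the snippet stays self-contained:

```python
class ProtocolDataModule:
    """Hypothetical adapter from a pyannote.database protocol
    to the LightningDataModule interface."""

    def __init__(self, protocol):
        self.protocol = protocol

    def prepare_data(self):
        # Nothing to download: with pyannote.database, audio files are
        # already on disk and paths come from the configuration file.
        pass

    def setup(self, stage=None):
        # Materialize the protocol's training files once.
        self.train_files = list(self.protocol.train())

    def train_dataloader(self):
        # A real version would wrap this in a torch DataLoader.
        return iter(self.train_files)
```

With this adapter, users who keep their ~/.pyannote/database.yml get a DataModule for free via `ProtocolDataModule(get_protocol(...))`, while others can write native DataModules directly.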