pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

LightningDataModules instead of yaml files #452

Closed by mogwai 3 years ago

mogwai commented 4 years ago

Currently the setup is very powerful for quickly configuring datasets, modules and pipelines for speaker diarization.

During the refactor to pytorch-lightning #227 #407, it might be a good idea to take full advantage of the library by using LightningDataModules for datasets that can be used with pyannote.

Following from the points made in #425, it might be better to promote a Python-based approach to these things:
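For illustration, a Python-based setup could follow the standard LightningDataModule contract (`prepare_data()` / `setup()` / `train_dataloader()`). The sketch below is a hypothetical, stdlib-only mock of that interface so it stays self-contained; the class name and data directory are assumptions, and a real version would subclass `pytorch_lightning.LightningDataModule` and return torch `DataLoader`s:

```python
from pathlib import Path


class AMIDataModule:
    """Hypothetical sketch following the LightningDataModule contract."""

    def __init__(self, data_dir: str = "~/.pyannote/data"):
        self.data_dir = Path(data_dir).expanduser()

    def prepare_data(self):
        # Download-once logic lives here (Lightning calls this on a
        # single process). Here we only create the target directory.
        self.data_dir.mkdir(parents=True, exist_ok=True)

    def setup(self, stage=None):
        # Build file lists / splits. A real version would also define
        # chunking and targets here.
        self.train_files = sorted(self.data_dir.glob("*.wav"))

    def train_dataloader(self):
        # A real version would return a torch.utils.data.DataLoader.
        return iter(self.train_files)
```

The point is the separation of concerns: downloading in `prepare_data()`, per-run setup in `setup()`, iteration in `train_dataloader()`.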

hbredin commented 4 years ago

I just read the LightningDataModule documentation.

I understand you are suggesting that we replace the pyannote.database YAML configuration file (~/.pyannote/database.yml) with LightningDataModules.

This raises a bunch of questions:

```python
dm = AMIDataModule()
dm.prepare_data()  # downloads audio files from the official AMI website. Might be huge. Where to?
dm.setup()  # chunk duration and target definition depend on the task (VAD? SCD? EMB?),
            # but `dm` is not aware of the task. How should we handle that?
```

Also, I don't want to get rid of the pyannote.database configuration file completely, as some users may prefer this way of defining their datasets. So we should also have a wrapper for pyannote.database protocols:

```python
protocol = get_protocol('MyDataset.SpeakerDiarization.MyDatabase')
dm = ProtocolDataModule(protocol)
```

But most of all, task definition, model architecture, and dataset are quite intertwined, so it is difficult to completely separate data preparation from the rest:

mogwai commented 4 years ago

> dm.prepare_data() # download audio files from AMI official website. might be huge. where?

In fastai, a directory in the home folder (.fastai/data/) is created to hold downloaded data. We could do something similar and allow it to be configured; having a sensible default makes it faster to get started.
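A fastai-style default could be a small helper along these lines. This is only a sketch: the environment variable name and the ~/.pyannote/data path are assumptions, not existing pyannote conventions:

```python
import os
from pathlib import Path


def default_data_dir() -> Path:
    """Return the data cache directory, creating it if needed.

    Defaults to ~/.pyannote/data (an assumed location, mirroring
    fastai's .fastai/data/), overridable via the hypothetical
    PYANNOTE_DATA_DIR environment variable.
    """
    root = os.environ.get("PYANNOTE_DATA_DIR", "~/.pyannote/data")
    path = Path(root).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return path
```

A DataModule's `prepare_data()` could then download into `default_data_dir()` unless the user passes an explicit directory.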

> But most of all, task definition, model architecture, and dataset are quite intertwined so it is difficult to completely separate data preparation from the rest:

Yeah, that is a good point. I'm going to look at other libraries that use PyTorch and see whether they have this problem and how they solve it. My first thought, though, is that you could somehow associate the DataModules (e.g. the AMI Headset corpus) with the models (e.g. speech activity detection).
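One hedged way to sketch that association: let a small task object carry the task-dependent parameters (chunk duration, target definition) and hand it to `setup()`, so the DataModule itself stays task-agnostic. All names below are hypothetical, not existing pyannote or Lightning API:

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor (VAD, SCD, embedding, ...)."""
    name: str             # e.g. "vad", "scd", "emb"
    chunk_duration: float  # training chunk duration, in seconds


class TaskAwareDataModule:
    """Sketch of a DataModule that derives chunking from the task."""

    def setup(self, task: Task):
        # Instead of hard-coding chunk duration and targets, read them
        # from the task passed in by the training script.
        self.task_name = task.name
        self.chunk_duration = task.chunk_duration


dm = TaskAwareDataModule()
dm.setup(Task(name="vad", chunk_duration=2.0))
```

The same DataModule instance could then be reused across tasks by calling `setup()` with a different `Task`.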

```python
protocol = get_protocol('MyDataset.SpeakerDiarization.MyDatabase')
dm = ProtocolDataModule(protocol)
```

I definitely agree with keeping backward compatibility for pyannote.database; it is a good feature. We should create an API like the one you've suggested to convert a protocol into a DataModule.
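The conversion could be sketched as follows. `ProtocolDataModule` is the name from the discussion above, but its internals here are assumptions; the protocol is duck-typed (anything with a `train()` iterator, like pyannote.database protocols) so the snippet stays self-contained:

```python
class ProtocolDataModule:
    """Hypothetical adapter from a pyannote.database protocol
    to the LightningDataModule interface."""

    def __init__(self, protocol):
        self.protocol = protocol

    def prepare_data(self):
        # Nothing to download: with pyannote.database, audio files are
        # already on disk and paths come from the configuration file.
        pass

    def setup(self, stage=None):
        # Materialize the protocol's training files once.
        self.train_files = list(self.protocol.train())

    def train_dataloader(self):
        # A real version would wrap this in a torch DataLoader.
        return iter(self.train_files)
```

With this adapter, users who keep their ~/.pyannote/database.yml get a DataModule for free via `ProtocolDataModule(get_protocol(...))`, while others can write native DataModules directly.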