Closed: mogwai closed this issue 3 years ago
I just read the LightningDataModule documentation. I understand that you suggest replacing the pyannote.database YAML configuration file (`~/.pyannote/database.yml`) with `LightningDataModule`.
This raises a bunch of questions:
```python
dm = AMIDataModule()
dm.prepare_data()  # download audio files from AMI official website. might be huge. where?
dm.setup()         # chunk duration and target definition depend on the task (VAD? SCD? EMB?). but `dm` is not aware of the task. how should we handle that?
```
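One way to answer the second question is to make the data module task-aware at construction time, so that `setup()` can pick task-dependent settings. This is only a sketch: `AMIDataModule`, the `task` parameter, and the chunk durations below are all hypothetical, and the class mimics the `LightningDataModule` interface without importing pytorch-lightning.

```python
from pathlib import Path

class AMIDataModule:
    """Hypothetical sketch mimicking the LightningDataModule interface."""

    def __init__(self, task: str = "vad",
                 data_dir: Path = Path.home() / ".pyannote" / "data"):
        self.task = task              # "vad", "scd", or "emb" (illustrative names)
        self.data_dir = Path(data_dir)

    def prepare_data(self):
        # Called once per node: this is where the AMI audio would be
        # downloaded into self.data_dir (possibly huge, hence the question).
        pass

    def setup(self, stage=None):
        # Task-dependent settings: chunk duration and target definition.
        # The durations below are made-up placeholders.
        self.chunk_duration = {"vad": 2.0, "scd": 2.0, "emb": 0.5}[self.task]

dm = AMIDataModule(task="vad")
dm.prepare_data()
dm.setup()
print(dm.chunk_duration)  # 2.0
```

Passing the task explicitly keeps `prepare_data()` task-independent (download once) while letting `setup()` specialize the targets.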
Also, I don't want to get rid of the pyannote.database configuration file completely, as some users may prefer this way of defining their datasets. So we should also have a wrapper for pyannote.database protocols:
```python
protocol = get_protocol('MyDataset.SpeakerDiarization.MyDatabase')
dm = ProtocolDataModule(protocol)
```
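A minimal sketch of what such a wrapper could look like. `ProtocolDataModule` is the hypothetical class suggested above; the `Protocol` class here is a pure-Python stand-in for a pyannote.database protocol (the real one exposes train/development/test iterators over annotated files), so the example runs without any dependencies.

```python
class Protocol:
    """Stand-in for a pyannote.database protocol (illustrative only)."""

    def __init__(self, name, files):
        self.name = name
        self._files = files

    def train(self):
        # the real protocol yields dicts describing annotated files
        yield from self._files

class ProtocolDataModule:
    """Wraps an existing protocol so legacy database.yml setups keep working."""

    def __init__(self, protocol):
        self.protocol = protocol

    def prepare_data(self):
        # files are already on disk: nothing to download
        pass

    def setup(self, stage=None):
        self.train_files = list(self.protocol.train())

protocol = Protocol("MyDataset.SpeakerDiarization.MyDatabase",
                    files=[{"uri": "file1"}, {"uri": "file2"}])
dm = ProtocolDataModule(protocol)
dm.setup()
print(len(dm.train_files))  # 2
```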
But most of all, task definition, model architecture, and dataset are quite intertwined so it is difficult to completely separate data preparation from the rest:
- the model architecture depends on the dataset. For instance, the number of classes in the final layer of a speaker embedding network trained with cross-entropy depends on the number of speakers in the dataset, and I would like to keep the current feature that this final layer is built automatically. https://github.com/pyannote/pyannote-audio/blob/a17fe678d23fa3ce4cfe0f69c4b9e31279f83903/pyannote/audio/train/model.py#L123-L131
- the model architecture also depends on the task itself. For instance, the number and meaning of classes in the final layer may vary between VAD and SCD: https://github.com/pyannote/pyannote-audio/blob/a17fe678d23fa3ce4cfe0f69c4b9e31279f83903/pyannote/audio/train/model.py#L123-L131
- the resolution of the targets (every 10ms? every 25ms?) depends on the model architecture, and I would like to keep the current feature that this is handled automatically. https://github.com/pyannote/pyannote-audio/blob/a17fe678d23fa3ce4cfe0f69c4b9e31279f83903/pyannote/audio/labeling/tasks/speech_activity_detection.py#L67 https://github.com/pyannote/pyannote-audio/blob/a17fe678d23fa3ce4cfe0f69c4b9e31279f83903/pyannote/audio/models/pyannet.py#L185-L192
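The three dependencies above can be sketched as a two-way handshake: the dataset/task side tells the model how many output classes it needs, and the model tells the data side at which resolution to generate targets. All names below (`TaskSpecification`, `build`, `frame_step_ms`) are illustrative, not the actual pyannote API.

```python
class TaskSpecification:
    """What the dataset/task side knows and the model needs."""

    def __init__(self, num_classes):
        self.num_classes = num_classes

class Model:
    """What the model side knows and the data side needs."""

    def __init__(self, frame_step_ms=10):
        self.frame_step_ms = frame_step_ms  # one output frame every 10 ms
        self.classifier_dim = None

    def build(self, specifications):
        # final layer sized automatically from the dataset/task specs,
        # mirroring the automatic-final-layer feature referenced above
        self.classifier_dim = specifications.num_classes

specs = TaskSpecification(num_classes=138)  # e.g. speakers in the training set
model = Model()
model.build(specs)
targets_per_second = 1000 // model.frame_step_ms  # target resolution comes from the model
print(model.classifier_dim, targets_per_second)  # 138 100
```

The point of the sketch is that neither side can be finalized alone: `build()` must run after the specs are known, and target generation must run after the model exists.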
> dm.prepare_data() # download audio files from AMI official website. might be huge. where?
In fastai, data is downloaded to a directory created in the home folder: `.fastai/data/`. We could do something similar and allow it to be configured. Having a default makes it faster to get started.
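A fastai-style default with a configurable override could look like this. The `PYANNOTE_DATA` environment variable and the `~/.pyannote/data` default are hypothetical choices for illustration.

```python
import os
from pathlib import Path

def get_data_dir() -> Path:
    """Return the download directory: an env-var override if set,
    otherwise a default under the home folder (names are assumptions)."""
    default = Path.home() / ".pyannote" / "data"
    return Path(os.environ.get("PYANNOTE_DATA", default))

# users who want a different location just set the variable:
os.environ["PYANNOTE_DATA"] = "/tmp/pyannote-data"
print(get_data_dir())  # /tmp/pyannote-data
```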
> But most of all, task definition, model architecture, and dataset are quite intertwined so it is difficult to completely separate data preparation from the rest:
Yeah, that is a good point. I'm going to look at other libraries that use PyTorch and see whether they have this problem and how they solve it. My first thought, though, is that you could associate the DataModules (e.g. AMI Headset Corpus) and Models (e.g. Speaker Activity Detection) together somehow.
> protocol = get_protocol('MyDataset.SpeakerDiarization.MyDatabase')
> dm = ProtocolDataModule(protocol)
I definitely agree with backward compatibility for pyannote.database. This is a good feature. We should create an API like you've suggested to allow a conversion to a DataModule.
Currently the setup is very powerful for quickly configuring datasets, modules and pipelines for speaker diarization.
During the refactor to pytorch-lightning (#227, #407), it might be a good idea to take full advantage of the library by using LightningDataModules for datasets that can be used with pyannote. Following from the points made in #425, it might be better to promote a Python-based approach to these things: