Support On-the-fly Features Extraction

✨ Description

This PR adds on-the-fly feature extraction for large-scale data preprocessing. Its main benefits are:

Saves the disk space otherwise needed to store extracted features
Avoids getting stuck at the feature extraction stage because of computing platform issues
Simplifies the preprocessing pipeline so that users can focus on training

How to use?

With on-the-fly feature extraction, the workflow for future Amphion models is:

Data preprocessing, as before
- Split the dataset into train/val sets.
- Generate the metadata file (.json), as before. utt["Path"] and utt["Duration"] are its two key elements.
- Compute the metadata statistics, as before.
~~Features Preprocess~~ (No feature preprocessing any more!)
Training
- In the config file, set preprocess.features_extraction_mode to online.
- Implement your [Task]OnlineDataset and [Model]Trainer.
Inference, as before
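As a rough illustration of the two inputs this workflow depends on, the sketch below parses a config fragment and a metadata file and checks them. Only preprocess.features_extraction_mode, its online value, and the Path and Duration keys come from this PR; the file paths, the durations, the assumption that Duration is in seconds, and the check_metadata helper are hypothetical.

```python
import json

# Config fragment: only this key and its "online" value are taken from the PR.
config_text = """
{
  "preprocess": {
    "features_extraction_mode": "online"
  }
}
"""

# Hypothetical metadata entries; Duration is assumed to be in seconds.
metadata_text = """
[
  {"Path": "data/utt_0001.wav", "Duration": 3.2},
  {"Path": "data/utt_0002.wav", "Duration": 5.7}
]
"""


def check_metadata(utts):
    """Return the total duration, raising if a key element is missing."""
    total = 0.0
    for utt in utts:
        for key in ("Path", "Duration"):
            if key not in utt:
                raise KeyError(f"metadata entry is missing {key!r}: {utt}")
        total += float(utt["Duration"])
    return total


cfg = json.loads(config_text)
utts = json.loads(metadata_text)
total_seconds = check_metadata(utts)
```

Because no features are written to disk, the metadata file and the raw audio it points to are the only artifacts the training stage needs from preprocessing.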

Currently, DiffWaveNetSVC supports on-the-fly feature extraction. See its two main classes: SVCOnlineDataset and DiffusionTrainer.
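One way a trainer could act on the mode flag is sketched below. This is not the DiffusionTrainer API: ToyTrainer, extract_fn, prepare_batch, and the batch keys (wav, mel, peak) are hypothetical names, and treating every value other than online as the offline path is an assumption.

```python
class ToyTrainer:
    """Hypothetical trainer that extracts features at training time when online."""

    def __init__(self, cfg, extract_fn):
        mode = cfg["preprocess"]["features_extraction_mode"]
        self.online = mode == "online"
        # extract_fn maps a list of raw waveforms to a dict of feature lists.
        self.extract_fn = extract_fn

    def prepare_batch(self, batch):
        if self.online:
            # Online: compute features from the raw waveforms in the batch.
            return {**batch, **self.extract_fn(batch["wav"])}
        if "mel" not in batch:
            # Offline (assumed): features must already have been loaded from disk.
            raise KeyError("offline mode expects precomputed features in the batch")
        return batch


def peak_amplitude(wavs):
    """Stand-in extractor: peak absolute amplitude per waveform."""
    return {"peak": [max(abs(x) for x in w) for w in wavs]}


cfg = {"preprocess": {"features_extraction_mode": "online"}}
trainer = ToyTrainer(cfg, extract_fn=peak_amplitude)
prepared = trainer.prepare_batch({"wav": [[0.1, -0.4], [0.2, 0.3]]})
```

Keeping the extractor behind a single callable means the dataset stays free of feature logic, which matches the idea that the online dataset returns only the minimum elements.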

👨‍💻 Main Changes

model.base.base_dataset.py:
- Rename the original BaseDataset and BaseCollator to BaseOfflineDataset and BaseOfflineCollator.
- Implement BaseOnlineDataset and BaseOnlineCollator. The __getitem__ function returns only the minimum elements (such as the raw waveform and its duration).
processors.audio_features_extractor.py:
- In Amphion's latest technical report, we group audio generation tasks into three categories: Text to Waveform, Descriptive Text to Waveform, and Waveform to Waveform. Accordingly, we can implement three kinds of feature extraction: Text Features, Descriptive Text Features, and Waveform Features.
- In audio_features_extractor.py, I have integrated the common waveform feature extraction operations (such as Mel Spectrogram, F0, Energy, and Semantic Features). Note that I have not yet integrated some of the features required by vocoders. @VocodexElysium
- I have created text_features_extractor.py and descriptive_text_features_extractor.py for the future refactoring, integration, and extension of TTS, TTA, and TTM. @HeCheng0625 @lmxue @HarryHe11 @viewfinder-annn
Support for DiffWaveNetSVC
- Two main classes: SVCOnlineDataset and DiffusionTrainer.
Refactor and improve some code
- For example, re-organize the config folder as Amphion/config/[Task]/[Model].json.
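The online dataset/collator split described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the BaseOnlineDataset or audio_features_extractor.py code: ToyOnlineDataset, collate, frame_energy, extract_features_online, the load_wav callable, and the frame and hop sizes are all hypothetical, and frame-level RMS energy stands in for the real extractors (Mel Spectrogram, F0, Energy, Semantic Features). A real implementation would presumably subclass torch.utils.data.Dataset and work on tensors.

```python
import math


class ToyOnlineDataset:
    """Returns only the minimum elements per utterance: raw waveform and duration."""

    def __init__(self, metadata, load_wav):
        self.metadata = list(metadata)
        self.load_wav = load_wav  # hypothetical callable: path -> list of samples

    def __len__(self):
        return len(self.metadata)

    def __getitem__(self, index):
        utt = self.metadata[index]
        wav = list(self.load_wav(utt["Path"]))
        return {"wav": wav, "wav_len": len(wav), "duration": utt["Duration"]}


def collate(items):
    """Zero-pad every waveform to the longest one in the batch."""
    max_len = max(item["wav_len"] for item in items)
    return {
        "wav": [item["wav"] + [0.0] * (max_len - item["wav_len"]) for item in items],
        "wav_len": [item["wav_len"] for item in items],
    }


def frame_energy(wav, frame_size, hop_size):
    """RMS energy per frame; a final short frame is treated as zero-padded."""
    last_start = max(len(wav) - frame_size, 0)
    energies = []
    for start in range(0, last_start + 1, hop_size):
        frame = wav[start:start + frame_size]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return energies


def extract_features_online(batch, frame_size=4, hop_size=2):
    """Compute features from the padded raw waveforms at batch time."""
    batch["energy"] = [frame_energy(w, frame_size, hop_size) for w in batch["wav"]]
    return batch


# Usage with an in-memory stand-in for audio files.
fake_audio = {"a.wav": [1.0] * 8, "b.wav": [0.5] * 4}
metadata = [
    {"Path": "a.wav", "Duration": 8 / 16000},
    {"Path": "b.wav", "Duration": 4 / 16000},
]
dataset = ToyOnlineDataset(metadata, load_wav=fake_audio.__getitem__)
batch = extract_features_online(collate([dataset[0], dataset[1]]))
```

Whether extraction runs in the collator or inside the trainer is a design choice this sketch leaves open; here it is a separate function applied to the collated batch.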

✅ Checklist

[x] Code has been reviewed
[x] Code complies with the project's code standards and best practices
[x] Code has passed all tests
[x] Code does not affect the normal use of existing features
[x] Code has been commented properly
[x] Documentation has been updated (if applicable)
[x] Demo/checkpoint has been attached (if applicable)

open-mmlab / Amphion

Support On-the-fly Features Extraction #145