open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

Support On-the-fly Features Extraction #145

Closed RMSnow closed 8 months ago

RMSnow commented 8 months ago

✨ Description

Support on-the-fly feature extraction for large-scale data preprocessing. Its strengths can be summarized as:

How to use?

With on-the-fly feature extraction, the workflow for future Amphion models is:

  1. Data Preprocess like before
    • Split the dataset into train/val sets.
    • Generate the metadata file (.json) like before. utt["Path"] and utt["Duration"] are the two key elements.
    • Compute the metadata statistics like before.
  2. Features Preprocess (No features preprocess any more!)
  3. Training
    • In the config file, set preprocess.features_extraction_mode to online
    • Implement your [Task]OnlineDataset and [Model]Trainer
  4. Inference like before
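The data side of the workflow above can be sketched roughly as follows. This is a minimal illustration, not Amphion's actual code: only utt["Path"], utt["Duration"], and preprocess.features_extraction_mode = online come from this PR. The "Uid" field, the file paths, the helper names, and the "offline" fallback value are assumptions made for the example.

```python
import json

# Metadata entries. Only "Path" and "Duration" are named in this PR;
# "Uid" and the paths below are hypothetical placeholders.
metadata = [
    {"Uid": "utt_0001", "Path": "data/wavs/utt_0001.wav", "Duration": 3.52},
    {"Uid": "utt_0002", "Path": "data/wavs/utt_0002.wav", "Duration": 7.10},
]

# Config fragment: the key and the value "online" are taken from this PR.
config = {"preprocess": {"features_extraction_mode": "online"}}


def uses_online_extraction(cfg: dict) -> bool:
    """Return True when features should be extracted on the fly."""
    # Assumption: a missing key falls back to the older offline mode.
    mode = cfg.get("preprocess", {}).get("features_extraction_mode", "offline")
    return mode == "online"


def validate_metadata(utts: list) -> None:
    """Check that every utterance carries the two key elements."""
    for utt in utts:
        missing = {"Path", "Duration"} - utt.keys()
        if missing:
            raise KeyError(f"utterance is missing {sorted(missing)}")


validate_metadata(metadata)
config_json = json.dumps(config, indent=2)  # what the .json fragment would look like
print(config_json)
print("online extraction:", uses_online_extraction(config))
```

Because step 2 no longer writes features to disk, the metadata file and this one config switch are all the training step needs besides the raw audio.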

Currently, DiffWaveNetSVC is supported with on-the-fly feature extraction. See the two main classes: SVCOnlineDataset and DiffusionTrainer.

👨‍💻 Main Changes

  1. model.base.base_dataset.py:
    • Rename the original BaseDataset and BaseCollator to BaseOfflineDataset and BaseOfflineCollator
    • Implement the BaseOnlineDataset and BaseOnlineCollator. The __getitem__ function returns only the minimum elements (such as the raw waveform and its duration)
  2. processors.audio_features_extractor.py:
    • In Amphion's latest technical report, we formulate the audio generation tasks into three categories: Text to Waveform, Descriptive Text to Waveform, and Waveform to Waveform. Therefore, we can also implement three kinds of feature extraction: Text Features, Descriptive Text Features, and Waveform Features.
    • In audio_features_extractor.py, I have integrated the common waveform feature extraction operations (such as Mel Spectrogram, F0, Energy, and Semantic Features). Note that I have not integrated some of the features required by vocoders. @VocodexElysium
    • I have created text_features_extractor.py and descriptive_text_features_extractor.py for the future refactoring, integration, and extension of TTS, TTA, and TTM. @HeCheng0625 @lmxue @HarryHe11 @viewfinder-annn
  3. Support for DiffWaveNetSVC
  4. Refactor and improve some code
    • For example, reorganize the config folder as Amphion/config/[Task]/[Model].json.
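The split between an online dataset and an online collator can be sketched roughly as below. This is a toy, dependency-free illustration of the idea, not the actual BaseOnlineDataset or BaseOnlineCollator: the class names, the injected load_wav function, and the choice of frame-level RMS energy as the only feature are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, List

Waveform = List[float]


class ToyOnlineDataset:
    """Hypothetical stand-in: __getitem__ returns only the minimum elements."""

    def __init__(self, metadata: List[Dict], load_wav: Callable[[str], Waveform]):
        self.metadata = metadata
        self.load_wav = load_wav  # injected so the sketch needs no audio library

    def __len__(self) -> int:
        return len(self.metadata)

    def __getitem__(self, index: int) -> Dict:
        utt = self.metadata[index]
        # No pre-extracted features are read from disk here.
        return {"wav": self.load_wav(utt["Path"]), "duration": utt["Duration"]}


def frame_energy(wav: Waveform, frame_size: int) -> List[float]:
    """RMS energy of each non-overlapping frame."""
    frames = [wav[i:i + frame_size] for i in range(0, len(wav), frame_size)]
    return [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]


class ToyOnlineCollator:
    """Hypothetical stand-in: features are extracted when the batch is built."""

    def __init__(self, frame_size: int = 4):
        self.frame_size = frame_size

    def __call__(self, batch: List[Dict]) -> Dict:
        energies = [frame_energy(item["wav"], self.frame_size) for item in batch]
        max_len = max(len(e) for e in energies)
        return {
            # Zero-pad each feature sequence to the longest one in the batch.
            "energy": [e + [0.0] * (max_len - len(e)) for e in energies],
            "frame_len": [len(e) for e in energies],
            "duration": [item["duration"] for item in batch],
        }


def fake_load_wav(path: str) -> Waveform:
    # Stand-in for real audio loading: constant waveforms of two lengths.
    return [1.0] * (8 if path.endswith("a.wav") else 4)


metadata = [{"Path": "a.wav", "Duration": 0.8}, {"Path": "b.wav", "Duration": 0.4}]
dataset = ToyOnlineDataset(metadata, fake_load_wav)
collate = ToyOnlineCollator(frame_size=4)
batch = collate([dataset[i] for i in range(len(dataset))])
print(batch["energy"], batch["frame_len"])
```

Keeping __getitem__ minimal means each item stays cheap to load, and the feature set can be changed in the collator (or a later extractor) without rerunning any preprocessing over the whole corpus.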

✅ Checklist

RMSnow commented 8 months ago

The recipe should be updated to provide instructions for online feature extraction.

@lmxue Good advice. I plan to update the recipe in the future. This PR is to prepare a codebase for our recent internal research.