open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

[BUG]: ns2_dataset.py does not have these two parts, phones and num_frames, which are needed in ns2_trainer.py #171

Open · a897456 opened this issue 5 months ago

a897456 commented 5 months ago

https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L121
https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L131
https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_trainer.py#L269

These two fields, phones and num_frames, are not included in the train.json that will be used by ns2_trainer.py.
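A quick way to confirm this is to check whether the generated metadata actually carries both fields. A minimal sketch, assuming train.json is a list of per-utterance dicts keyed by "Uid" (the path below is a placeholder, not from the repo):

import json

# Placeholder path; point this at your own processed train.json.
with open("data/processed/train.json", "r") as f:
    metadata = json.load(f)

missing = [u["Uid"] for u in metadata
           if "phones" not in u or "num_frames" not in u]
print(f"{len(missing)} of {len(metadata)} entries lack phones/num_frames")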

shreeshailgan commented 4 months ago

I am also facing the same problem. You can work around it temporarily:
https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L121
You can replace the above line with

# Read the phone sequence for this utterance from its per-utterance ".phone" file
with open(os.path.join(self.phone_dir, uid + ".phone"), "r") as f:
    self.utt2phone[utt] = f.read().strip()

while setting

self.phone_dir = os.path.join(processed_data_dir, 'phones')

in the __init__ of NS2Dataset

You can comment out the parts that use frame counts, since they are only needed for dynamic batching. Also set "use_dynamic_batchsize": false in exp_config.json. A consolidated sketch of these changes follows below.
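Putting the pieces together, a minimal sketch of the workaround (the class shape, the metadata field names "Uid" and "Dataset", and processed_data_dir are assumptions for illustration, not verbatim code from ns2_dataset.py):

import os

class NS2DatasetPatched:
    """Sketch of the workaround; not the actual Amphion NS2Dataset."""

    def __init__(self, processed_data_dir, metadata):
        # Directory that holds one "<uid>.phone" file per utterance.
        self.phone_dir = os.path.join(processed_data_dir, "phones")
        self.utt2phone = {}
        for utt_info in metadata:
            uid = utt_info["Uid"]
            utt = "{}_{}".format(utt_info["Dataset"], uid)
            # Read the phone sequence from disk instead of expecting it in train.json.
            with open(os.path.join(self.phone_dir, uid + ".phone"), "r") as f:
                self.utt2phone[utt] = f.read().strip()
        # Any num_frames / frame-count bookkeeping can be commented out,
        # provided "use_dynamic_batchsize" is set to false in exp_config.json.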

HeCheng0625 commented 4 months ago

Hi, you need to generate the phone sequence and record the number of frames for each sample.
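For illustration only, a rough sketch of augmenting train.json with both fields. The paths, the metadata layout (list of dicts with "Uid"), and the use of librosa are assumptions; the hop size must match the downsampling rate of the codec actually used (see the paper quote below):

import json
import os

import librosa  # assumption: any library that can load the waveform works

def augment_metadata(train_json, phone_dir, wav_dir, hop_size=200):
    """Add "phones" and "num_frames" to every entry of train.json (sketch)."""
    with open(train_json, "r") as f:
        metadata = json.load(f)
    for utt_info in metadata:
        uid = utt_info["Uid"]
        # Phone sequence produced earlier by the phonemizer, one file per utterance.
        with open(os.path.join(phone_dir, uid + ".phone"), "r") as f:
            utt_info["phones"] = f.read().strip()
        # Number of codec frames = number of audio samples // downsampling rate.
        wav, sr = librosa.load(os.path.join(wav_dir, uid + ".wav"), sr=None)
        utt_info["num_frames"] = len(wav) // hop_size
    with open(train_json, "w") as f:
        json.dump(metadata, f, indent=2, ensure_ascii=False)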

shreeshailgan commented 4 months ago

Does "number of frames" mean the number of phones in the phone sequence?

HarryHe11 commented 4 months ago

> Does "number of frames" mean the number of phones in the phone sequence?

Hi @shreeshailgan , according to the NS2 paper, "As shown in Figure 2, our neural audio codec consists of an audio encoder, a residual vector-quantizer (RVQ), and an audio decoder: 1) The audio encoder consists of several convolutional blocks with a total downsampling rate of 200 for 16KHz audio, i.e., each frame corresponds to a 12.5ms speech segment." You could refer to https://arxiv.org/pdf/2304.09116.pdf for more details.
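In other words, num_frames is the frame count of the codec representation of the audio, not the phone count. A quick sanity check of that relationship (pure arithmetic, using the 16 kHz sampling rate and 200x downsampling quoted above; the 3-second duration is just an example):

sr = 16000              # sampling rate from the paper
downsample = 200        # total downsampling rate of the codec encoder
duration_s = 3.0        # example utterance length
num_samples = int(duration_s * sr)      # 48000 samples
num_frames = num_samples // downsample  # 240 frames
frame_ms = 1000 * downsample / sr       # 12.5 ms per frame, matching the paper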