Open a897456 opened 5 months ago
I am also facing the same problem. As a temporary workaround, replace this line:

https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L121

with:

```python
with open(os.path.join(self.phone_dir, utt + ".phone"), "r") as f:
    self.utt2phone[utt] = f.read().strip()
```

while setting

```python
self.phone_dir = os.path.join(processed_data_dir, "phones")
```

in the `__init__` of `NS2Dataset`.

You can simply comment out the parts that use frame counts, since they are only needed for dynamic batching. Also set `"use_dynamic_batchsize": false` in `exp_config.json`.
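Putting the two changes above together, here is a minimal sketch of what the patched loading logic looks like. This is an illustrative standalone class, not the actual `NS2Dataset`; the constructor arguments and the `phones/` layout are assumptions:

```python
import os


class NS2DatasetPatch:
    """Sketch of the workaround: read phone sequences from per-utterance
    files instead of expecting them inside the metadata json.

    Assumes `processed_data_dir` contains a `phones/` subdirectory with one
    `<utt>.phone` text file per utterance id.
    """

    def __init__(self, processed_data_dir, utt_ids):
        # Point phone_dir at the directory of per-utterance phone files.
        self.phone_dir = os.path.join(processed_data_dir, "phones")
        self.utt2phone = {}
        for utt in utt_ids:
            # Read each phone sequence directly from disk.
            with open(os.path.join(self.phone_dir, utt + ".phone"), "r") as f:
                self.utt2phone[utt] = f.read().strip()
```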
Hi, you need to generate the phone sequence and record the number of frames for each sample.
does number of frames mean the number of phones in the phone sequence?
Hi @shreeshailgan , according to the NS2 paper, "As shown in Figure 2, our neural audio codec consists of an audio encoder, a residual vector-quantizer (RVQ), and an audio decoder: 1) The audio encoder consists of several convolutional blocks with a total downsampling rate of 200 for 16KHz audio, i.e., each frame corresponds to a 12.5ms speech segment." You could refer to https://arxiv.org/pdf/2304.09116.pdf for more details.
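Given the downsampling rate of 200 for 16 kHz audio quoted above, the frame count follows from the waveform length rather than the number of phones. A small sketch (the exact rounding used in the codebase may differ):

```python
def num_codec_frames(num_samples: int, hop: int = 200) -> int:
    """Frames produced by a codec with total downsampling rate `hop`.

    For 16 kHz audio and hop 200, each frame covers 12.5 ms of speech.
    """
    # Ceiling division: a partial trailing hop still yields one frame.
    return -(-num_samples // hop)


# One second of 16 kHz audio -> 80 frames of 12.5 ms each.
print(num_codec_frames(16000))  # 80
```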
https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L121

https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L131

https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_trainer.py#L269
These two elements (the phone sequence and the frame count) are not integrated into `train.json`, which is used by `ns2_trainer.py`.
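One way to resolve this is to merge both fields into the metadata file before training. The sketch below assumes a `train.json` containing a list of entries and uses illustrative key names (`Uid`, `Duration`, `phone`, `num_frames`); adjust them to whatever your processed metadata actually uses:

```python
import json
import os


def add_phone_and_frames(train_json_path, phone_dir, sample_rate=16000, hop=200):
    """Merge phone sequences and frame counts into the train.json metadata.

    Key names here are assumptions, not Amphion's confirmed schema.
    """
    with open(train_json_path, "r") as f:
        entries = json.load(f)
    for entry in entries:
        uid = entry["Uid"]
        # Attach the phone sequence generated during preprocessing.
        with open(os.path.join(phone_dir, uid + ".phone"), "r") as f:
            entry["phone"] = f.read().strip()
        # Derive the codec frame count from the utterance duration
        # (downsampling rate 200 for 16 kHz audio).
        entry["num_frames"] = int(entry["Duration"] * sample_rate / hop)
    with open(train_json_path, "w") as f:
        json.dump(entries, f, indent=2)
```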