lucidrains / naturalspeech2-pytorch

Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch
MIT License

unconditional version seems not to work correctly #30

Open ethanyhzhang opened 11 months ago

ethanyhzhang commented 11 months ago

Hi, thanks for the great job!

I've followed the training process below and used pure audio without conditional input.

I've tried a dataset with 110,000+ audio files and also a dataset with only 500 audio files, but after training for 100,000 iterations, the output from the sampling pipeline with the trained network was meaningless.

Could you please tell me if there is anything wrong? Or have you trained something meaningful?

Thanks.

from naturalspeech2_pytorch import Trainer

trainer = Trainer(
    diffusion_model = diffusion,     # diffusion model + codec from above
    folder = '/path/to/speech',
    train_batch_size = 16,
    gradient_accumulate_every = 2,
)

trainer.train()

ethanyhzhang commented 11 months ago

FYI, the training loss dropped from 20+ at the beginning to about 0.3 at the end.

lanadelray12 commented 10 months ago

Hello, may I ask you a few questions?

CHK-0000 commented 6 months ago

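Hello, thank you for the great work!

I followed the training process below and used pure audio without conditional input.

I tried a dataset with more than 110,000 audio files and also a dataset with only 500 audio files, but after 100,000 training iterations, the output from the sampling pipeline with the trained network was meaningless.

Could you tell me if something is wrong? Or have you trained anything meaningful?

Thanks.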

from naturalspeech2_pytorch import Trainer

trainer = Trainer(
    diffusion_model = diffusion,     # diffusion model + codec from above
    folder = '/path/to/speech',
    train_batch_size = 16,
    gradient_accumulate_every = 2,
)

trainer.train()

Will the code above train? When I put an audio file in the folder and run the code, I get an error.


import torch

from naturalspeech2_pytorch import Trainer, EncodecWrapper, Model, NaturalSpeech2, SpeechPromptEncoder
from multiprocessing import freeze_support

codec = EncodecWrapper()

def main():
    model = Model(
        dim = 128,
        depth = 6,
        dim_prompt = 512,
        cond_drop_prob = 0.25,
        condition_on_prompt = True
    )

    diffusion = NaturalSpeech2(
        model = model,
        codec = codec,
        timesteps = 50
    )

    raw_audio = torch.randn(4, 327680)
    prompt = torch.randn(4, 32768)

    text = torch.randint(0, 100, (4, 100))
    text_lens = torch.tensor([100, 50 , 80, 100])

    # forwards and backwards

    loss = diffusion(
        audio = raw_audio,
        text = text,
        text_lens = text_lens,
        prompt = prompt,
    )

    loss.backward()

    # after much training

    generated_audio = diffusion.sample(
        length = 1024,
        text = text,
        prompt = prompt,
    )

    trainer = Trainer(
        diffusion_model = diffusion,
        folder = 'C:\\naturalspeech2-pytorch\\0049_G1A2E7_JHJ',
        train_batch_size = 16,
        gradient_accumulate_every = 2,
        train_num_steps = 5,
        save_and_sample_every = 100,
    )

    trainer.train()
    trainer.save_checkpoint('C:\\naturalspeech2-pytorch\\ansunghun\\checkpoint.pt')

if __name__ == '__main__':
    freeze_support()
    main()


Traceback (most recent call last):
  File "c:\naturalspeech2-pytorch\test.py", line 62, in <module>
    main()
  File "c:\naturalspeech2-pytorch\test.py", line 57, in main
    trainer.train()
  File "c:\naturalspeech2-pytorch\naturalspeech2_pytorch\naturalspeech2_pytorch.py", line 1875, in train
    loss = self.model(data)
  File "c:\Users\user.conda\envs\svc\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "c:\Users\user.conda\envs\svc\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "c:\naturalspeech2-pytorch\naturalspeech2_pytorch\naturalspeech2_pytorch.py", line 1522, in forward
    text_max_length = text.shape[-1]
AttributeError: 'NoneType' object has no attribute 'shape'
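
Reading the traceback: `train` calls `self.model(data)` with only the audio batch, so inside `forward` the `text` argument is still `None` when `text.shape[-1]` is evaluated. Below is a minimal stand-alone sketch of that failure mode. `Batch`, `forward_unguarded`, and `forward_guarded` are hypothetical stand-ins for illustration, not the library's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Batch:
    """Hypothetical stand-in for a tensor; it only carries a shape."""
    shape: Tuple[int, ...]


def forward_unguarded(audio: Batch, text: Optional[Batch] = None) -> int:
    # Mirrors the failing line in the traceback: text is assumed to be present.
    return text.shape[-1]


def forward_guarded(audio: Batch, text: Optional[Batch] = None) -> int:
    # Illustrative guard: fail early with a readable message when text is missing.
    if text is None:
        raise ValueError("text was not passed, but this code path needs it")
    return text.shape[-1]


audio = Batch(shape=(4, 327680))

# Only audio is passed, as in `self.model(data)` from the traceback.
try:
    forward_unguarded(audio)
    unguarded_error = ""
except AttributeError as err:
    unguarded_error = str(err)

# With text supplied, the same lookup succeeds.
text = Batch(shape=(4, 100))
text_max_length = forward_guarded(audio, text)
```

If this reading is right, the script fails because the training loop supplies audio only while the model's forward pass expects text as well; it is not a problem with the audio files in the folder.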

a897456 commented 5 months ago

Hi, thanks for the great job!

I've followed the training process below and used pure audio without conditional input.

I've tried a dataset with 110,000+ audio files and also a dataset with only 500 audio files, but after training for 100,000 iterations, the output from the sampling pipeline with the trained network was meaningless.

Could you please tell me if there is anything wrong? Or have you trained something meaningful?

Thanks.

Hi @ethanyhzhang, I am also about halfway through unconditional training, and the audio files generated at each epoch sound like white noise. Did you have the same problem? How did you fix it? Please share. Thanks.

a897456 commented 5 months ago

FYI, the training loss dropped from 20+ at the beginning to about 0.3 at the end.

Hi @ethanyhzhang, my initial loss was 0.2, and after 10k steps it was 0.3. Also, I don't know how to use the .pt file that is generated every epoch; it contains 6 items (screenshots attached).
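
On using the `.pt` file: a checkpoint written with `torch.save` is often a plain dictionary, so a reasonable first step is to load it on CPU and list what each top-level entry holds. This is a minimal sketch that assumes the file loads to a dict. The helper names and the example keys are invented for illustration, and the real key names depend on the Trainer.

```python
from typing import Any, Dict, Mapping


def summarize_checkpoint(ckpt: Mapping[str, Any]) -> Dict[str, str]:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in ckpt.items():
        if isinstance(value, Mapping):
            summary[key] = f"dict with {len(value)} entries"
        else:
            summary[key] = type(value).__name__
    return summary


def load_and_summarize(path: str) -> Dict[str, str]:
    """Load a checkpoint on CPU and summarize it (requires PyTorch)."""
    import torch  # imported lazily so summarize_checkpoint works without torch
    ckpt = torch.load(path, map_location="cpu")
    return summarize_checkpoint(ckpt)


# A hand-made dict standing in for a loaded checkpoint; these keys are made up.
example = {"step": 5, "model": {"w": 0, "b": 0}, "version": "0.0.0"}
example_summary = summarize_checkpoint(example)
```

Calling `load_and_summarize` on the saved file should show which of the 6 items is the model state dict and which are optimizer or step bookkeeping. Depending on the PyTorch version, `torch.load` may need `weights_only=False` for checkpoints that hold non-tensor objects; only use that on files you trust.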