openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability, and flexibility, based on "DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism"
Apache License 2.0

Notes below C3 are breathy #73

Closed · hilmiyafia closed this issue 1 year ago

hilmiyafia commented 1 year ago

Hello, I have trained a model. It sounds amazing on notes above C3, but notes below C3 are very breathy. When I checked on TensorBoard, the ground-truth samples have become breathy as well, even though the original wav files are not. I think there's something wrong in the binarization code or the vocoder.

yqzhishen commented 1 year ago

Since the TensorBoard ground-truth samples are produced by copy-synthesis through the vocoder, it is most likely the vocoder that breaks the samples. You may perform vocoder-only inference using inference/vocoder/val_nsf_hifigan.py.

As for the root cause, I suspect the f0 extractor, and you may verify this by switching between parselmouth and torchcrepe. The torchcrepe method is not supported in the binarizer yet, but it is usable in the script mentioned above.
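(For concreteness, here is a minimal sketch of comparing the two f0 extractors on one suspect clip. The file name and parameter values are illustrative assumptions, not the binarizer's defaults.)

```python
# Compare parselmouth and torchcrepe f0 estimates on a single clip.
# Illustrative parameters only; not the binarizer's actual defaults.
import numpy as np
import parselmouth
import torch
import torchcrepe

WAV_PATH = "LOW.wav"   # hypothetical sample file
HOP_SECONDS = 0.01     # 10 ms analysis frames

# parselmouth (Praat autocorrelation pitch tracker)
snd = parselmouth.Sound(WAV_PATH)
pitch = snd.to_pitch_ac(time_step=HOP_SECONDS,
                        pitch_floor=40.0, pitch_ceiling=800.0)
f0_praat = pitch.selected_array["frequency"]   # 0.0 on unvoiced frames

# torchcrepe (neural pitch tracker)
audio, sr = torchcrepe.load.audio(WAV_PATH)
f0_crepe = torchcrepe.predict(
    audio, sr, hop_length=int(sr * HOP_SECONDS),
    fmin=40.0, fmax=800.0, model="full",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# A large gap between the two medians on a low-pitched clip (C3 is about
# 130.8 Hz) suggests one extractor is making octave or voicing errors.
voiced = f0_praat > 0
print(f"parselmouth median voiced f0: {np.median(f0_praat[voiced]):.1f} Hz")
print(f"torchcrepe median f0:         {np.median(f0_crepe.cpu().numpy()):.1f} Hz")
```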

Performance of the model also depends on your pitch distribution. Noise, reverb, and some extreme singing styles may also break the f0 extraction. If you don't have much data covering that low pitch range, these errors will mislead the model, and it is not likely to sing well there either.

If you are willing to share, you may post some useful material here.

hilmiyafia commented 1 year ago

@yqzhishen Thank you for the reply. I've attached the samples here: Samples.zip

The data is normalized to around -24 dB, so you may need to increase the volume to hear it.
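(A quick way to check that level and make the clips louder for listening is sketched below, assuming the -24 dB figure refers to integrated loudness and using pyloudnorm; adjust if it was peak- or RMS-based instead.)

```python
# Measure a clip's integrated loudness and normalize it to a target level.
import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("LOW.wav")          # hypothetical sample file
meter = pyln.Meter(rate)                 # BS.1770 loudness meter
loudness = meter.integrated_loudness(data)
print(f"integrated loudness: {loudness:.1f} LUFS")

# Bring the clip up to -16 LUFS for easier listening.
louder = pyln.normalize.loudness(data, loudness, -16.0)
sf.write("LOW_louder.wav", louder, rate)
```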

I included 3 samples, named by prefix:

- LOW: the notes are low, and you can hear a lot of breathiness.
- MEDIUM: breathiness appears on some of the low notes.
- HIGH: there is no breathiness.

The binarization process reports 4769.98 seconds of training data (about 1 hour 20 minutes) and 40 seconds of validation data. Here is the pitch distribution plot:

[figure: pitch_distribution plot of the dataset]
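(For reference, a plot like this can be reproduced with a short sketch along these lines. This is not the repository's own plotting code, and the input file is a hypothetical dump of voiced-frame f0 values, e.g. collected with one of the extractors sketched above.)

```python
# Histogram a dataset's pitch distribution by converting f0 (Hz) to MIDI notes.
import numpy as np
import matplotlib.pyplot as plt

def hz_to_midi(f0: np.ndarray) -> np.ndarray:
    """Convert frequencies in Hz to (fractional) MIDI note numbers (A4 = 69 = 440 Hz)."""
    return 69.0 + 12.0 * np.log2(f0 / 440.0)

f0_frames = np.load("all_voiced_f0.npy")  # hypothetical dump of voiced-frame f0

midi = hz_to_midi(f0_frames)
plt.hist(midi, bins=np.arange(24, 84))        # C1 (MIDI 24) .. B5 (MIDI 83)
plt.axvline(48, color="r", ls="--", label="C3 (MIDI 48)")
plt.xlabel("MIDI note")
plt.ylabel("frame count")
plt.legend()
plt.savefig("pitch_distribution.png")
```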

yqzhishen commented 1 year ago

The audio quality of your dataset does not seem ideal, though. The vocoder is trained on a large amount of high-quality singing data, so it can indeed fail on your data. How many steps have you trained your model for, and what is the batch size?

I will look into this issue further once I have time.

hilmiyafia commented 1 year ago

@yqzhishen Do you have any example where the vocoder successfully copy-synthesizes high-quality singing data with low notes, around C2-G2 (roughly 65-98 Hz)? I think such an example would help users know what to aim for.

My DiffSinger model is trained for 77,000 steps. I don't remember the batch size because it changes; I think it is calculated automatically. But I doubt the DiffSinger training has any effect here, since the ground truth itself is broken. As you said, it is the vocoder that is not suitable for my data. The only solution I can think of is fine-tuning the vocoder. Do you know if there is any guide on how to re-train NSF-HiFiGAN on my data?

yqzhishen commented 1 year ago

NSF-HiFiGAN is a heavyweight vocoder that is not practical to train personally. In my experience, the current pretrained vocoder performs well in most cases, including on low-pitch data, but it sometimes fails. If you really want a custom vocoder, you may train a DDSP vocoder following the README and instructions in the yxlllc/pc-ddsp repository.

hilmiyafia commented 1 year ago

@yqzhishen Oh! It's okay, I found an NSF-HiFiGAN model trained by Fish Audio. It performs better on the low notes. Thank you 😊