openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

Strange humming sound during `SP` & `AP` #179

Closed: loct824 closed this issue 2 months ago

loct824 commented 3 months ago

Hi,

We trained an English model with DiffSinger, but in the synthesized songs, wherever SP and AP occur in the middle of a song, the model produces a strange voiced sound, as if the singer were humming a constant tone.

In the example below, arrows mark where the humming occurs.

Could you advise us on how to improve the model or its training to eliminate this humming during breaks and silences?

'phonemes': [{'name': 'SP', 'duration': 1.3062181},
  {'name': 'AP', 'duration': 0.255292},
  {'name': 'sh', 'duration': 0.1184899},
  {'name': 'uh', 'duration': 0.1555967},
  {'name': 'dx', 'duration': 0.0234931},
  {'name': 'ax', 'duration': 0.075178},
  {'name': 'b', 'duration': 0.0830091},
  {'name': 'ih', 'duration': 0.1427231},
  {'name': 'n', 'duration': 0.06},
  {'name': 's', 'duration': 0.12},
  {'name': 't', 'duration': 0.05},
  {'name': 'r', 'duration': 0.05},
  {'name': 'ao', 'duration': 0.26},
  {'name': 'ng', 'duration': 0.17},
  {'name': 'y', 'duration': 0.05},
  {'name': 'ae', 'duration': 0.23},
  {'name': 'q', 'duration': 0.1454828},
  {'name': 'ay', 'duration': 0.1745172},
  {'name': 'l', 'duration': 0.2},
  {'name': 'ay', 'duration': 0.5},
  {'name': 'd', 'duration': 0.09},
  {'name': 'AP', 'duration': 0.22},
  {'name': 'n', 'duration': 0.0799999},
  {'name': 'ow', 'duration': 0.1300001},
  {'name': 'b', 'duration': 0.04},
  {'name': 'ah', 'duration': 0.1566115},
  {'name': 'dx', 'duration': 0.0233885},
  {'name': 'iy', 'duration': 0.22},
  {'name': 'g', 'duration': 0.16},
  {'name': 'eh', 'duration': 0.2},
  {'name': 't', 'duration': 0.0699999},
  {'name': 's', 'duration': 0.2000001},
  {'name': 'm', 'duration': 0.08},
  {'name': 'iy', 'duration': 0.5209856},
  {'name': 'l', 'duration': 0.0690144},
  {'name': 'ay', 'duration': 0.57},
  {'name': 'k', 'duration': 0.1694275},
  {'name': 'AP', 'duration': 0.2605725},
  {'name': 'y', 'duration': 0.13},
  {'name': 'uw', 'duration': 0.3566036},
  {'name': 'uw', 'duration': 0.7538354},
  {'name': 'SP', 'duration': 1.3531745},  # <---------------------
  {'name': 'AP', 'duration': 0.2956739},  # <---------------------
  {'name': 'k', 'duration': 0.0907126},
  {'name': 'uh', 'duration': 0.1397},
  {'name': 'dx', 'duration': 0.0203},
  {'name': 'ax', 'duration': 0.09},
  {'name': 'ng', 'duration': 0.06},
  {'name': 'k', 'duration': 0.06},

yqzhishen commented 3 months ago

This may be a labeling issue. If the AP and SP regions are not labeled accurately in the training data, the model may produce sound on these two phonemes.
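One rough way to look for inaccurate SP labels in the training audio is to measure how loud each labeled SP interval actually is. The sketch below is only an illustration under stated assumptions: the phoneme list has the `{'name', 'duration'}` shape shown in the issue, the audio is already loaded as a list of floats in [-1, 1], and the helper names and the -40 dB threshold are hypothetical, not part of DiffSinger. AP intervals contain breath noise by design, so an energy check like this only makes sense for SP.

```python
import math

def phoneme_intervals(phonemes):
    """Turn [{'name', 'duration'}, ...] into (name, start, end) tuples in seconds."""
    intervals, t = [], 0.0
    for ph in phonemes:
        intervals.append((ph["name"], t, t + ph["duration"]))
        t += ph["duration"]
    return intervals

def rms_db(samples):
    """RMS level in dB relative to full scale; -inf for digital silence."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")

def loud_silences(phonemes, samples, sample_rate, threshold_db=-40.0):
    """Return (start, end, level_db) for SP intervals louder than threshold_db."""
    flagged = []
    for name, start, end in phoneme_intervals(phonemes):
        if name != "SP":
            continue
        chunk = samples[int(start * sample_rate):int(end * sample_rate)]
        level = rms_db(chunk)
        if level > threshold_db:
            flagged.append((start, end, level))
    return flagged

# Synthetic demo: the second "SP" actually contains a 100 Hz tone.
sr = 1000
demo_phonemes = [
    {"name": "SP", "duration": 0.5},
    {"name": "uw", "duration": 0.5},
    {"name": "SP", "duration": 0.5},
]
tone = [0.5 * math.sin(2 * math.pi * 100 * n / sr) for n in range(sr // 2)]
demo_samples = [0.0] * 500 + tone + tone
flagged = loud_silences(demo_phonemes, demo_samples, sr)
print(flagged)
```

Intervals flagged this way are candidates for manual inspection, not proof of a bad label; a suitable threshold depends on the noise floor of the recordings.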

loct824 commented 3 months ago

Do you mean that it relates to the quality of transcriptions.csv, i.e. whether each labeled phoneme correctly corresponds to its part of the audio? Is there any way to improve this other than manually refining the phoneme time positions? Thanks.
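Before any manual refinement, a quick sanity pass over the labels could narrow down which clips to inspect first. The following is a minimal sketch, assuming that transcriptions.csv has `name`, `ph_seq` and `ph_dur` columns with space-separated values; that layout, the function name and the one-second limit are assumptions made for illustration. It reports rows whose phoneme and duration counts disagree, and unusually long SP/AP entries, which are plausible places for unlabeled sound to hide.

```python
import csv
import io

def check_transcriptions(csv_text, max_pause=1.0):
    """Return (clip_name, message) pairs for rows that deserve a closer look."""
    problems = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        names = row["ph_seq"].split()
        durs = [float(d) for d in row["ph_dur"].split()]
        if len(names) != len(durs):
            # Phoneme and duration counts must match one-to-one.
            problems.append((row["name"], "length mismatch"))
            continue
        for ph, dur in zip(names, durs):
            if ph in ("SP", "AP") and dur > max_pause:
                problems.append((row["name"], f"long {ph}: {dur:.2f}s"))
    return problems

# Made-up rows in the assumed format.
demo_csv = (
    "name,ph_seq,ph_dur\n"
    "clip1,SP y uw SP,0.3 0.13 0.36 1.35\n"
    "clip2,AP k uh,0.3 0.09\n"
)
report = check_transcriptions(demo_csv)
print(report)
```

A long pause is not wrong in itself; the check only points at segments worth listening to against their labels.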

yqzhishen commented 3 months ago

If you enabled some variance parameters, controlling them can serve as a workaround. On the training side, though, I cannot give more advice without further information.
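To illustrate what controlling a parameter during pauses might look like, the sketch below builds a per-frame mask from the phoneme durations and clamps a curve to a floor value wherever the mask marks a pause. Everything here is hypothetical: the curve is a plain list of values sampled at a fixed timestep, the function names are invented, and nothing is implied about how DiffSinger stores or reads its variance parameters, or about which parameter would suppress the hum.

```python
def silence_mask(phonemes, timestep, silent_names=("SP",)):
    """Per-frame booleans: True where the frame start lies in a silent phoneme."""
    ends, t = [], 0.0
    for ph in phonemes:
        t += ph["duration"]
        ends.append(t)
    n_frames = int(round(t / timestep))
    mask, idx = [], 0
    for i in range(n_frames):
        time = i * timestep
        # Advance to the phoneme that contains this frame's start time.
        while idx < len(ends) - 1 and time >= ends[idx]:
            idx += 1
        mask.append(phonemes[idx]["name"] in silent_names)
    return mask

def clamp_curve(curve, mask, floor):
    """Lower the curve to at most `floor` on masked frames; leave others as-is."""
    return [min(v, floor) if m else v for v, m in zip(curve, mask)]

# Demo with a made-up curve in dB, one value every 0.25 s.
demo_phonemes = [
    {"name": "uw", "duration": 0.25},
    {"name": "SP", "duration": 0.5},
    {"name": "k", "duration": 0.25},
]
mask = silence_mask(demo_phonemes, timestep=0.25)
clamped = clamp_curve([-20.0, -25.0, -30.0, -22.0], mask, floor=-60.0)
print(mask, clamped)
```

Passing `silent_names=("SP", "AP")` would treat breaths the same way, at the cost of also flattening the breath sounds.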