openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility, based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism.
Apache License 2.0

variance model onnx exporter #117

Closed: blueyred closed this 11 months ago

blueyred commented 11 months ago

Hi, thanks for the recent updates, amazing work!

I'm testing the ONNX exporters for the variance models. It's likely I'm misunderstanding the flow for the variance models, but I have an issue: I've exported the three variance models without problems, and the linguistic and duration models are working as expected. However, the pitch predictor won't accept my input.

Should the output of the duration predictor variance model (ph_dur_pred) be fed as "ph_dur" to the predict_pitch variance model?

The input ph_dur for predict_pitch is defined as a torch.LongTensor; in my test I saw int64[1,n_tokens]:

https://github.com/openvpi/DiffSinger/blob/main/deployment/exporters/variance_exporter.py#L146

In my test, the output ph_dur_pred from the duration predictor was float32[1,n_tokens].

Is it possible that the ph_dur used for tracing the pitch model is the wrong type?

thanks

yqzhishen commented 11 months ago

The duration predictor outputs float32 because ph_dur_pred is only the raw phoneme duration output. Neural networks aren't 100% accurate, and most of the time ph_dur_pred needs to be force-aligned to the word boundaries. The current ONNX graph contains no alignment procedures, and I am talking to OpenUTAU developers about whether to integrate this procedure into ONNX (faster) or let OpenUTAU do it (sometimes more reasonable). In any case, we must manually ensure that ph_dur sums to n_frames, and using float32 preserves the precision needed for this.

Should the output of the duration predictor variance model (ph_dur_pred) be fed as "ph_dur" to the predict_pitch variance model?

Yes, they are actually the same thing.
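
For illustration, here is a minimal Python sketch of such an alignment step (the function and its word-boundary inputs word_div/word_dur are assumptions made for the sketch, not the exporter's actual interface; in this repo, the RhythmRegulator mentioned below plays this role):

```python
import numpy as np

def align_ph_dur(ph_dur_pred, word_div, word_dur):
    """Hypothetical sketch: snap raw float phoneme durations (in frames) to
    integers so that the phonemes of each word sum exactly to that word's
    duration. word_div[i] = phoneme count of word i; word_dur[i] = target
    frame count of word i (with sum(word_dur) == n_frames)."""
    ph_dur, start = [], 0
    for n_ph, target in zip(word_div, word_dur):
        group = np.asarray(ph_dur_pred[start:start + n_ph], dtype=np.float64)
        # Rescale the group so its total hits the word boundary exactly.
        group *= target / max(group.sum(), 1e-6)
        # Round down, then push the rounding residual onto the last phoneme.
        rounded = np.floor(group).astype(np.int64)
        rounded[-1] += target - rounded.sum()
        ph_dur.extend(rounded.tolist())
        start += n_ph
    return np.asarray(ph_dur, dtype=np.int64)  # shape [n_tokens], sums to n_frames
```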

blueyred commented 11 months ago

Ah, thanks, I'd missed the RhythmRegulator step. I can confirm these are all now working for me in C#. Thanks for clearing that up. :pray:

MrDiplodocus commented 11 months ago

@yqzhishen, @blueyred, I have successfully run inference on linguistic.onnx and dur.onnx, and now I want to predict pitch with pitch.onnx. I got the names of the inputs and outputs for this model:

inputs => ['encoder_out', 'ph_dur', 'note_midi', 'note_dur', 'pitch', 'retake', 'speedup']
outputs => ['pitch_pred']

To get pitch_pred, I need to put something into pitch and retake. Can you please explain what needs to go there?

blueyred commented 11 months ago

@MrDiplodocus
encoder_out is one of the outputs from the linguistic model.
ph_dur is the prediction from the duration model.
note_midi is the note MIDI numbers, as floats.
note_dur is the note durations, in frames.
pitch is a list of MIDI notes.
retake - I just fill with zeros.
speedup I set to 10.

Here is a list with the tensor sizes, which should help you get your data in order.

INPUTS
name: encoder_out   type: float32[1,n_tokens,256]
name: ph_dur        type: int64[1,n_tokens]
name: note_midi     type: float32[1,n_notes]
name: note_dur      type: int64[1,n_notes]
name: pitch         type: float32[1,n_frames]
name: retake        type: boolean[1,n_frames]
name: speedup       type: int64 (scalar)

OUTPUTS
name: pitch_pred    type: float32[1,n_frames]
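
Putting the spec above together, here is a rough Python/onnxruntime sketch of one call to pitch.onnx (the model path and all dummy values are assumptions; in practice encoder_out comes from linguistic.onnx and ph_dur from the duration model):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("pitch.onnx")  # model path is an assumption

n_tokens, n_notes, n_frames = 10, 4, 331  # example sizes only

# Durations must sum exactly to n_frames (see the discussion above).
ph_dur = np.full((1, n_tokens), n_frames // n_tokens, dtype=np.int64)
ph_dur[0, -1] += n_frames - ph_dur.sum()
note_dur = np.full((1, n_notes), n_frames // n_notes, dtype=np.int64)
note_dur[0, -1] += n_frames - note_dur.sum()

inputs = {
    "encoder_out": np.zeros((1, n_tokens, 256), np.float32),  # from linguistic.onnx in practice
    "ph_dur": ph_dur,
    "note_midi": np.full((1, n_notes), 60.0, np.float32),     # MIDI note numbers
    "note_dur": note_dur,
    "pitch": np.zeros((1, n_frames), np.float32),             # ignored where retake == 1
    "retake": np.ones((1, n_frames), dtype=bool),             # generate a new curve everywhere
    "speedup": np.array(10, dtype=np.int64),                  # scalar acceleration factor
}

pitch_pred = session.run(["pitch_pred"], inputs)[0]  # float32, shape [1, n_frames]
```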

blueyred commented 11 months ago

@MrDiplodocus reading through modules/fastspeech/tts_modules.py helped me work through the data flow.

yqzhishen commented 11 months ago

pitch should be your preset pitch curve on the frames where retake == 0; it can be anything on the frames where retake == 1. retake should be 1 on all frames where you want to generate a new pitch curve. These two values exist for the local retaking mechanism; if you don't need that, fill retake with ones.
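
For example, to regenerate pitch only on frames 100-200 while keeping a preset curve elsewhere, a sketch (the preset values here are placeholders):

```python
import numpy as np

n_frames = 331
preset_curve = np.full(n_frames, 69.0, dtype=np.float32)  # placeholder: a flat A4 curve

retake = np.zeros((1, n_frames), dtype=bool)
retake[0, 100:200] = True                   # generate a new pitch curve only here

pitch = preset_curve.reshape(1, n_frames)   # kept where retake == 0, ignored where retake == 1
```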

retake - I just fill with zeros

@blueyred have you made a mistake here?

blueyred commented 11 months ago

@yqzhishen you're completely correct (my mistake!), I was scan-reading my code too quickly; it is all ones.

MrDiplodocus commented 11 months ago

Thanks, it's clear now, but I still get an error when running inference with the pitch.onnx model:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Add node. Name:'/pre/Add' Status Message: /pre/Add: right operand cannot broadcast on dim 1 
LeftShape: {1,332,256}, RightShape: {1,331,256}

n_frames=331, while sum(dur_pred_align)=332. I thought that was the problem, but when I traced scripts/infer.py variance, I got exactly the same variable values and dimensions, and inference succeeded there; yet the same inference through ONNX does not work.

yqzhishen commented 11 months ago

@MrDiplodocus The Python inference code already includes forced shape alignment for these variables. The ONNX models do not contain it, so you have to do it manually.
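
For example, one manual fix is to absorb the one-frame difference into the last phoneme duration (an illustrative choice, not necessarily the exact alignment scripts/infer.py applies):

```python
import numpy as np

def fix_ph_dur_sum(ph_dur: np.ndarray, n_frames: int) -> np.ndarray:
    """Clip or pad the last phoneme duration so that sum(ph_dur) == n_frames."""
    ph_dur = ph_dur.astype(np.int64).copy()
    ph_dur[0, -1] += n_frames - int(ph_dur.sum())
    assert ph_dur[0, -1] > 0, "mismatch too large to absorb in the last phoneme"
    return ph_dur
```

With sum(ph_dur)=332 and n_frames=331 as in the error above, this shaves one frame off the last phoneme so both operands of the failing Add node have matching shapes.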

MrDiplodocus commented 11 months ago

@yqzhishen thank you very much!