So it looks like newer-generation DiffSinger models now include a linguistic model that takes tokens, word divisions, and word durations as input and outputs encoder_out and x_masks, which are then fed to the duration.onnx model.
Example below (please tell me if the zeroes are needed in this example):
results = linguistic_model.run(None, {
    "tokens": [[26, 1, 22, 35, 11]],
    "word_div": [[3, 2, 0, 0, 0]],
    "word_dur": [[48, 24, 0, 0, 0]],
})
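For what it's worth, here is a minimal sketch of how these inputs could be sanity-checked before calling the model. It rests on an assumption that this thread does not confirm: that word_div[i] is the number of phoneme tokens in word i and word_dur[i] is that word's length in frames, so the word_div values should sum to the number of tokens. Under that reading, the trailing zeroes would only be padding and could be trimmed for a single utterance. The helper name build_linguistic_inputs is hypothetical and not part of DiffSinger.

```python
def build_linguistic_inputs(tokens, word_div, word_dur):
    """Validate and trim per-word inputs for a single utterance.

    Assumption (not confirmed in this thread): word_div[i] is the number of
    phoneme tokens in word i, and word_dur[i] is that word's length in frames.
    """
    if len(word_div) != len(word_dur):
        raise ValueError("word_div and word_dur must have the same length")

    # Drop trailing (0, 0) pairs, treating them as padding.
    words = list(zip(word_div, word_dur))
    while words and words[-1] == (0, 0):
        words.pop()

    div = [d for d, _ in words]
    dur = [t for _, t in words]

    # Under the assumption above, every token belongs to exactly one word.
    if sum(div) != len(tokens):
        raise ValueError(
            f"word_div sums to {sum(div)} but there are {len(tokens)} tokens"
        )

    # Add a batch dimension of size 1, matching the shapes in the example.
    return {"tokens": [list(tokens)], "word_div": [div], "word_dur": [dur]}


inputs = build_linguistic_inputs(
    [26, 1, 22, 35, 11], [3, 2, 0, 0, 0], [48, 24, 0, 0, 0]
)
print(inputs)
```

Before passing these to linguistic_model.run, each value would probably need to be converted to a NumPy integer array (for example with np.asarray(value, dtype=np.int64)). The int64 dtype is an assumption; the expected names, types, and shapes can be checked with linguistic_model.get_inputs().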
This project is now deprecated. You can use OpenUTAU for DiffSinger to synthesize with ONNX models. In any case, this is only a simple demo project, and you can easily extend it or even rewrite it.
Happy to get your thoughts, thank you!