openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

ONNX Inference Scripts Documentation #198

Open PeterFavero opened 1 week ago

PeterFavero commented 1 week ago

Hello,

I'm interested in running command-line inference using the .ckpt files of the model I trained. I've read the instructions under Inference in docs/GettingStarted.md and the --help output of the relevant inference scripts (specifically python scripts/infer.py variance --help and python scripts/infer.py acoustic --help), but I still don't fully understand how .ds files work or, less importantly, the details of some of the infer.py parameters. I largely understand what each parameter controls, but I'd like to configure --num, --key, --expr, and --step based on a more precise understanding of what they actually do, along with general best practices for them. As far as I can tell, there is no thorough documentation on either topic here. The .ds docs may be out of scope for this repo (I looked briefly at the original OpenUtau repo and the recommended fork for OpenUtau with DiffSinger, but didn't find anything), but do you know where I could find docs on both topics to reference for my project?

Thank you, Peter

hrukalive commented 1 week ago

.ds files are just JSON in disguise; you can open them with any text editor. The structure inside is intuitive, so I won't explain it here, but please follow up if you have further questions. To do inference using the CLI, you will most likely perform variance inference and then acoustic inference.

Variance inference adds new fields to each "sentence" in the .ds file, such as breathiness, voicing, or any other feature enabled in your checkpoint. You have to use the variance checkpoint to infer every parameter the acoustic model requires. The output of this inference step is a new .ds file.
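Because a .ds file is plain JSON, the effect of variance inference can be checked by comparing the fields of each sentence before and after the step. The following is a minimal sketch, not the repo's own tooling: it assumes the file is a JSON list with one object per sentence, and the two inline samples and their field names (offset, ph_seq, ph_dur, f0_seq, f0_timestep) are hypothetical, heavily trimmed stand-ins for real files.

```python
import json

# Hypothetical, heavily trimmed .ds contents: a JSON list with one object per
# sentence. The field names are assumptions for illustration; check a file
# exported by OpenUtau for the fields your models actually use.
before_text = """
[
  {"offset": 0.0, "ph_seq": "SP k a SP", "ph_dur": "0.2 0.1 0.5 0.2"}
]
"""
after_text = """
[
  {"offset": 0.0, "ph_seq": "SP k a SP", "ph_dur": "0.2 0.1 0.5 0.2",
   "f0_seq": "220.0 220.5 221.0", "f0_timestep": "0.005"}
]
"""


def added_fields(before, after):
    """Return, per sentence, the sorted field names present only in `after`."""
    return [sorted(set(a) - set(b)) for b, a in zip(before, after)]


before = json.loads(before_text)
after = json.loads(after_text)
new_fields = added_fields(before, after)
print(new_fields)
```

With real files, replacing the inline strings with the contents of the input and output .ds files should show which fields a given variance checkpoint contributes.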

Then feed the .ds file from the previous step into acoustic inference to get your .wav file out. Arguments like --key apply a transposition to the .ds file globally, so they have nothing to do with inference quality. --step does control quality, but the recommended number of steps differs depending on whether you used a Rectified Flow model.
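To make the point about transposition concrete: shifting pitch by k semitones multiplies every frequency by 2^(k/12) in equal temperament. The sketch below only illustrates that relation on a list of f0 values; it is not the code infer.py runs, transpose_f0 is a hypothetical helper, and treating 0.0 as "unvoiced" is an assumption.

```python
def transpose_f0(f0_hz, semitones):
    """Shift each voiced f0 value by `semitones`.

    Values of 0.0 are assumed to mark unvoiced frames and are left unchanged.
    """
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0 else 0.0 for f in f0_hz]


# Up one octave (12 semitones) doubles every voiced frequency.
shifted = transpose_f0([220.0, 0.0, 440.0], 12)
print(shifted)  # [440.0, 0.0, 880.0]
```

The shift changes which notes are sung, not how well they are rendered, which is why it is independent of the quality controlled by the step count.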

PeterFavero commented 1 week ago

Hi there,

Thank you so much for the reply!

I really appreciate the insight, and this is my bad for not mentioning this earlier, but I think it would help if I elaborated a bit on my use case. Given some music and phonemic data, not necessarily in .ds format, about sung audio over some time interval (e.g., MIDI or an f0 spectrum, plus the set of phonemes I want the model to sing, with start and end times for vowel phonemes already set in stone), I want to be able to generate audio of my DiffSinger model (composed of a duration-only variance model (predict_dur : true, predict_pitch : false), a pitch-only variance model (predict_dur : false, predict_pitch : true), and an acoustic model w/ energy, breathiness, tension, and voicing all disabled for now) singing that audio using only code and/or CLI commands. Some examples in increasing order of complexity include:

I'd already gotten .ds data for several .wavs in my training dataset from OpenUtau and examined them in VS Code, and while I could understand what each field meant without any issues, like you said, I was asking for more granular details. Accomplishing my use case using CLI commands would likely mean generating a .ds file from scratch, or editing one procedurally after exporting it from OpenUtau, and then generating specific fields with my two variance models (I'm not sure in which order to apply them) before running the .ds file through the acoustic model, possibly with some intermediate editing as well. Because I didn't know exactly how I'd implement this process, which seemed fairly complicated and error-prone, I wanted a more thorough spec ** for how the .ds files and inference scripts work: early on, even when I changed .ds files only slightly from what I got from OpenUtau, I was getting a lot of difficult-to-understand errors. This is also my first time training a DiffSinger model, so that didn't help either.

However, I then looked at the scripts in this repo a little more closely and noticed that deployment/benchmarks/acoustic.py does something relatively similar to my use case using an ONNX runtime, which I'm a bit more familiar with than the infer.py script and the .ds file type. Additionally, I'm thinking of creating a simple web app for my musician friends and me to use with this model. That would involve cloud-deploying it, which .onnx is much better suited for in terms of efficiency and dependency configuration. Apologies for the long and semi-tangential thread, but is it possible to use ONNX runtimes of my three models to accomplish these goals? I understand I'm reimplementing in Python a lot of what OpenUtau does under the hood, but I wanted to get your expertise on the DiffSinger-related portions of my task as I keep working on it.

* I wouldn't use any copyrighted material; I'm just using these songs for the sake of example.

** Absolutely no pressure, as I'm more interested in using ONNX at this point, but my previous questions included: Is there a comprehensive list of possible .ds fields somewhere, similar to ConfigurationSchemas.md but for .ds files? Which fields are required, optional, or forbidden for variance/acoustic models depending on different configs? Which fields does inference with variance/acoustic models add or change depending on configs and infer.py arguments? Where and how do I properly get or construct the data for the original my_song.ds file for different use cases?

PeterFavero commented 1 day ago

Hello again,

Sorry to bother you, but is there any direction you can point me in to help solve this? I've exported my models to ONNX after you uploaded requirements-onnx.txt (thank you for that, by the way, that was extremely helpful), but now I'm trying to load the appropriate models into a Colab session and running into some issues. When I load my duration model, it works fine (as in, the InferenceSession loads without error), but whenever I run either of the following:

VARPITCH_SESSION = ort.InferenceSession("onnx/varpitch/variance_pitch_v2.pitch.onnx")
ACOUSTIC_SESSION = ort.InferenceSession("onnx/acoustic/acoustic_v1.onnx")

No output is printed, and the session terminates with the error message "your session restarted after a crash." Running this locally, I get a segmentation fault with no further explanation. Do you have any advice on how to fix this?
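One general way to get more information out of a crash like this is to attempt the load in a child process with Python's faulthandler enabled, so a segmentation fault kills only the child and leaves a traceback on stderr. This is a diagnostic sketch, not a fix: run_isolated and try_load are hypothetical helper names, and forcing CPUExecutionProvider is an assumption made to rule out accelerator-specific problems.

```python
import subprocess
import sys


def run_isolated(code, timeout=300):
    """Run `code` in a fresh interpreter with faulthandler enabled.

    A crash in native code then kills only the child process, and the
    Python-level traceback at the time of the crash is written to stderr.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def try_load(model_path):
    """Try to create an onnxruntime session for `model_path` in a child process."""
    code = (
        "import onnxruntime as ort\n"
        f"ort.InferenceSession({model_path!r}, providers=['CPUExecutionProvider'])\n"
        "print('loaded')\n"
    )
    return run_isolated(code)
```

Calling try_load("onnx/acoustic/acoustic_v1.onnx") and then printing the result's returncode and stderr would show whether the load succeeded; on POSIX systems a negative return code means the child was killed by a signal, such as a segmentation fault.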