ONNX Inference Scripts Documentation

Hello,

I'm interested in running command-line inference using the .ckpt files of the model I trained. I've read the instructions under Inference in docs/GettingStarted.md and the --help output of the relevant inference scripts (specifically python scripts/infer.py variance --help and python scripts/infer.py acoustic --help), but I still don't fully understand how .ds files work or, less importantly, the details of some of the infer.py parameters. I largely understand what each parameter controls, but I'd like to know how to configure --num, --key, --expr, and --step based on a more precise understanding of what they actually do, along with general best practices for them. I couldn't find thorough documentation on either topic. The .ds docs may be out of scope for this repo (I looked briefly at the original OpenUtau repo and the recommended fork of OpenUtau for DiffSinger, but didn't find anything), but do you know where I could find docs on both to reference for my project?

Thank you, Peter

.ds files are just JSON in disguise; you can open them with any text editor. The structure inside is intuitive, so I won't explain it here, but please follow up if you have further questions. To do inference using the CLI, you will most likely perform variance inference and then acoustic inference.
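Since a .ds file is plain JSON, a quick way to see what a given file contains is to load it and list the keys of each sentence. The sketch below is only an illustration: it assumes the file holds a JSON list of sentence objects (and tolerates a single object), the helper names parse_ds, load_ds, and summarize are hypothetical, and my_song.ds is a placeholder path.

```python
import json
from pathlib import Path


def parse_ds(text):
    """Parse the text of a .ds file into a list of sentence dicts."""
    data = json.loads(text)
    # Assumption: the top level is a list of sentences; wrap a lone object.
    return data if isinstance(data, list) else [data]


def load_ds(path):
    """Read a .ds file from disk and return its sentences."""
    return parse_ds(Path(path).read_text(encoding="utf-8"))


def summarize(sentences):
    """Return the sorted field names of every sentence, in order."""
    return [sorted(sentence.keys()) for sentence in sentences]


# Example usage (placeholder path):
#   for i, keys in enumerate(summarize(load_ds("my_song.ds"))):
#       print(i, keys)
```

Running this on a file before and after variance inference is one way to see which fields that step added.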

Variance inference adds new fields to each "sentence" in the .ds file, such as breath, voicing, or any other feature enabled in your checkpoint. You have to use the variance checkpoint to infer every parameter the acoustic model requires. The output of this inference step is a new .ds file.

Then feed the .ds file from the previous step into acoustic inference to get your .wav file out. Arguments like --key are transpositions applied to the .ds file globally, so they have nothing to do with inference quality. --step does control quality, but the recommended number of steps differs depending on whether you used a Rectified Flow model.
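To make the point about --key concrete, here is a minimal sketch of what a global transposition does to pitch values. It assumes the key shift is counted in semitones under equal temperament; the thread does not show how infer.py implements it, and the function name transpose_hz is hypothetical.

```python
def transpose_hz(f0_hz, semitones):
    """Shift a frequency by a number of equal-tempered semitones.

    Assumption: one semitone multiplies the frequency by 2 ** (1 / 12).
    """
    return f0_hz * 2.0 ** (semitones / 12.0)


def transpose_curve(f0_curve_hz, semitones):
    """Apply the same shift to every point of an f0 curve."""
    return [transpose_hz(f, semitones) for f in f0_curve_hz]


# Shifting A4 (440 Hz) up one octave (12 semitones) doubles the frequency.
shifted = transpose_curve([440.0, 440.0], 12)
```

Every point moves by the same ratio, so the shape of the curve is unchanged; this is why a transposition is a musical choice rather than a quality setting.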

Hi there,

Thank you so much for the reply!

I really appreciate the insight, and it's my bad for not mentioning this earlier, but I think it would help if I elaborated a bit on my use case. Given some music and phonemic data, not necessarily in .ds format, about sung audio over some time interval (e.g., MIDI or an f0 sequence, plus the set of phonemes I want the model to sing, with start and end times for vowel phonemes already set in stone), I want to be able to generate audio of my DiffSinger singing it using only code and/or CLI commands. My DiffSinger is composed of a duration-only variance model (predict_dur : true, predict_pitch : false), a pitch-only variance model (predict_dur : false, predict_pitch : true), and an acoustic model with energy, breathiness, tension, and voicing all disabled for now. Some examples, in increasing order of complexity:

Instruct my DiffSinger to sing the word "red" [r eh d] from t = 0 -> 0.5 sec, with 'eh' held from t = 0.1 -> 0.4 sec, at a uniform f0 of 440 Hz (A4).
Instruct my DiffSinger to sing the word "strength" [s t r eh ng th] from t = 0 -> 0.5 sec, with 'eh' held from t = 0.15 -> 0.35 sec, at a uniform f0 of 440 Hz (A4).
Same as above, but with a changing f0-over-time-sequence.
Instruct my DiffSinger to sing the phrase "And I will always love you" [ax n d ay w ih l ao l w ey z l ah v y uw] or "I am titanium" [ay ae m t ay t ey n iy ax m]: *
- Over a time interval corresponding to the length of that line in the original song (which I have complete waveform data for).
- With all vowels lining up to their intervals in the original song.
- According to an f0-over-time-sequence generated from the song (that I could compute myself using PyWorld or, more likely, export from OpenUtau).
- With appropriate slurs between notes and accounting for the possibilities of notes changing within long-held phonemes, such as the "I" in the first example.
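The first example in the list above ("red" at a constant 440 Hz) can be sketched as data built procedurally. This is only an illustration under stated assumptions: the field names offset, ph_seq, ph_dur, f0_seq, and f0_timestep are modeled on what .ds files exported from OpenUtau tend to contain, the space-separated string encoding and the 0.005 sec timestep are likewise assumptions, and a real file may need further fields (for example, note information) depending on the model configuration. The helper names are hypothetical.

```python
import json


def constant_f0(total_sec, hz, timestep=0.005):
    """Sample a flat f0 curve from 0 to total_sec, endpoints included."""
    n_points = int(round(total_sec / timestep)) + 1
    return [hz] * n_points


def make_sentence(phonemes, boundaries, hz, timestep=0.005):
    """Build one sentence dict from phonemes and their boundary times.

    boundaries holds len(phonemes) + 1 increasing timestamps in seconds.
    """
    if len(boundaries) != len(phonemes) + 1:
        raise ValueError("need exactly one more boundary than phonemes")
    durations = [round(b - a, 6) for a, b in zip(boundaries, boundaries[1:])]
    total = boundaries[-1] - boundaries[0]
    return {
        "offset": boundaries[0],
        "ph_seq": " ".join(phonemes),
        "ph_dur": " ".join(str(d) for d in durations),
        "f0_seq": " ".join(f"{v:.1f}" for v in constant_f0(total, hz, timestep)),
        "f0_timestep": str(timestep),
    }


# "red": r from 0 to 0.1 sec, eh from 0.1 to 0.4 sec, d from 0.4 to 0.5 sec.
red = make_sentence(["r", "eh", "d"], [0.0, 0.1, 0.4, 0.5], 440.0)
ds_text = json.dumps([red], indent=2)  # text that could be saved as a .ds file
```

Here all phoneme boundaries are given by hand. In the use case described above only the vowel boundaries are fixed, so the consonant durations would instead come from the duration model.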

I'd already exported .ds data for several .wavs in my training dataset from OpenUtau and examined them in VS Code. While I could understand what each field meant without any issues, like you said, I was asking for more granular details. Accomplishing my use case with CLI commands would likely mean generating a .ds file from scratch, or editing one procedurally after exporting it from OpenUtau, and then generating specific fields with my two variance models (I'm not sure in which order to apply them) before running the .ds file through the acoustic model, possibly with some intermediate editing as well. Because I didn't know exactly how I'd implement this process, which seemed fairly complicated and error-prone, I wanted a more thorough spec ** for how the .ds files and inference scripts work: early on, even when I changed the .ds files only slightly from what I got from OpenUtau, I was getting a lot of difficult-to-understand errors. This is also my first time training a DiffSinger, so that didn't help either.

However, I then looked at the scripts in this repo more closely and noticed that deployment/benchmarks/acoustic.py does something relatively similar to my use case using an ONNX runtime, which I'm a bit more familiar with than the infer.py script and the .ds file type. Additionally, I'm thinking of creating a simple web app for my musician friends and me to use with this model. That would involve deploying it to the cloud, which .onnx is much better suited for in terms of efficiency and dependency configuration. Apologies for the long and semi-tangential thread, but is it possible to use ONNX runtimes of my three models to accomplish these goals? I understand I'm reimplementing in Python a lot of what OpenUtau does under the hood, but I wanted to get your expertise on the DiffSinger-related portions of my task as I keep working on it.

* I wouldn't use any copyrighted material; I'm just using these songs for the sake of example.
** Absolutely no pressure, as I'm more interested in using ONNX at this point, but my previous questions included: Is there a comprehensive list of possible .ds fields somewhere, similar to ConfigurationSchemas.md but for .ds files? Which fields are required/optional/forbidden for variance/acoustic models depending on different configs? Which fields does inference with variance/acoustic models add or change depending on configs and infer.py arguments? Where and how do I properly get or construct the data for the original my_song.ds file for different use cases?

Hello again,

Sorry to bother you, but is there any direction you can point me in to help solve this? I've exported my models to ONNX after you uploaded requirements-onnx.txt (thank you for that, by the way, that was extremely helpful), but now I'm trying to load the appropriate models into a Colab session and facing some issues. When I load my duration model, it works fine (as in, it loads the InferenceSession without error), but whenever I run either of the following:

VARPITCH_SESSION = ort.InferenceSession("onnx/varpitch/variance_pitch_v2.pitch.onnx")

ACOUSTIC_SESSION = ort.InferenceSession("onnx/acoustic/acoustic_v1.onnx")

No output is printed and the session terminates with the error message "your session restarted after a crash." Running this locally, I get a segmentation fault with no further explanation. Do you have any advice on how to fix this?

To reply to the first post: I am unsure how to answer the question of how to construct a DS file, because the answer is to use an editor such as OpenUtau. I doubt anyone can code a DS file by hand, simply because, for example, it is hard to code an f0 curve.

Once you have the ONNX files, I am not quite sure what the problem is. If you exported them according to the guide (PyTorch 1.13.1 and the requirements file you mentioned), then they should work fine. You can verify whether they were exported correctly using tools such as https://netron.app/. The same tool also shows the input and output arguments of an ONNX model.
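When loading a model crashes the whole interpreter, one way to narrow things down is to do the load in a child process, so the notebook survives and the exit status can be inspected. The sketch below is a generic debugging aid, not something from the DiffSinger repo: run_isolated and describe are hypothetical helpers, the model paths are the ones quoted earlier in the thread, and it assumes onnxruntime is installed in the child interpreter with the CPU execution provider available.

```python
import subprocess
import sys

# Code run in the child: load one model and print its inputs and outputs.
LOAD_SNIPPET = """
import onnxruntime as ort
sess = ort.InferenceSession({path!r}, providers=["CPUExecutionProvider"])
for i in sess.get_inputs():
    print("input ", i.name, i.type, i.shape)
for o in sess.get_outputs():
    print("output", o.name, o.type, o.shape)
"""


def run_isolated(snippet, timeout=300.0):
    """Run Python code in a child interpreter with faulthandler enabled."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(result):
    """Summarize how the child ended (negative codes mean a POSIX signal)."""
    if result.returncode == 0:
        return "ok"
    if result.returncode < 0:
        return f"killed by signal {-result.returncode}"
    return f"exited with code {result.returncode}"


if __name__ == "__main__":
    for path in ["onnx/varpitch/variance_pitch_v2.pitch.onnx",
                 "onnx/acoustic/acoustic_v1.onnx"]:
        result = run_isolated(LOAD_SNIPPET.format(path=path))
        print(path, "->", describe(result))
        print(result.stdout, result.stderr, sep="")
```

On Linux a segmentation fault shows up as signal 11, and faulthandler writes the Python stack at the moment of the crash to stderr. If the load succeeds, the printed input and output names are the same information Netron would show.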

Hi Peter, is the segmentation error similar to this? If yes, were you able to fix it?

(singer) C:\Users\Administrator\nnsvs-db-converter>python db_converter.py -L C:\Users\Administrator\nnsvs-db-converter\language-def.json -mD -c C:\Users\Administrator\nnsvs-db-converter\vlabeler_dataset_final
07/16/24 Tue 08:55:27 - INFO: Finding all labels.
07/16/24 Tue 08:55:27 - INFO: Found 1 label.
07/16/24 Tue 08:55:27 - INFO: Making directories and files.
07/16/24 Tue 08:55:27 - INFO: Reading C:\Users\Administrator\nnsvs-db-converter\vlabeler_dataset_final\wav\audio_file.wav.
07/16/24 Tue 08:55:27 - INFO: Estimating pitch for C:\Users\Administrator\nnsvs-db-converter\vlabeler_dataset_final\wav\audio_file.wav
07/16/24 Tue 08:55:27 - INFO: Segmenting C:\Users\Administrator\nnsvs-db-converter\vlabeler_dataset_final\lab\audio_file.lab.
07/16/24 Tue 08:55:27 - INFO: Splitting wave file and preparing transcription lines.
07/16/24 Tue 08:55:27 - INFO: Segment 1 / 4
07/16/24 Tue 08:55:27 - WARNING: Detected pure silence either from segment label or note sequence. Skipping.
07/16/24 Tue 08:55:27 - INFO: Segment 2 / 4
07/16/24 Tue 08:55:27 - WARNING: Detected pure silence either from segment label or note sequence. Skipping.
07/16/24 Tue 08:55:27 - INFO: Segment 3 / 4
07/16/24 Tue 08:55:27 - WARNING: Detected pure silence either from segment label or note sequence. Skipping.
07/16/24 Tue 08:55:27 - INFO: Segment 4 / 4
07/16/24 Tue 08:55:27 - WARNING: Detected pure silence either from segment label or note sequence. Skipping.
07/16/24 Tue 08:55:27 - INFO: Writing all transcripts.
07/16/24 Tue 08:55:27 - INFO: Took 0.19544469999999992 seconds

openvpi / DiffSinger

ONNX Inference Scripts Documentation #198