openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

How do I run inference from a file? #33

Closed: haru0l closed this issue 1 year ago

haru0l commented 1 year ago

As per title, I cannot seem to run inference on a new project using main.py, nor on any of the sample projects. It seems to be an error where ph_dur cannot be split because its value is null.

Also, I am using a Japanese model (albeit I haven't fixed all of its bugs) with a Japanese dict that I made. Training does seem to run fine, however.

flutydeer commented 1 year ago

Thank you for your feedback. These sample ds files have been replaced with new files that can be used for MIDI-less inference. Just pull and try again.

Since we keep improving DiffSinger, these ds files may be incompatible with future releases.

yqzhishen commented 1 year ago

This is because models in MIDI-less mode cannot predict phoneme durations themselves. To run inference, you must specify the phonemes, durations, and f0 explicitly. Please note that the current samples in this repository are only compatible with the Chinese dictionary. If you are using a model trained with a custom dictionary, you must make sure all phonemes are in the phoneme list; otherwise, an error will be raised.
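For reference, a ds file for MIDI-less input is plain JSON carrying exactly those fields. Below is a minimal sketch in Python of writing one; the field names follow the sample files in this repository, the phonemes, durations and f0 values are placeholders, and newer releases may change the format (e.g. by wrapping several segments in a list):

```python
import json

# Minimal MIDI-less .ds payload: explicit phonemes, per-phoneme durations
# (seconds), and an f0 curve sampled every f0_timestep seconds.
ds = {
    "text": "placeholder text",             # display text, not used for synthesis
    "ph_seq": "SP a i SP",                  # space-separated phoneme sequence
    "ph_dur": "0.20 0.35 0.40 0.10",        # one duration per phoneme, in seconds
    "f0_timestep": "0.005",                 # seconds between f0 samples
    "f0_seq": "220.0 220.5 221.0 220.8",    # f0 in Hz, truncated here for brevity
    "input_type": "phoneme",                # marks the input as MIDI-less
}

with open("sample.ds", "w", encoding="utf-8") as f:
    json.dump(ds, f, ensure_ascii=False, indent=4)
```

A full f0_seq would contain one value per f0_timestep across the whole utterance, i.e. roughly the sum of ph_dur divided by f0_timestep entries.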

haru0l commented 1 year ago

I see, thank you! By the way, how do I get f0_timestep and f0_seq?

flutydeer commented 1 year ago

Ds files are generated by the OpenSVIP converter, which can convert projects from other singing voice synthesizers (e.g. svp, svip) to ds files and write f0_timestep and f0_seq into them at the same time. The converter is only available in Simplified Chinese, and it may not work well with non-Chinese languages like Japanese.
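If the converter does not cover your language, another option is to extract the pitch curve yourself from a reference vocal recording. Here is a rough sketch using librosa's pYIN tracker; this is my own suggestion with hypothetical file names, not what the converter does, and librosa is not a dependency of this project:

```python
import librosa
import numpy as np

# Load a reference vocal take and track its pitch with pYIN.
y, sr = librosa.load("reference.wav", sr=44100)
hop_length = 512
f0, voiced_flag, _ = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"),
    sr=sr,
    hop_length=hop_length,
)

# Unvoiced frames come back as NaN; interpolate through them so the
# resulting curve is continuous.
idx = np.arange(len(f0))
voiced = ~np.isnan(f0)
f0 = np.interp(idx, idx[voiced], f0[voiced])

f0_timestep = hop_length / sr               # seconds per f0 sample
f0_seq = " ".join(f"{v:.1f}" for v in f0)   # space-separated Hz values
print(f0_timestep, f0_seq[:60])
```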

yqzhishen commented 1 year ago

> I see, thank you! By the way, how do I get f0_timestep and f0_seq?

The easiest way is to just use OpenUtau for DiffSinger, which is an unofficial fork of OpenUtau with support for the DiffSinger renderer. But please note that the voicebank packaging format used by this OpenUtau build may change in the future. If you do want to run inference with PyTorch and a customized ds file, you may have to write some code or scripts, since the format converter has no international support.
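On the "write some code or scripts" point: since an out-of-dictionary phoneme raises an error at inference time (as noted above), one useful script is a pre-flight check of a ds file against your phoneme list. A sketch, assuming the ds file is a single JSON object with a ph_seq field and the dictionary's phonemes are listed one per line; both are assumptions, so adjust to your actual formats:

```python
import json

# Hypothetical paths; point these at your own files.
with open("phonemes.txt", encoding="utf-8") as f:
    # Assumes one phoneme per line; adapt to your dictionary format.
    known = {line.strip() for line in f if line.strip()}

with open("sample.ds", encoding="utf-8") as f:
    ds = json.load(f)

phonemes = ds["ph_seq"].split()
unknown = [ph for ph in phonemes if ph not in known]
if unknown:
    print("Phonemes missing from the dictionary:", sorted(set(unknown)))
else:
    print("All", len(phonemes), "phonemes are covered.")
```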

haru0l commented 1 year ago

Thanks for the replies~ I'll look into the OpenSVIP converter. I tried using the OpenUtau for DiffSinger fork, though I cannot seem to convert my model to the ONNX format, since the exporter detects my model as not being MIDI-less (even though the config uses nomidi... qwq)

yqzhishen commented 1 year ago

All models trained with configs and data built with this pipeline, or with this release, should be in MIDI-less mode. If not, something went wrong in your dataset building, preprocessing, or training procedure. By the way, please do not train models in MIDI mode; support for any mode other than MIDI-less may be removed in the future.

haru0l commented 1 year ago

It turns out I accidentally used the file wrong qwq: I pointed the exp to the ckpt instead of the singer name. For some reason I also had to downgrade my torch version to 1.8.1, due to this error:

export() got an unexpected keyword argument 'example_outputs'

It seems like example_outputs is deprecated (unsure), but at least I got the ONNX file...

yqzhishen commented 1 year ago

Yes, that script only supports PyTorch 1.8, due to some bugs caused by TorchScript, or by the official ONNX API, or just bugs in my own code (I have not had time to figure out whose fault it is, but PyTorch 1.8 does run well with it). Anyway, exporting ONNX files does not require a GPU, so installing a CPU-only PyTorch just for exporting is not too annoying.
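For context on the error above: example_outputs was a keyword argument of torch.onnx.export used when exporting TorchScript modules in the PyTorch 1.8 era, and it was removed in later releases, which is why newer PyTorch rejects it. A toy illustration of the version difference (not the actual export script from this repository):

```python
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return x * 2

model = torch.jit.script(Toy())
dummy = torch.randn(1, 3)

try:
    # PyTorch <= 1.10 accepted example_outputs for ScriptModules;
    # on newer versions this raises the TypeError quoted above.
    torch.onnx.export(model, (dummy,), "toy.onnx",
                      example_outputs=(model(dummy),))
except TypeError:
    # Newer PyTorch infers output shapes itself; just drop the argument.
    torch.onnx.export(model, (dummy,), "toy.onnx")
```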