openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

ONNX inference 'depth' parameter #176

Closed loct824 closed 6 months ago

loct824 commented 7 months ago

Hi,

Thank you for the help in maintaining the DiffSinger repo.

I have already exported the ONNX model following https://github.com/openvpi/DiffSinger/blob/main/docs/GettingStarted.md#deployment.

I am trying to make inference with the exported ONNX model. I referenced your previous DiffSingerMiniEngine, downloaded all the required dependencies (pitch model, vocoder, etc.), and modified the config accordingly.

When I make an API call to localhost:9266/submit, I get the error below:

2024-03-05 17:31:26 - INFO   : Task '57493d07563b00c43daed67660482bac' begins
127.0.0.1 - - [05/Mar/2024 17:31:26] "POST /submit HTTP/1.1" 200 -
2024-03-05 17:31:26 - WARNING: CUDAExecutionProvider is not available on this machine. Skipping.
2024-03-05 17:31:26 - WARNING: DmlExecutionProvider is not available on this machine. Skipping.
2024-03-05 17:31:29 - ERROR  : Task '57493d07563b00c43daed67660482bac' failed
2024-03-05 17:31:29 - ERROR  : Required inputs (['depth']) are missing from input feed (['tokens', 'durations', 'f0', 'speedup']).

It seems that I need to provide an additional 'depth' argument in the method below:

def acoustic_infer(model: str, providers: list, tokens, durations, f0, speedup):
    session = utils.create_session(model, providers)
    mel = session.run(['mel'], {'tokens': tokens, 'durations': durations, 'f0': f0, 'speedup': speedup})[0]
    return mel

but I am not sure how it should be done. Appreciate any advice on this.

yqzhishen commented 7 months ago

The depth input is introduced by the shallow diffusion mechanism; you can read the documentation for details. Briefly speaking, it equals K_step in the configuration file used for training.
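
A minimal sketch of what that could look like, assuming the exported acoustic model takes depth as a scalar int64 input and calling onnxruntime directly (the variable names follow the MiniEngine snippet above; the config path and usage are illustrative, not the maintainer's exact code):

import numpy as np
import onnxruntime as ort
import yaml

def acoustic_infer(model: str, providers: list, tokens, durations, f0, speedup, depth):
    # Same structure as the MiniEngine helper quoted above, with the extra 'depth' input.
    session = ort.InferenceSession(model, providers=providers)
    mel = session.run(
        ['mel'],
        {
            'tokens': tokens,
            'durations': durations,
            'f0': f0,
            'speedup': speedup,
            'depth': np.array(depth, dtype=np.int64),  # assumed scalar int64
        },
    )[0]
    return mel

# The depth value equals K_step from the training configuration, e.g.:
with open('config.yaml', 'r', encoding='utf8') as f:
    depth = yaml.safe_load(f)['K_step']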

loct824 commented 7 months ago

I got another error:

[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Gather node. Name:'/fs2/txt_embed/Gather' Status Message: indices element out of data bounds, idx=50 must be within the inclusive range [-50,49]

Should I use the ONNX model for code deployment (e.g. building an API)? Does it require significant effort to refactor the code given the latest changes?

yqzhishen commented 7 months ago

The error means a phoneme ID (50) is out of the bounds of the model's text embedding table (50 entries, valid IDs 0 to 49). Seems like your model has a different phoneme set compared to the default one in MiniEngine. You should use the correct dictionary to run inference with the model.

However, MiniEngine is no longer maintained. If you do not have a strong need to run models from the CLI or on a remote host, please consider using OpenUTAU for a modern user experience. I also recommend referring to its implementation of the whole inference procedure.

loct824 commented 7 months ago

I managed to make inference after doing below changes:

  1. Changing reserved_tokens to 2 in the config file:

    filename: assets/dictionaries/dictionary.txt
    reserved_tokens: 2

    Do I understand correctly that reserved tokens are tokens like AP and SP that are not in the phoneme dictionary?

  2. Adding the depth parameter in the acoustic_infer method:

    def acoustic_infer(model: str, providers: list, tokens, durations, f0, speedup):
        session = utils.create_session(model, providers)
        print(type(tokens))
        print(type(durations))
        print(type(f0))
        print(type(speedup))
        mel = session.run(['mel'], {'tokens': tokens, 'durations': durations, 'f0': f0, 'speedup': speedup, 'depth': np.array(1000)})[0]
        return mel

However, I noticed a significant difference in output waveform quality compared to the results I obtained with infer.py in the DiffSinger repo. Could you give some advice on how we might refactor the code in DiffSingerMiniEngine to obtain quality similar to the DiffSinger repo?

yqzhishen commented 7 months ago

No, reserved tokens are padding tokens kept for historical reasons, and most models nowadays have only 1 reserved token. AP and SP are real tokens. You should make sure the phoneme IDs are correct to get reasonable results.

Are you sure you are using the correct dictionary of the model?
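
To illustrate the indexing (a hedged sketch with hypothetical helper names, not the actual MiniEngine code): phoneme IDs are the dictionary positions offset by the number of reserved padding tokens, so inferring with reserved_tokens: 2 on a model trained with 1 shifts every ID by one and can push the largest ID out of the embedding table, as in the Gather error above.

def load_phonemes(dictionary_path: str) -> list:
    # DiffSinger-style dictionary lines look like "word<TAB>ph1 ph2 ...".
    phonemes = {'AP', 'SP'}  # AP/SP are real phonemes, not reserved tokens
    with open(dictionary_path, 'r', encoding='utf8') as f:
        for line in f:
            word, phones = line.strip().split('\t', 1)
            phonemes.update(phones.split())
    return sorted(phonemes)

def phoneme_ids(phones: list, dictionary_path: str, reserved_tokens: int = 1) -> list:
    # IDs start after the reserved padding token(s).
    id_map = {p: i + reserved_tokens for i, p in enumerate(load_phonemes(dictionary_path))}
    return [id_map[p] for p in phones]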

loct824 commented 7 months ago

Yes, I am using exactly the same dictionary as the one used for training.

I just changed reserved_tokens to 1, and it now gives the same results and quality as infer.py. I guess it was the reserved_tokens value that shifted the indices used for phonemes.

Thank you so much for your help!