The `depth` input is introduced by the shallow diffusion mechanism; you can read the documentation for details. Briefly speaking, it equals `K_step` in the training configuration file.
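Concretely, one way to wire that up is sketched below, assuming the training config is a YAML file exposing a `K_step` key; the file path and variable names are illustrative, not part of MiniEngine.

```python
# Minimal sketch: read K_step from the training config and pass it as the
# 'depth' input of the exported acoustic model. The config path is an
# assumption; adapt it to wherever your training config lives.
import numpy as np
import yaml

with open("config.yaml", "r", encoding="utf-8") as f:
    train_config = yaml.safe_load(f)

# The exported model typically takes 'depth' as a scalar integer tensor.
depth = np.array(train_config["K_step"], dtype=np.int64)
```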
I got another error:
[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Gather node. Name:'/fs2/txt_embed/Gather' Status Message: indices element out of data bounds, idx=50 must be within the inclusive range [-50,49]
Should I use the ONNX model for code deployment (e.g. building an API)? Does it require significant effort to refactor the code given the latest changes?
It seems like your model has a different phoneme set compared to the default one in MiniEngine. You should use the correct dictionary to run inference with the model.
However, MiniEngine is no longer maintained. If you do not have a strong need to run models from the CLI or on a remote host, please consider using OpenUTAU for a modern user experience. I also recommend referring to its complete inference procedure.
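If it helps to confirm a mismatch: the `[-50,49]` bound in the Gather error implies a phoneme embedding table of 50 entries. A rough sanity check is sketched below; the dictionary path, the tab-separated dictionary format, and the way `AP`/`SP` and padding are counted are assumptions to adapt to your own setup.

```python
# Sketch: count the tokens implied by the dictionary and compare against the
# 50-entry embedding table implied by the Gather error.
phonemes = set()
with open("assets/dictionaries/dictionary.txt", "r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        _, phones = line.strip().split("\t")  # syllable <TAB> phoneme list
        phonemes.update(phones.split())

phonemes.update(["AP", "SP"])  # breath/silence tokens, not listed in the dictionary
expected = len(phonemes) + 1   # plus reserved padding token(s)
print(f"dictionary implies {expected} embeddings; the error implies 50")
```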
I managed to run inference after making the changes below:
Changing `reserved_tokens` to 2 in the config file:
filename: assets/dictionaries/dictionary.txt
reserved_tokens: 2
Is my understanding correct that the reserved tokens are tokens like `AP` and `SP`, which are not in the phoneme dictionary?
Adding the `depth` parameter in the `acoustic_infer` method:
```python
import numpy as np

import utils  # MiniEngine helper; create_session builds an ONNX Runtime session


def acoustic_infer(model: str, providers: list, tokens, durations, f0, speedup):
    session = utils.create_session(model, providers)
    # Debug output: confirm the input arrays arrive with the expected types.
    print(type(tokens), type(durations), type(f0), type(speedup))
    # The exported acoustic model requires an extra 'depth' input (shallow
    # diffusion); it should match K_step from the training config (1000 here).
    mel = session.run(['mel'], {'tokens': tokens, 'durations': durations, 'f0': f0,
                                'speedup': speedup, 'depth': np.array(1000)})[0]
    return mel
```
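For reference, a hypothetical call with dummy inputs might look like the sketch below; the shapes, dtypes, model path, and ID/duration values are assumptions rather than something taken from MiniEngine.

```python
import numpy as np

# Dummy inputs; shapes, dtypes, and ID/duration values are illustrative only.
tokens = np.array([[3, 15, 7, 22]], dtype=np.int64)      # phoneme IDs (batch of 1)
durations = np.array([[12, 25, 20, 9]], dtype=np.int64)  # frames per phoneme
f0 = np.full((1, int(durations.sum())), 440.0, dtype=np.float32)  # flat pitch curve in Hz
speedup = np.array(10, dtype=np.int64)                   # diffusion sampling speedup factor

mel = acoustic_infer("acoustic.onnx", ["CPUExecutionProvider"], tokens, durations, f0, speedup)
print(mel.shape)
```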
However, I noticed a significant difference in the quality of the output waveform compared to the results I obtained using `infer.py` in the DiffSinger repo. Could you give some advice on how to refactor the code in DiffSingerMiniEngine to obtain quality similar to the DiffSinger repo?
No, reserved tokens were padding tokens for historical reasons, and most models nowadays have only 1 reserved token. `AP` and `SP` are real tokens. You should make sure the phoneme IDs are correct to get reasonable results.
Are you sure you are using the correct dictionary for the model?
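For illustration, here is a minimal sketch of how a phoneme-to-ID map can be built from the dictionary with a single leading padding token. The sorted ordering and the explicit `AP`/`SP` handling are assumptions about the exporter, so treat this as a rough picture of why the reserved-token count shifts every index.

```python
# Sketch: phoneme-to-ID map with one reserved padding slot at index 0.
# Ordering and AP/SP handling are assumptions; an extra reserved token
# shifts every ID up by one, producing the out-of-bounds Gather error above.
def build_phoneme_map(dictionary_path: str, reserved_tokens: int = 1) -> dict:
    phonemes = {"AP", "SP"}  # real tokens, not listed in the dictionary file
    with open(dictionary_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            _, phones = line.strip().split("\t")
            phonemes.update(phones.split())
    return {p: i + reserved_tokens for i, p in enumerate(sorted(phonemes))}
```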
Yes, I am using the exact same dictionary as the one used for training.
I just changed `reserved_tokens` to 1, and it now gives the same results and quality as `infer.py`. I guess it was the `reserved_tokens` setting that affected the indices used for the phonemes.
Thank you so much for your help!
Hi,
Thank you for your work maintaining the DiffSinger repo.
I have already exported the ONNX model following https://github.com/openvpi/DiffSinger/blob/main/docs/GettingStarted.md#deployment.
I am trying to run inference with the exported ONNX model. I referenced your previous DiffSingerMiniEngine, downloaded all the required dependencies (pitch model, vocoder, etc.), and modified the config accordingly.
When I make an API call to localhost:9266/submit, I get the error below:
It seems that I need to provide an additional `depth` argument in the method below, but I am not sure how it should be done. I would appreciate any advice on this.
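(For anyone hitting the same question: one quick way to see exactly which inputs the exported model expects is to list them with ONNX Runtime, as in the small sketch below; the model path is a placeholder.)

```python
import onnxruntime as ort

# List the inputs of the exported acoustic model; the path is a placeholder
# for wherever your exported .onnx file lives.
session = ort.InferenceSession("acoustic.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```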