ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MIT License

How to convert FastSpeech2 to ONNX with dynamic input and output? #139

Open Tian14267 opened 2 years ago

Tian14267 commented 2 years ago

How can I get dynamic input when exporting a torch model to an ONNX model? I pass dynamic_axes to the export, but the output at inference time is not dynamic.

My code:

    import numpy as np
    import torch

    input_names = ['speakers', 'texts', 'src_lens', 'max_src_len']
    output_names = ['output', 'postnet_output', 'p_predictions', 'e_predictions', 'log_d_predictions', 'd_rounded',
                    'src_masks', 'mel_masks', 'src_lens', 'mel_lens']
    # Mark the sequence axes as dynamic so their sizes are not frozen at export time
    dynamic_axes = {
        "texts": {1: "texts_len"},
        "output": {1: "output_len"},
        "postnet_output": {1: "postnet_output_len"},
        "p_predictions": {1: "p_predictions_len"},
        "e_predictions": {1: "e_predictions_len"},
        "log_d_predictions": {1: "log_d_predictions_len"},
        "d_rounded": {1: "d_rounded_len"},
        "src_masks": {1: "src_masks_len"}
    }

    # Dummy inputs used to trace the model during export
    texts_len = 10
    speakers = torch.tensor([0])
    texts = torch.randint(1, 200, (1, texts_len))
    text_lens = torch.tensor([texts_len])
    max_len = torch.from_numpy(np.array(texts_len)).to(device)
    torch.onnx.export(model, args=(speakers, texts, text_lens, max_len), f="./FastSpeech_2.onnx",
                      input_names=input_names, output_names=output_names, dynamic_axes=dynamic_axes, opset_version=11)
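
One way to sanity-check the export is to inspect which axes actually came out dynamic: in the saved graph, a dynamic axis appears as a named dim_param while a fixed axis appears as an integer dim_value. A minimal sketch using the onnx package (only the FastSpeech_2.onnx file from the export above is assumed):

    import onnx

    m = onnx.load("./FastSpeech_2.onnx")
    for tensor in list(m.graph.input) + list(m.graph.output):
        # dim_param holds the symbolic name of a dynamic axis; dim_value a fixed size
        dims = [d.dim_param or d.dim_value for d in tensor.type.tensor_type.shape.dim]
        print(tensor.name, dims)
    # "texts" should print as [1, 'texts_len']. If it does, but the model still
    # fails at other lengths, the frozen sizes live inside node attributes
    # (e.g. Split), not in the input dims.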

In the code I use src_lens=10 and it works. But at inference with this ONNX model, when I give input with src_lens=50 or anything else, I get this error:

2022-01-18 16:29:38.644831855 [E:onnxruntime:, sequential_executor.cc:346 Execute] Non-zero status code returned while running Split node. Name:'Split_2888' Status Message: Cannot split using values in 'split' attribute. Axis=0 Input shape={27,256} NumOutputs=10 Num entries in 'split' (must equal number of outputs) was 10 Sum of sizes in 'split' (must equal size of selected axis) was 10
Traceback (most recent call last):
  File "torch2onnx_2.py", line 497, in <module>
    onnx_mode_test()
  File "torch2onnx_2.py", line 471, in onnx_mode_test
    ort_outs = ort_session.run(None, ort_inputs)
  File "/root/anaconda3/envs/tts_fffan/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 192, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Split node. Name:'Split_2888' Status Message: Cannot split using values in 'split' attribute. Axis=0 Input shape={27,256} NumOutputs=10 Num entries in 'split' (must equal number of outputs) was 10 Sum of sizes in 'split' (must equal size of selected axis) was 10

It seems that the input length must be 10 and can't be dynamic. Can somebody help me?
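
This failure mode can be reproduced in isolation. The Split node likely comes from per-token processing in the length regulator: torch.split called with Python-int sizes is evaluated once at trace time, so the sizes are frozen into the ONNX Split node's 'split' attribute, and marking the input axis dynamic does not help. A minimal sketch with a hypothetical module (not code from this repo):

    import torch
    import torch.nn as nn

    class SplitPerToken(nn.Module):
        def forward(self, x, durations):
            sizes = durations.tolist()  # evaluated once at trace time -> constants
            chunks = x.split(sizes, dim=0)
            return torch.cat([c.mean(dim=0, keepdim=True) for c in chunks], dim=0)

    x = torch.randn(10, 256)
    durations = torch.tensor([1, 2, 3, 4])  # sums to 10
    torch.onnx.export(SplitPerToken(), (x, durations), "split_demo.onnx",
                      input_names=["x", "durations"], output_names=["y"],
                      dynamic_axes={"x": {0: "n"}}, opset_version=11)
    # Running split_demo.onnx on an x of shape (12, 256) fails with the same
    # "Cannot split using values in 'split' attribute" error as above.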

Pydataman commented 2 years ago

Did you convert to ONNX successfully?

brooks0519 commented 1 year ago

@Tian14267 I have the same problems as you. Besides, when I feed an input of the same length but with different text tokens, the onnxruntime inference result is worse, because the mel output length seems to be tied to the input data used when tracing the ONNX model. This problem has confused me for several weeks; can anyone give me some ideas? Thanks!

hungphamNLP commented 1 year ago

@Tian14267 Have you found any way to fix it? I have the same problem as you.

brooks0519 commented 1 year ago

The output of duration_prediction differs for different tokens even when the input length is the same, so if we convert FastSpeech2 to a single ONNX model we end up with the problems above. To fix this, I split FastSpeech2 into three sub-models when converting to ONNX, run the length_regulator inference outside onnxruntime, and finally build the sub-model inference pipeline with onnxruntime. I hope this is useful for you!
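
For concreteness, a rough sketch of the pipeline described above, assuming the model has been cut so the encoder (through the variance predictors) and the decoder are separate ONNX files; the file names, input/output names, and the exact split points are hypothetical and depend on how you cut the model:

    import numpy as np
    import onnxruntime as ort

    encoder = ort.InferenceSession("encoder.onnx")  # texts -> hidden states + log durations
    decoder = ort.InferenceSession("decoder.onnx")  # expanded hidden states -> mel

    def length_regulate(hidden, durations):
        # NumPy length regulator: repeat token i's hidden vector durations[i] times.
        # Running this outside onnxruntime keeps frozen Split sizes out of the graphs.
        return np.repeat(hidden, durations, axis=0)

    texts = np.random.randint(1, 200, (1, 37), dtype=np.int64)  # any length works now
    hidden, log_d = encoder.run(None, {"texts": texts})
    # FastSpeech2 rounds predicted durations as clamp(round(exp(log_d) - 1), min=0)
    durations = np.clip(np.round(np.exp(log_d[0]) - 1), 0, None).astype(np.int64)
    expanded = length_regulate(hidden[0], durations)[None, ...]  # restore batch dim
    mel = decoder.run(None, {"expanded_hiddens": expanded})[0]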