k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

int8 quantized TTS model slower than fp32 #575

Open martinshkreli opened 8 months ago

martinshkreli commented 8 months ago

```
(myenv) ubuntu@152:~/sherpa-onnx/python_api_examples$ python3 test.py
Elapsed: 0.080
Saved sentence_0.wav.
Elapsed: 0.085
Saved sentence_1.wav.
Elapsed: 0.080
Saved sentence_2.wav.
Elapsed: 0.074
Saved sentence_3.wav.
Elapsed: 0.054
Saved sentence_4.wav.
Elapsed: 0.081
Saved sentence_5.wav.
Elapsed: 0.067
```

```
(myenv) ubuntu@152-69-195-75:~/sherpa-onnx/python_api_examples$ python3 test.py
Elapsed: 19.561
Saved sentence_0.wav.
Elapsed: 26.432
Saved sentence_1.wav.
Elapsed: 27.989
Saved sentence_2.wav.
Elapsed: 23.956
Saved sentence_3.wav.
Elapsed: 11.361
Saved sentence_4.wav.
Elapsed: 27.825
Saved sentence_5.wav.
Elapsed: 19.567
```

Is there any special flag I need to set to use int8?

danpovey commented 8 months ago

Fangjun will get back to you about it, but: hi, Martin Shkreli! We might need more hardware info and details about what differed between those two runs.

csukuangfj commented 8 months ago

@martinshkreli

Could you describe how you got the int8 models?

martinshkreli commented 7 months ago

Hi guys, thanks again for the wonderful repo. I followed this link to download the model: https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#download-the-model

Then, I used that file (vits-ljs.int8.onnx) for inference in the python script (offline-tts.py). This was on an 8xA100 instance.
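
For reference, here is a minimal sketch of how such timings are typically produced with the sherpa-onnx Python TTS API; the actual run above used offline-tts.py, and the file paths below assume the vits-ljs download layout, so treat this as an illustration rather than the exact script:

```python
import time

import sherpa_onnx
import soundfile as sf

# Assumed paths from the vits-ljs download; adjust to your local layout.
config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="vits-ljs/vits-ljs.int8.onnx",  # or vits-ljs.onnx for float32
            lexicon="vits-ljs/lexicon.txt",
            tokens="vits-ljs/tokens.txt",
        ),
        num_threads=1,
    ),
)
tts = sherpa_onnx.OfflineTts(config)

# Time a single generation, as in the output pasted earlier.
start = time.time()
audio = tts.generate("Hello from sherpa-onnx.", sid=0, speed=1.0)
print(f"Elapsed: {time.time() - start:.3f}")

sf.write("sentence_0.wav", audio.samples, samplerate=audio.sample_rate, subtype="PCM_16")
print("Saved sentence_0.wav.")
```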

martinshkreli commented 7 months ago

@martinshkreli

Could you describe how you got the int8 models?

Hi Fangjun, I just wanted to try and get your attention one more time; sorry if I am being annoying!

csukuangfj commented 7 months ago

The int8 model is obtained via the following code: https://github.com/k2-fsa/sherpa-onnx/blob/d7717628689b051b4c9bffd8d43f3e074388e2d7/scripts/vits/export-onnx-ljs.py#L204-L208

Note that it uses https://github.com/k2-fsa/sherpa-onnx/blob/d7717628689b051b4c9bffd8d43f3e074388e2d7/scripts/vits/export-onnx-ljs.py#L207

It is a known issue with onnxruntime that quint8 is slower.

For instance, if you search with Google, you can find similar issues:
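
In other words, the quantization step in that script boils down to onnxruntime dynamic quantization with quint8 weights; a condensed sketch (file names are illustrative):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization as done in export-onnx-ljs.py; the quint8 weight type
# is the part that is known to run slowly in onnxruntime on some CPUs.
quantize_dynamic(
    model_input="vits-ljs.onnx",        # exported float32 model (illustrative name)
    model_output="vits-ljs.int8.onnx",  # quantized int8 model
    weight_type=QuantType.QUInt8,
)
```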

danpovey commented 7 months ago

Fangjun, is the int8 model intended for different applications or devices, then?

csukuangfj commented 7 months ago

The int8 model mentioned in this issue is about 4x smaller in file size than the float32 one.

If memory matters, the int8 model is preferred.
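
If you want to confirm the size difference locally, a quick check (file names assumed to match the vits-ljs download referenced earlier in this thread):

```python
import os

# Compare on-disk sizes of the float32 and int8 exports.
for name in ("vits-ljs.onnx", "vits-ljs.int8.onnx"):
    print(f"{name}: {os.path.getsize(name) / 1e6:.1f} MB")
```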

beqabeqa473 commented 6 months ago

Hi @csukuangfj, do you know how to optimize the speed of an int8 model? I was experimenting with it several months ago, but I was not able to convert to qint8, and quint8 is really slow on CPU.

nshmyrev commented 6 months ago

You don't need to optimize speed; you need to pick an MB-iSTFT VITS model. They are an order of magnitude faster than raw VITS with the same quality.

smallbraingames commented 3 months ago

You don't need to optimize speed; you need to pick an MB-iSTFT VITS model. They are an order of magnitude faster than raw VITS with the same quality.

Where can we find these models?