liusongxiang / StarGAN-Voice-Conversion

This is a PyTorch implementation of the paper: StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
https://arxiv.org/abs/1806.02169

Inference time #3

Closed: carlfm01 closed this issue 5 years ago

carlfm01 commented 5 years ago

The paper claims that the method is fast enough for at least real-time use. I would like to know the inference time you measured and the hardware you are using. Thanks.

sunil3590 commented 5 years ago

@carlfm01, did you get an answer? I have the same question.

carlfm01 commented 5 years ago

Hi @sunil3590, no answer yet, but the paper says the method "(3) is able to generate converted speech signals quickly enough to allow real-time implementations and". Sadly, it does not mention the hardware required to achieve that.

tranctan commented 4 years ago

Hi, I came across this issue recently. On close inspection, the model itself generates the signal very quickly (milliseconds); however, the total inference time is not real-time at all.

The bottleneck is actually the WORLD analysis step, which decomposes the raw input audio into F0, spectral envelope (sp), and aperiodicity (ap). The time taken grows with the length of the audio. In my case, a 14.6 s input took nearly 5 s to run inference.
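
For anyone who wants to check where the time goes on their own hardware, here is a minimal sketch of how the stages could be timed separately and turned into a real-time factor (processing time divided by audio duration). `time_stage` and `real_time_factor` are hypothetical helper names for illustration, not functions from this repo, and the numbers plugged in are only the rough figures reported above.

```python
import time


def time_stage(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration.

    A value below 1.0 means processing is faster than playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Rough figures from this thread: about 5 s of processing for 14.6 s of audio.
rtf = real_time_factor(5.0, 14.6)
print(f"RTF = {rtf:.2f}")  # RTF = 0.34
```

In practice you would wrap each step (WORLD analysis, the generator forward pass, synthesis) in `time_stage` and compare the elapsed times. Note that an RTF below 1.0 only says that batch processing keeps up with playback; if the whole clip has to be analysed before conversion starts, the latency is still several seconds.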