Any plans to update BigVSAN with the new BigVGAN-v2?

sony / bigvsan

PyTorch implementation of BigVSAN

MIT License

Any plans to update BigVSAN with the new BigVGAN-v2? #6

Open blaisewf opened 1 month ago

blaisewf commented 1 month ago

https://github.com/NVIDIA/BigVGAN?tab=readme-ov-file#news

TakashiShibuyaSony commented 1 month ago

Thank you for your interest in our work. Currently, we don't have such a plan.

We could apply our SAN training framework to their new discriminators as well, but we can't obtain the same training data, and there's no detailed information about what kind of data they used.

When we have a new release plan, we'll inform you. Thank you.

blaisewf commented 1 month ago

It seems some people have already implemented it: https://github.com/reppy4620/vocoders/commit/044bbd6027e8f750782bf62fb9295aa7b327df72

TakashiShibuyaSony commented 1 month ago

Thank you for letting us know! Plus, it's an honor that they tried implementing it.

blaisewf commented 1 month ago

It is amazing what the community can achieve. We are looking to implement it in our code.

I wanted to ask you a question. BigVGAN is a bit heavy in terms of performance. How has this been improved with BigVSAN?

TakashiShibuyaSony commented 1 month ago

What do you mean by "heavy"? Do you mean it requires a lot of compute and data? If so, BigVSAN is actually the same in both respects.

If you want to train a vocoder on your own speech/audio data and the dataset is not very large, my personal recommendation is to use the HiFi-GAN generator with BigVSAN's discriminators (or the new discriminators the community implemented), if you don't mind doing some engineering. Otherwise, just using HiFi-GAN or another existing vocoder is fine. This is only my own idea; I can't guarantee the best performance, as I've never tried it.

TakashiShibuyaSony commented 1 month ago

We found that reppy4620 has compared BigVGAN/SAN v1/v2 on the JSUT dataset (10 hours of reading-style speech uttered by a single speaker): https://zenn.dev/reppy/scraps/bfe869f4440129 (Japanese article)

In their experiment, v2 doesn't perform well. Its discriminator might be data hungry. At the very least, it seems to have some requirements that must be met for it to perform well.

blaisewf commented 1 month ago

interesting info, thanks for sharing it 😀