MetaVoice-1B: fix degradation compared to Python version

huggingface / candle

Minimalist ML framework for Rust

Apache License 2.0

MetaVoice-1B: fix degradation compared to Python version #1801

Open vatsalaggarwal opened 4 months ago

vatsalaggarwal commented 4 months ago

The MetaVoice-1B model in candle has significant quality degradation compared to the Python version. I believe one of the main causes is the use of a 64x smaller decoder model (instead of multiband diffusion and DeepFilterNet).

Multiband diffusion is a general-purpose, diffusion-based model that can decode EnCodec tokens (EnCodec is a neural audio codec that can model diverse audio, including speech and music). So there are additional benefits to having this in the candle codebase for any other LLMs in the audio/music/speech space.

DeepFilterNet is a powerful speech enhancement model, and so there are also additional benefits to having this within candle.

vatsalaggarwal commented 4 months ago

@LaurentMazare does that seem right? Or are there other places where significant quality degradation could be coming from?

LaurentMazare commented 4 months ago

Right, I think this might explain most of the difference. I've aligned the first model carefully with a temperature of 0, but not the second model, so there might be other discrepancies coming from there. Another difference is that speaker embeddings are not fully supported in candle at the moment, though hopefully this won't be too hard to add (I've started making the appropriate changes).
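For readers wondering why a temperature of 0 helps when aligning two implementations: at temperature 0, sampling reduces to picking the highest-scoring token, so both versions should produce the same token sequence whenever their logits agree. The snippet below is only a minimal sketch of that idea using the Rust standard library. The function names `argmax` and `softmax_with_temperature` are hypothetical and are not candle's API or the MetaVoice code.

```rust
/// Index of the largest logit. This is what sampling reduces to at
/// temperature 0 (greedy decoding). Hypothetical helper, not candle's API.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        // total_cmp gives a total order on f32, so NaN cannot cause a panic.
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Softmax over logits divided by a strictly positive temperature.
/// As the temperature shrinks, the mass concentrates on the argmax.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum before exponentiating for numerical stability.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Made-up logits for illustration only.
    let logits = [1.0_f32, 3.5, 0.2, 3.4];
    println!("greedy token index: {:?}", argmax(&logits));
    for t in [1.0_f32, 0.1, 0.01] {
        let probs = softmax_with_temperature(&logits, t);
        println!("temperature {t}: {probs:?}");
    }
}
```

With deterministic decoding like this, any remaining mismatch between two implementations points at the model computation rather than at random sampling.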

vatsalaggarwal commented 4 months ago

Saw the note about the speaker embeddings in your README, that makes sense and, as you say, should be quick to fix! Ah, I got what you meant by "implementation discrepancies" now re: the second stage...

groovybits commented 4 months ago

Curious what the path is to get the quality working better?

Is it a known piece of work still to be done, or is more research needed before the fix is clear?

I can poke around and dig more; I've been focused on other issues, but they seem to be fixed. I'm not sure what to look at, since I don't know whether this has a known solution or needs more investigation.

Thanks!

vatsalaggarwal commented 4 months ago

Hey Chris, I would say it's a known piece of work... we'd have to change the decoder currently integrated into candle...

Catchawink commented 2 months ago

Any updates on speaker embeddings support? I'd like to work on it if no one else is currently.

LaurentMazare commented 2 months ago

> Any updates on speaker embeddings support? I'd like to work on it if no one else is currently.

I'm not looking at it at the moment, would be great if you can give it a try!