metavoiceio / metavoice-src

Foundational model for human-like, expressive TTS
https://themetavoice.xyz/
Apache License 2.0

How to change similarity and stability in sampling.py? #72

Closed: G-force78 closed this issue 7 months ago

G-force78 commented 7 months ago

Hi, great implementation. I'm impressed by the accuracy of the one-shot cloning, and I'm looking forward to the fine-tuning code being released. In the meantime, could you tell me how to change similarity and stability in sampling.py? What do they relate to? I'm thinking top_p and top_k? Or `guidance_scale: Optional[Tuple[float, float]] = (3.0, 1.0)` ("""Guidance scale for sampling: (speaker conditioning guidance_scale, prompt conditioning guidance_scale).""")?

Are you using some sort of controlnet?

Thanks

vatsalaggarwal commented 7 months ago

@G-force78

You can find code to convert between these values here: https://github.com/metavoiceio/metavoice-src/blob/main/app.py#L29-L36
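For a rough intuition only, here is a hypothetical sketch of such a conversion (not the exact code from the linked app.py): treating "stability" as reducing sampling randomness (top_p) and "similarity" as increasing the speaker-conditioning guidance scale:

```python
# Hypothetical sketch only -- see the linked app.py for the actual conversion.
# Assumes both sliders are given in [0, 1].
def ui_to_sampling_params(stability: float, similarity: float) -> tuple[float, float]:
    top_p = 1.0 - 0.5 * stability            # more stability -> less sampling randomness
    guidance_scale = 1.0 + 4.0 * similarity  # more similarity -> stronger speaker conditioning
    return top_p, guidance_scale

# Example: stability=0.5, similarity=0.75 -> top_p=0.75, guidance_scale=4.0
print(ui_to_sampling_params(0.5, 0.75))
```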

G-force78 commented 7 months ago

OK, thanks for that. I've noticed you've changed sample.py to fast_inference.py; however, the new script doesn't produce any audio outputs, not that I can find anyway.

2024-02-26 11:38:47 | INFO | DF | Running on torch 2.2.1+cu121
2024-02-26 11:38:47 | INFO | DF | Running on host 855f76734407
fatal: not a git repository (or any of the parent directories): .git
2024-02-26 11:38:47 | INFO | DF | Loading model settings of DeepFilterNet3
2024-02-26 11:38:47 | INFO | DF | Using DeepFilterNet3 model at /root/.cache/DeepFilterNet/DeepFilterNet3
2024-02-26 11:38:47 | INFO | DF | Initializing model deepfilternet3
2024-02-26 11:38:47 | INFO | DF | Found checkpoint /root/.cache/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2024-02-26 11:38:47 | INFO | DF | Running on device cuda:0
2024-02-26 11:38:47 | INFO | DF | Model loaded
Using device=cuda
Loading model ...
using dtype=float16
Time to load model: 19.44 seconds
Compiling...Can take up to 2 mins.
100% 199/199 [00:27<00:00, 7.18it/s]
Compilation time: 51.38 seconds

vatsalaggarwal commented 7 months ago

@sidroopdaska

sidroopdaska commented 7 months ago

Hey @G-force78, based on your stack trace above it looks like you haven't run the synthesise() API?

You'll need to run both of the steps below:

# sets up the model
python -i fam/llm/fast_inference.py 

# runs synthesise. The outputs get stored under the `output/` directory at the root level of the repo. There is also a print statement that shares the output path
tts.synthesise(text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.", spk_ref_path="assets/bria.mp3")
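For reference, the same flow as a standalone script might look roughly like this (a sketch: it assumes `fast_inference.py` exposes a `TTS` class that the interactive session above instantiates as `tts`):

```python
# Sketch under the assumption above -- adjust the import/class name if it differs.
from fam.llm.fast_inference import TTS

tts = TTS()  # loads and compiles the model; this is the slow step in the log above
tts.synthesise(
    text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.",
    spk_ref_path="assets/bria.mp3",
)
# The generated audio is written under output/ at the repo root, and the output
# path is printed by the library.
```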

G-force78 commented 7 months ago

> Hey @G-force78, based on your stack trace above it looks like you haven't run the synthesise() API?
>
> You'll need to run both of the steps below:
>
> # sets up the model
> python -i fam/llm/fast_inference.py
>
> # runs synthesise. The outputs get stored under the `output/` directory at the root level of the repo. There is also a print statement that shares the output path
> tts.synthesise(text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.", spk_ref_path="assets/bria.mp3")

OK, thanks. I'm testing it in Google Colab, so it will probably need some adjustments. Does the app do fast inference too? If so, I will just use that.

Another question: does this store the latents (not sure if that's the correct terminology) of each result somewhere so they can be reused?

vatsalaggarwal commented 7 months ago

> OK, thanks. I'm testing it in Google Colab, so it will probably need some adjustments.

OK, please let me know if you have any problems.

> Does the app do fast inference too?

Yes

> Another question: does this store the latents (not sure if that's the correct terminology) of each result somewhere so they can be reused?

Yes, these get cached to disk.

G-force78 commented 7 months ago

I have it working now, but on a T4 there doesn't seem to be an increase in speed; however, I haven't looked at the exact time it took. Where can I set the cache path so I can keep the latents?

vatsalaggarwal commented 7 months ago

> On a T4 there doesn't seem to be an increase in speed; however, I haven't looked at the exact time it took.

Yeah, it's possible that the T4 is too slow (compute- or memory-bandwidth-wise) for our speedups (inspired by gpt-fast) to matter. Our speedups mainly relate to: i) getting rid of CPU overhead (because other GPUs compute faster than the CPU can schedule ops), and ii) doing Triton compilation via torch.compile so that ops get fused...
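To make that second point concrete, here is a generic illustration of the technique (not this repo's code): compiling a module with mode="reduce-overhead" uses CUDA graphs to cut per-op CPU launch overhead, and op fusion into Triton kernels happens as part of compilation:

```python
import torch
import torch.nn as nn

# Generic torch.compile illustration, not code from metavoice-src.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda().half()
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 1024, device="cuda", dtype=torch.half)
with torch.no_grad():
    y = compiled(x)  # first call triggers (slow) compilation; later calls run the fused, graph-captured kernels
```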

Maybe the simplest way to improve speed on a T4 is to use int8? There is some code for this in gpt-fast, and I reckon it should be possible to apply it here.
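For anyone who wants to try that, a minimal sketch of the general weight-only int8 idea on a single nn.Linear (this is not code from this repo or from gpt-fast):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Int8Linear(nn.Module):
    """Weight-only int8 linear: weights stored as int8 plus per-row scales,
    dequantised on the fly, roughly halving weight memory traffic vs fp16."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()
        scales = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)  # per-output-channel symmetric scales
        self.register_buffer("weight_int8", torch.clamp((w / scales).round(), -128, 127).to(torch.int8))
        self.register_buffer("scales", scales)
        self.bias = linear.bias

    def forward(self, x):
        w = self.weight_int8.to(x.dtype) * self.scales.to(x.dtype)  # dequantise
        return F.linear(x, w, self.bias)

# Usage sketch: swap every nn.Linear in a model for its int8 counterpart.
def quantize_model(model: nn.Module) -> nn.Module:
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            setattr(model, name, Int8Linear(module))
        else:
            quantize_model(module)
    return model
```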

> Where can I set the cache path so I can keep the latents?

It defaults to ~/.cache (ref: https://github.com/metavoiceio/metavoice-src/blob/main/fam/llm/inference.py#L392-L435)... you can make changes there if you want to change the path.
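Since you're on Colab, one way to keep the cached latents across runtime resets (a sketch with a hypothetical Drive path, not something provided by this repo) is to point ~/.cache at Google Drive before loading the model:

```python
import os
from google.colab import drive

drive.mount("/content/drive")
persistent = "/content/drive/MyDrive/metavoice_cache"  # hypothetical Drive folder
os.makedirs(persistent, exist_ok=True)

# Replace ~/.cache with a symlink to the Drive folder so anything the library
# caches there (including the speaker latents) survives a runtime reset.
home_cache = os.path.expanduser("~/.cache")
if not os.path.islink(home_cache):
    os.system(f"rm -rf {home_cache}")  # note: discards the runtime's existing cache
    os.symlink(persistent, home_cache)
```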