metavoiceio / metavoice-src

Foundational model for human-like, expressive TTS
https://themetavoice.xyz/
Apache License 2.0
3.78k stars · 650 forks

MPS Support #1

Open fakerybakery opened 8 months ago

fakerybakery commented 8 months ago

Hi, congrats on the launch! Is MPS (Apple Silicon) or MLX support planned? Thank you!

sidroopdaska commented 8 months ago

Hey @fakerybakery, can you share your use-case in more detail? MPS / MLX support is not planned in the short term, but we'd love to collaborate if you would like to implement this for our model.

daankortenbach commented 8 months ago

Great work on metavoice, it sounds really good! Wish I could use it on my Mac.

Use case: all Apple Silicon devices (Mac, iPhone, iPad, Vision). There is a huge lack of good TTS for Apple platforms, and there is demand. Take note of @fakerybakery, he's everywhere and doing good work.

cocktailpeanut commented 8 months ago

Is it difficult to implement MPS support? The use case is basically the entire Apple ecosystem. I was going to write a cross-platform installer for the app (I work on https://pinokio.computer) and ended up here, but the fact that there's no plan to support MPS leaves me confused. I believe the true hockey-stick growth for your model will come not from a small number of entrepreneurs who run a server and charge for the service, but from every single user with a laptop.

fakerybakery commented 8 months ago

I think a major issue is that this uses flash-attn, which is currently not supported on either MPS or ROCm.
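For what it's worth, the portable alternative is PyTorch's built-in torch.nn.functional.scaled_dot_product_attention, which dispatches to whatever backend is available (CPU, CUDA, MPS) instead of requiring the flash-attn CUDA kernels. A minimal sketch only; the shapes and names below are illustrative, not the repo's actual variables:

import torch
import torch.nn.functional as F

# illustrative shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# unlike flash-attn, this runs without CUDA; PyTorch picks a supported backend
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)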

vatsalaggarwal commented 8 months ago

Yep, @fakerybakery is right. But a chunk of what's required to fix this is already implemented; it just hasn't been hooked up properly. I'll try to do it, but happy for someone to beat me to it!

Here's the ref: https://github.com/metavoiceio/metavoice-src/blob/main/fam/llm/layers/attn.py

It contains code to mix and match the following options:

So one needs to change some plumbing in the code (IIRC) to use the third option, and I believe it should work!
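Roughly, the plumbing change is to select the torch-attention path together with the vanilla KV cache wherever the code currently assumes flash-attn. A sketch of the idea only; the option names below are illustrative, not the actual config keys in fam/llm:

import torch

# hypothetical selection logic, not the repo's code
if torch.cuda.is_available():
    attn_impl, kv_cache_type = "flash_attn", "flash_decoding"
else:
    # MPS / CPU path: plain torch attention + vanilla KV caching
    attn_impl, kv_cache_type = "torch_attn", "vanilla"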

cocktailpeanut commented 8 months ago

That sounds awesome, and really glad to see that you guys care. Looking forward to the progress!

sidroopdaska commented 8 months ago

Thanks for participating. Can't wait to review a PR here :)

groovybits commented 7 months ago

+1 for this :) I need this so badly on my M2 Ultra Mac! Mimic3 TTS is just not great, and OpenAI costs a bit much for 24/7 use.

I may try to see if I can figure it out with the "torch attn + vanilla kv caching" option, but I'm not too hopeful of my luck with that :D Any details on what needs to be done would help me at least try; I just haven't done much MPS/GPU work (I hacked at getting bark.cpp working, but that was a kludge, not a full MPS dev experience).

I'm also curious where the resources are to learn more about this. I'm coding a lot and want this Mac to work with everything, so I might as well put some effort into MPS support like this if I can get the hang of it.

pyetras commented 7 months ago

Hey folks, I pushed a change that should fix the problems mentioned here, could you give it another go?

ahmetkca commented 7 months ago

Hey folks, I pushed a change that should fix the problems mentioned here, could you give it another go?

I tried to install requirements.txt but am now getting ModuleNotFoundError: No module named 'torch'. I am on an Apple Silicon Mac. Do I need to install torch separately?

sidroopdaska commented 7 months ago

Looking into this now :)

ahmetkca commented 7 months ago

Looking into this now :)

idk if this could help, https://github.com/facebookresearch/xformers/issues/740#issue-1695177874

ahmetkca commented 7 months ago

Looking into this now :)

idk if this could help, facebookresearch/xformers#740 (comment)

It seems like xFormers is not supported on Mac (not 100% sure), based on the following comment: https://github.com/facebookresearch/xformers/issues/740#issuecomment-1594080277

Edit: the xFormers GitHub repo says 'RECOMMENDED Linux & Win'.

vatsalaggarwal commented 7 months ago

I tried to get this working yesterday with the recent changes pushed by @pyetras, but unfortunately there are still some small issues beyond the ones outlined above... will keep this thread updated!

vatsalaggarwal commented 7 months ago

by the way, if you want to resolve the above installation errors, below works:

pip install torch torchvision torchaudio
pip install -r requirements.txt
pip install --upgrade torch torchvision torchaudio
ahmetkca commented 7 months ago

by the way, if you want to resolve the above installation errors, below works:

pip install torch torchvision torchaudio
pip install -r requirements.txt
pip install --upgrade torch torchvision torchaudio

I tried it but still no luck. No module named 'torch' found.

groovybits commented 7 months ago

I seem to have mine up to the point where flash-attn isn't available on my Mac, and it gets past xformers. I had to use Python 3.11 for some reason. My issue outside of Docker (which has a separate failure with code 139 that may also indicate MPS) is here: https://github.com/metavoiceio/metavoice-src/issues/48, where it says NameError: name 'flash_attn_with_kvcache' is not defined. I followed the same commands as in the Dockerfile in my fork branch, where I got it building in Docker and running up to the point of the crash; on native macOS it got further, up to the kv-cache flash-attn issue.

The log below shows the part that prints that flash_attn didn't load, and hence flash_attn_with_kvcache is not available.

 => [metavoice-server internal] load build definition from Dockerfile                                                                                                                                        0.0s
 => => transferring dockerfile: 937B                                                                                                                                                                         0.0s
 => [metavoice-server internal] load metadata for docker.io/library/python:3.11-slim                                                                                                                         1.2s
 => [metavoice-server internal] load .dockerignore                                                                                                                                                           0.0s
 => => transferring context: 2B                                                                                                                                                                              0.0s
 => [metavoice-server 1/7] FROM docker.io/library/python:3.11-slim@sha256:ce81dc539f0aedc9114cae640f8352fad83d37461c24a3615b01f081d0c0583a                                                                   0.1s
 => => resolve docker.io/library/python:3.11-slim@sha256:ce81dc539f0aedc9114cae640f8352fad83d37461c24a3615b01f081d0c0583a                                                                                    0.0s
 => => sha256:ce81dc539f0aedc9114cae640f8352fad83d37461c24a3615b01f081d0c0583a 1.65kB / 1.65kB                                                                                                               0.0s
 => => sha256:238b008604c229d4897d1fa131f9aaecddec61a199edbdb22851622dd65dcebd 1.37kB / 1.37kB                                                                                                               0.0s
 => => sha256:cfa17c2baa64b89de4e828d2f7f219dc88a61599b076ef7ea08c653f6df56b74 6.95kB / 6.95kB                                                                                                               0.0s
 => [metavoice-server internal] load build context                                                                                                                                                           0.2s
 => => transferring context: 29.01MB                                                                                                                                                                         0.2s
 => [metavoice-server 2/7] RUN apt-get update && apt-get install -y     ffmpeg     ninja-build     g++     git     curl     build-essential     libomp-dev     && rm -rf /var/lib/apt/lists/*               24.7s
 => [metavoice-server 3/7] WORKDIR /app                                                                                                                                                                      0.0s
 => [metavoice-server 4/7] COPY . .                                                                                                                                                                          0.0s
 => [metavoice-server 5/7] RUN MAX_JOBS=1 pip install --no-cache-dir "torch>=2.1.0"                                                                                                                         10.1s
 => [metavoice-server 6/7] RUN MAX_JOBS=1 pip install --no-cache-dir -r requirements.txt                                                                                                                   135.1s
 => [metavoice-server 7/7] RUN pip install -e .                                                                                                                                                              3.4s
 => [metavoice-server] exporting to image                                                                                                                                                                    2.8s
 => => exporting layers                                                                                                                                                                                      2.8s
 => => writing image sha256:d20f670219780639211c6c762183bafca3ae7c00f1e55aa597f20e6f43912086                                                                                                                 0.0s
 => => naming to docker.io/library/metavoice-server:latest                                                                                                                                                   0.0s
[+] Running 2/0
 ✔ Network metavoice-src-groovybits_metavoice-net  Created                                                                                                                                                   0.0s
 ✔ Container metavoice-server                      Created                                                                                                                                                   0.0s
Attaching to metavoice-server
metavoice-server  | WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
metavoice-server  |     PyTorch 2.2.0 with CUDA None (you have 2.1.0)
metavoice-server  |     Python  3.11.8 (you have 3.11.8)
metavoice-server  |   Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
metavoice-server  |   Memory-efficient attention, SwiGLU, sparse and more won't be available.
metavoice-server  |   Set XFORMERS_MORE_DETAILS=1 for more details
metavoice-server  | /usr/local/lib/python3.11/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
metavoice-server  |   from torchaudio.backend.common import AudioMetaData
metavoice-server  | /app/fam/llm/layers/attn.py:14: UserWarning: flash_attn not installed, make sure to replace attention mechanism with torch_attn
metavoice-server  |   warnings.warn("flash_attn not installed, make sure to replace attention mechanism with torch_attn")
Fetching 6 files: 100% 6/6 [00:00<00:00, 43996.20it/s]
metavoice-server  | number of parameters: 1239.00M
metavoice-server  | loading configuration file config.json from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/config.json
metavoice-server  | Model config EncodecConfig {
metavoice-server  |   "_name_or_path": "ArthurZ/encodec_24khz",
metavoice-server  |   "architectures": [
metavoice-server  |     "EncodecModel"
metavoice-server  |   ],
pyetras commented 7 months ago

You need to change kv_cache type to "vanilla" in https://github.com/metavoiceio/metavoice-src/blob/main/fam/llm/serving.py#L188 to avoid depending on flash_attn

groovybits commented 7 months ago

You need to change kv_cache type to "vanilla" in https://github.com/metavoiceio/metavoice-src/blob/main/fam/llm/serving.py#L188 to avoid depending on flash_attn

Thank you!

Update:

It hits an issue with the float dtype. Even when fiddling with it, changing the torch_attn choice to "hand" or forcing the dtypes to be right for q, k, v, it still fails in another place inside the torch package files. This is what I see when changing to the "vanilla" type...

All the weights of EncodecModel were initialized from the model checkpoint at facebook/encodec_24khz.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
2024-02-14 07:13:53 | INFO     | DF | Running on torch 2.1.0
2024-02-14 07:13:53 | INFO     | DF | Running on host earth.local
2024-02-14 07:13:53 | INFO     | DF | Git commit: eb7338abb, branch: stable
2024-02-14 07:13:53 | INFO     | DF | Loading model settings of DeepFilterNet3
2024-02-14 07:13:53 | INFO     | DF | Using DeepFilterNet3 model at /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3
2024-02-14 07:13:53 | INFO     | DF | Initializing model `deepfilternet3`
2024-02-14 07:13:53 | INFO     | DF | Found checkpoint /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2024-02-14 07:13:53 | INFO     | DF | Running on device cpu
2024-02-14 07:13:53 | INFO     | DF | Model loaded
INFO:     Started server process [84212]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:58003 (Press CTRL+C to quit)
getting cached speaker ref files:   0%|                                                                                                                                                     | 0/1 [00:00<?, ?it/s][src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
getting cached speaker ref files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.96it/s]
calculating speaker embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1264.87it/s]
batch:   0%|                                                                                                                                                                                | 0/1 [00:00<?, ?it/s][hack!!!!] Guidance is on, so we're doubling/tripling batch size!                                                                                                                        | 0/1728 [00:00<?, ?it/s]
tokens:   0%|                                                                                                                                                                            | 0/1728 [00:00<?, ?it/s]
batch:   0%|                                                                                                                                                                                | 0/1 [00:00<?, ?it/s]
Error processing request {'text': 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model by MetaVoice.', 'guidance': [3.0, 1.0], 'top_p': 0.95, 'speaker_ref_path': 'https://cdn.themetavoice.xyz/speakers/bria.mp3'}
Traceback (most recent call last):
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/serving.py", line 105, in text_to_speech
    wav_out_path = sample_utterance(
                   ^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 546, in sample_utterance
    return _sample_utterance_batch(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 477, in _sample_utterance_batch
    b_tokens = first_stage_model(
               ^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 356, in __call__
    return self.causal_sample(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 231, in causal_sample
    y = self.model.generate(
        ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/model.py", line 369, in generate
    return self._causal_sample(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 410, in _causal_sample
    batch_idx = self._sample_batch(
                ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 264, in _sample_batch
    idx_next = self._sample_next_token(
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 85, in _sample_next_token
    list_logits, _ = self(
                     ^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/model.py", line 282, in forward
    x = block(x)
        ^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/layers/combined.py", line 50, in forward
    x = x + self.attn(self.ln_1(x))
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/layers/attn.py", line 303, in forward
    y = self._torch_attn(c_x)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/layers/attn.py", line 231, in _torch_attn
    y = torch.nn.functional.scaled_dot_product_attention(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: c10::BFloat16 and value.dtype: c10::BFloat16 instead.
INFO:     127.0.0.1:58092 - "POST /tts HTTP/1.1" 500 Internal Server Error
pyetras commented 7 months ago

Strange error. Does MPS support bfloat16? You could try setting dtype="float16" in the common_config dict in serving; it looks like the bfloat16 is coming from the KV cache, while the model is running in float32.
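Something along these lines, as a sketch (the common_config shape here is hypothetical, not the repo's exact dict; the device check is just generic PyTorch):

import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
# bfloat16 support on MPS is patchy on older PyTorch builds, so prefer float16 there
dtype = "float16" if device == "mps" else "bfloat16"
common_config = dict(device=device, dtype=dtype)  # hypothetical dict layout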

AbeEstrada commented 7 months ago

I'm working on this https://github.com/abeestrada/metavoice-src/commit/9f2bab004fc9a46f19fa5c4ca154b82b4c332b87

python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/bria.mp3" --dtype="float32" --use_kv_cache="vanilla"

Now I'm stuck at this error, maybe flash_attn needs to be replaced with something else

NameError: name 'flash_attn_qkvpacked_func' is not defined

When I use --dtype="float16", this is the error:

  File "/Code/metavoice-sirc/fam/llm/sample.py", line 207, in causal_sample
    assert x[i, 0, : seq_lens[i]].tolist() == encoded_texts[i]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
AbeEstrada commented 7 months ago

I made torch_attn the default (while debugging), and first I got this:

failed to run MBD.
reason: argument 'tokens': 'float' object cannot be interpreted as an integer

Fixed with:

to_return.append(self.decoder.decode(tokens=int(tokens.item()), causal=False))

And we got some progress:

failed to run MBD.
reason: a Tensor with 8192 elements cannot be converted to Scalar

https://github.com/AbeEstrada/metavoice-src/commits/mps/

pyetras commented 7 months ago

@AbeEstrada You shouldn't be updating the dtypes when the tensor is torch.long; this is what causes errors down the line.
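i.e. something like this guard, so only floating-point tensors get cast (a sketch, not the branch's actual code):

import torch

def maybe_cast(t: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # token-id tensors must stay torch.long, otherwise embedding lookups
    # and the decode step downstream break
    return t.to(dtype) if t.is_floating_point() else t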

AbeEstrada commented 7 months ago

Made the changes https://github.com/AbeEstrada/metavoice-src/commit/f4415864c31cab010d6034c8f965de855ead0929

I was able to run it with float32

python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/bria.mp3" --dtype="float32"

Also, torch.amp.autocast doesn't support MPS yet (https://github.com/pytorch/pytorch/issues/88415), so I used the CPU in the meantime (https://github.com/AbeEstrada/metavoice-src/commit/a77fdb9fd4d94e4bb4c16dc86c4a62a580f874f1); maybe there is a .to(device) missing somewhere.

failed to run MBD.
reason: Placeholder storage has not been allocated on MPS device!
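On the autocast point, a common pattern while MPS autocast is unsupported is to make the context conditional (a sketch, not the branch's actual code):

import contextlib
import torch

def autocast_ctx(device_type: str, dtype: torch.dtype):
    # torch.amp.autocast has no MPS backend yet (pytorch/pytorch#88415),
    # so skip mixed precision entirely when running on mps
    if device_type == "mps":
        return contextlib.nullcontext()
    return torch.autocast(device_type=device_type, dtype=dtype)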

And the API should work now with --use_kv_cache="vanilla" (https://github.com/AbeEstrada/metavoice-src/commit/77d84a02121f6dcc4c43d412e8110582e2ac959b). I'm not using the right X-Payload format; can anyone give it a try with my branch (https://github.com/AbeEstrada/metavoice-src/commits/mps)?

python fam/llm/serving.py --huggingface_repo_id "metavoiceio/metavoice-1B-v0.1" --dtype "float32" --use_kv_cache="vanilla"
groovybits commented 7 months ago

Looks the same to me on an M2, I get the....

failed to run MBD.
reason: Placeholder storage has not been allocated on MPS device!

Good job! Seems closer, very exciting.

rohitsainier commented 7 months ago

While running with vanilla I'm getting this warning:

UserWarning: flash_attn not installed, make sure to replace attention mechanism with torch_attn
  warnings.warn("flash_attn not installed, make sure to replace attention mechanism with torch_attn")
╭─ Unrecognized options ───────────────────────╮
│ Unrecognized options: --use-kv-cache=vanilla │
│ ──────────────────────────────────────────── │
│ For full helptext, run serving.py --help     │
╰──────────────────────────────────────────────╯
[2024-02-18 21:34:39,479] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-02-18 21:34:39,479] torch._dynamo.utils: [INFO] Function, Runtimes (s)

rohitsainier commented 7 months ago

I have removed flash_attn from the requirements.

rohitsainier commented 7 months ago

by the way, if you want to resolve the above installation errors, below works:

pip install torch torchvision torchaudio
pip install -r requirements.txt
pip install --upgrade torch torchvision torchaudio

I tried it but still no luck. No module named 'torch' found.

Remove flash_attn from requirements.txt before installing the deps.

Run: pip install --upgrade pip setuptools wheel

Then install the deps using: pip install -r requirements.txt

It worked for me.

mattkanwisher commented 7 months ago

It seems like macOS still doesn't work even for the most basic sample?


(venv) ➜  metavoice-src git:(main) python fam/llm/sample.py --device="cpu" --spk_cond_path="assets/bria.mp3" --text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model." --dtype="bfloat16"
objc[19467]: Class AVFFrameReceiver is implemented in both /Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x102cd4760) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x144574370). One of the two will be used. Which one is undefined.
objc[19467]: Class AVFAudioReceiver is implemented in both /Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x102cd47b0) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x1445743c0). One of the two will be used. Which one is undefined.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.0 with CUDA None (you have 2.2.0)
    Python  3.10.11 (you have 3.10.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
/Users/username/projects/ai/metavoice-src/fam/llm/layers/attn.py:10: UserWarning: flash_attn not installed, make sure to replace attention mechanism with torch_attn
  warnings.warn("flash_attn not installed, make sure to replace attention mechanism with torch_attn")
Fetching 6 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 119837.26it/s]
number of parameters: 1239.00M
number of parameters: 14.07M
getting cached speaker ref files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.18it/s]
calculating speaker embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1766.77it/s]
batch:   0%|                                                                                                                                                                                                            | 0/1 [00:00<?, ?it/s[hack!!!!] Guidance is on, so we're doubling/tripling batch size!                                                                                                                                                     | 0/1728 [00:00<?, ?it/s]
tokens:   0%|                                                                                                                                                                                                        | 0/1728 [00:00<?, ?it/s]
batch:   0%|                                                                                                                                                                                                            | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 700, in <module>
    sample_utterance(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 544, in sample_utterance
    return _sample_utterance_batch(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 475, in _sample_utterance_batch
    b_tokens = first_stage_model(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 354, in __call__
    return self.causal_sample(
  File "/Users/username/projects/ai/metavoice-src/fam/llm/sample.py", line 229, in causal_sample
    y = self.model.generate(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/model.py", line 369, in generate
    return self._causal_sample(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/mixins/causal.py", line 410, in _causal_sample
    batch_idx = self._sample_batch(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/mixins/causal.py", line 264, in _sample_batch
    idx_next = self._sample_next_token(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/mixins/causal.py", line 85, in _sample_next_token
    list_logits, _ = self(
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/model.py", line 282, in forward
    x = block(x)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/layers/combined.py", line 50, in forward
    x = x + self.attn(self.ln_1(x))
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/layers/attn.py", line 221, in forward
    y = self._torch_attn(c_x)
  File "/Users/username/projects/ai/metavoice-src/fam/llm/layers/attn.py", line 189, in _torch_attn
    y = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: c10::BFloat16 and value.dtype: c10::BFloat16 instead.
groovybits commented 7 months ago

Yes, something is still wrong. I hit this error on the previously posted branch with the work towards it...

failed to run MBD.
reason: Placeholder storage has not been allocated on MPS device!

I'm also watching the Candle (Rust) patch that is "in progress" right now, which will allow another path for this. Not sure how long that will take, but Candle and Metal play nicely together, and it would be a clean, efficient, safe, yet compiled option.

vatsalaggarwal commented 7 months ago

Which branch is this?

Just to be sure, my understanding is that the first-stage and second-stage models work correctly, but we're now stuck on Audiocraft's multi-band diffusion. Out of curiosity, how long is the synthesis for the first and second stages taking? (There will be tqdm progress bars you can read the time off.)

Could someone send the whole stack trace for the placeholder storage error? It looks like either: 1) something is not being allocated on MPS (ref: https://discuss.pytorch.org/t/torch-embedding-fails-with-runtimeerror-placeholder-storage-has-not-been-allocated-on-mps-device/152124), or 2) something related to the multi-band diffusion from Audiocraft, which could be how we're using it or could be a bug in their code.
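On 1), that error usually just means the inputs and the module weights ended up on different devices, e.g. an input tensor left on CPU while the module is on MPS. The generic fix is to move both explicitly; a sketch with placeholder names (module / inputs), not pointing at a specific line in our code:

import torch

device = torch.device("mps")
module = module.to(device)   # model weights on MPS
inputs = inputs.to(device)   # inputs must live on the same device as the weights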

I want to set expectations here... we can produce 10 seconds of speech in seconds on a GPU. I've a feeling that it might be closer to minutes if you're using CPU or even potentially MPS.

I think to get this working at a reasonable speed, we'll have to go towards using the ANE on the Mac, for which we'll need coremltools to convert the model into Apple's format (ref: https://apple.github.io/coremltools/docs-guides/source/convert-pytorch-workflow.html)... there are a couple of "gotchas" here, but this is likely to be significantly faster for inference than MPS/CPU...
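For reference, the coremltools route roughly follows the documented trace-then-convert workflow. This is a generic sketch (placeholder input shape, and `first_stage` standing in for whichever nn.Module gets exported), not something tested against this model:

import torch
import coremltools as ct

first_stage.eval()
example = torch.randn(1, 1024)                  # placeholder input shape
traced = torch.jit.trace(first_stage, example)  # TorchScript trace first
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,           # let Core ML schedule onto the ANE where it can
)
mlmodel.save("first_stage.mlpackage")

The usual gotchas are dynamic shapes and ops without Core ML equivalents, which is where conversion tends to need model-specific work.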

groovybits commented 7 months ago

Here is the full stacktrace. This happens immediately so I cannot tell how long it would take...

loading weights file model.safetensors from cache at /Users/chris/.cache/huggingface/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/model.safetensors
All model checkpoint weights were used when initializing EncodecModel.

All the weights of EncodecModel were initialized from the model checkpoint at facebook/encodec_24khz.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
2024-02-21 05:20:59 | INFO     | DF | Running on torch 2.1.0
2024-02-21 05:20:59 | INFO     | DF | Running on host earth
2024-02-21 05:20:59 | INFO     | DF | Git commit: eb7338abb, branch: stable
2024-02-21 05:20:59 | INFO     | DF | Loading model settings of DeepFilterNet3
2024-02-21 05:20:59 | INFO     | DF | Using DeepFilterNet3 model at /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3
2024-02-21 05:20:59 | INFO     | DF | Initializing model `deepfilternet3`
2024-02-21 05:20:59 | INFO     | DF | Found checkpoint /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2024-02-21 05:20:59 | INFO     | DF | Running on device cpu
2024-02-21 05:20:59 | INFO     | DF | Model loaded
INFO:     Started server process [29194]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:58003 (Press CTRL+C to quit)
getting cached speaker ref files:   0%|                                                                                                                                         | 0/1 [00:00<?, ?it/s][src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
getting cached speaker ref files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.18it/s]
calculating speaker embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1013.12it/s]
tokens:   0%|                                                                                                                                                                | 0/1728 [00:00<?, ?it/s]
batch:   0%|                                                                                                                                                                    | 0/1 [00:00<?, ?it/s]
Error processing request {'text': 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model by MetaVoice.', 'guidance': 3.0, 'top_p': 0.95, 'speaker_ref_path': 'https://cdn.themetavoice.xyz/speakers/bria.mp3'}
Traceback (most recent call last):
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/serving.py", line 109, in text_to_speech
    wav_out_path = sample_utterance(
                   ^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/sample.py", line 547, in sample_utterance
    return _sample_utterance_batch(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/sample.py", line 478, in _sample_utterance_batch
    b_tokens = first_stage_model(
               ^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/sample.py", line 357, in __call__
    return self.causal_sample(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/sample.py", line 232, in causal_sample
    y = self.model.generate(
        ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/model.py", line 370, in generate
    return self._causal_sample(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/mixins/causal.py", line 410, in _causal_sample
    batch_idx = self._sample_batch(
                ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/mixins/causal.py", line 264, in _sample_batch
    idx_next = self._sample_next_token(
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/mixins/causal.py", line 85, in _sample_next_token
    list_logits, _ = self(
                     ^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-mps/fam/llm/model.py", line 259, in forward
    speakers_embedded = self.speaker_cond_pos(speaker_embs)  # shape (b, num_examples, c)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Placeholder storage has not been allocated on MPS device!
INFO:     127.0.0.1:54670 - "POST /tts HTTP/1.1" 500 Internal Server Error
vatsalaggarwal commented 7 months ago

~is https://github.com/groovybits/metavoice-src/tree/docker the right branch? Can you open a PR with the code please?~ found it

vatsalaggarwal commented 7 months ago

I fixed this error in a hacky way and have opened a PR here: https://github.com/metavoiceio/metavoice-src/pull/69

@AbeEstrada fyi, it was a minor change on top, but wasn't able to push to your branch so ended up pushing here... the relevant commit is https://github.com/metavoiceio/metavoice-src/pull/69/commits/8a0a96fa990f8de73920195fb547228ffda4acc2

vatsalaggarwal commented 7 months ago

Sadly, we now run into the following error from multiband diffusion :(

getting cached speaker ref files: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.49it/s]
calculating speaker embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 566.49it/s]
batch:   0%|                                                                                                                          | 0/1 [00:00<?, ?it/s[hack!!!!] Guidance is on, so we're doubling/tripling batch size!                                                                   | 0/1728 [00:00<?, ?it/s]
tokens:   4%|████▍                                                                                                        | 71/1728 [00:16<06:21,  4.34it/s]
batch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.38s/it]
Text: Hi.
non-causal batching: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.11it/s]
Text: Hi.
failed to run MBD.
reason: The operator 'aten::_fft_r2c' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
Traceback (most recent call last):
  File "/Users/vatsalaggarwal/Documents/workspace/metavoice-src/fam/llm/sample.py", line 335, in non_causal_sample
    to_return.append(self.decoder.decode(tokens=tokens.tolist(), causal=False))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/Documents/workspace/metavoice-src/fam/llm/decoders.py", line 87, in decode
    wav = self.mbd.tokens_to_wav(tokens)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/audiocraft/models/multibanddiffusion.py", line 190, in tokens_to_wav
    wav_diffusion = self.generate(emb=condition, size=wav_encodec.size())
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/audiocraft/models/multibanddiffusion.py", line 148, in generate
    out += DP.generate(condition=emb, step_list=step_list, initial_noise=torch.randn_like(out))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/audiocraft/models/multibanddiffusion.py", line 44, in generate
    return self.schedule.generate_subsampled(model=self.model, initial=initial_noise, step_list=step_list,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/audiocraft/modules/diffusion_schedule.py", line 272, in generate_subsampled
    return self.sample_processor.return_sample(previous)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/audiocraft/modules/diffusion_schedule.py", line 106, in return_sample
    bands = self.split_bands(x)
            ^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/julius/bands.py", line 84, in forward
    lows = self.lowpass(input)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/julius/lowpass.py", line 107, in forward
    out = fft_conv1d(input, self.filters, stride=self.stride)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/julius/fftconv.py", line 119, in fft_conv1d
    weight_z = _rfft(weight)
               ^^^^^^^^^^^^^
  File "/Users/vatsalaggarwal/miniconda3/envs/os-metavoice/lib/python3.11/site-packages/julius/fftconv.py", line 26, in _new_rfft
    z = new_fft.rfft(x, dim=-1)
        ^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: The operator 'aten::_fft_r2c' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
vatsalaggarwal commented 7 months ago

Which can be fixed by using CPU fallback! So the following command works:

PYTORCH_ENABLE_MPS_FALLBACK=1 python fam/llm/sample.py \
  --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" \
  --spk_cond_path="assets/bria.mp3" \
  --text "Hi."

I've pushed all the changes to the same PR
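If you're running this from a script rather than the shell, I believe the variable can also be set from Python, as long as it happens before torch is imported (untested on my side):

import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")  # must come before `import torch`
import torch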

groovybits commented 7 months ago

~That works!~ It's super slow, but it ~works~ almost works when running sample.py with the fallback env var...

Update: fails at the end still...

failed to run MBD.
reason: Placeholder storage has not been allocated on MPS device!
 PYTORCH_ENABLE_MPS_FALLBACK=1 python fam/llm/sample.py \
  --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" \
  --spk_cond_path="assets/bria.mp3" \
  --text "Hi I am an AI that loves to sleep"

tokens: 9%|████████████▉ | 149/1728 [00:56<13:43, 1.92it/s]

My M2 Ultra is using the GPU for it and not much CPU, yet it's definitely slow at 1.92 it/s, so I suspect this is what was meant about MPS being slow for now, given the need to convert the model for Apple's stack.

vatsalaggarwal commented 7 months ago

Whoops... I had forgotten to git push, lol, so sorry! Can you try now? The last bit should work.

Yeah, it is VERY slow, although I'm not quite sure why... I know that the ANE is much faster than MPS, but these speeds almost feel like CPU speeds, so I'm not sure what's going on... @fakerybakery any ideas?

groovybits commented 7 months ago

Works! I see these warnings, and of course it takes about 4 minutes for roughly 10 seconds of audio on an M2 Ultra :)

/opt/homebrew/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:26: UserWarning: The operator 'aten::_weight_norm_interface' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)
  return _weight_norm(v, g, self.dim)

Saved audio to /Users/chris/code/rsllm/metavoice-src-mps2/samples/synth_Hi_I_am_an_AI_that_loves__fbe42b51-c4a5-4725-8e2e-fc0a0c234a5a.wav
2024-02-21 15:34:12 | INFO     | DF | Running on torch 2.1.0
2024-02-21 15:34:12 | INFO     | DF | Running on host earth
2024-02-21 15:34:13 | INFO     | DF | Git commit: eb7338abb, branch: stable
2024-02-21 15:34:13 | INFO     | DF | Loading model settings of DeepFilterNet3
2024-02-21 15:34:13 | INFO     | DF | Using DeepFilterNet3 model at /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3
2024-02-21 15:34:13 | INFO     | DF | Initializing model `deepfilternet3`
2024-02-21 15:34:13 | INFO     | DF | Found checkpoint /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2024-02-21 15:34:13 | INFO     | DF | Running on device cpu
2024-02-21 15:34:13 | INFO     | DF | Model loaded
2024-02-21 15:34:13 | WARNING  | DF | Audio sampling rate does not match model sampling rate (24000, 48000). Resampling...
/opt/homebrew/lib/python3.11/site-packages/torchaudio/functional/functional.py:1371: UserWarning: "sinc_interpolation" resampling method name is being deprecated and replaced by "sinc_interp_hann" in the next release. The default behavior remains unchanged.
  warnings.warn(

Also, there is a crackle at the end of the wav file.

vatsalaggarwal commented 7 months ago

Also, there is a crackle at the end of the wav file.

Could be a bug / bfloat16->float16 related

groovybits commented 7 months ago

Yeah, it's just a slight one, not too loud. I need to play with it more to see.

In addition, I tried another audio file of a voice I have that is 30+ seconds long. It outputs an audio file, but the result sounds more like abstract tones, vaguely similar to what it should have said but weird-sounding. I'm guessing it's unrelated; it seems the model must be particular about the voice sample given (mine is the same format/codec etc. as the example mp3).

Thanks!

vatsalaggarwal commented 7 months ago

That's weird. @groovybits, can you try the same file as a reference via ttsdemo.themetavoice.xyz? Curious to understand whether it's the model or fp16 vs bf16.

I would also use a 256 kbps mp3 instead of the 64 kbps one in the example. Ideally, a high-quality wav.

groovybits commented 7 months ago

It's an old, low-quality voicemail recording from years ago. I tried to enhance it, and it ends up sounding more like a Kermit the Frog version of the voice. It only works if I "enhance" it at resemble.ai; denoising there or elsewhere produces the fully non-working output. I think it's beyond repair for now in general, but it is odd that something is wrong until "enhance" is run on it with that site's model.

groovybits commented 7 months ago

I see that Candle (Rust) has a MetaVoice PR that is "in progress" but fairly far along. It seems like a good focus for MPS/Metal, giving a Rust TTS option (sorely lacking out there for local models this good).

https://github.com/huggingface/candle/pull/1717

Does that look close? I would love to help but am a bit lost on where to begin, yet I definitely need this really badly...

My all-in-Rust project (https://github.com/groovybits/rsllm) needs a TTS/LLM/TTI set to get the full thinking/speaking/image multi-modal output :P so I have a lot of motivation to help with the Rust part especially.

vatsalaggarwal commented 7 months ago

Thanks @groovybits... it's unclear to me how much performance you can squeeze out of MPS :( I have more experience with the ANE side of things... I looked at the PR and left some comments; we need the following things:

We are also super keen to make this happen. We're a bit constrained on capacity, but I've left a note for the contributor to see if there is a way we can collaborate to make this happen sooner.

vatsalaggarwal commented 7 months ago

@groovybits / other folks on this thread: https://github.com/huggingface/candle/pull/1717 has been merged, if you want to give it a try. There are some instructions here: https://github.com/huggingface/candle/tree/main/candle-examples/examples/metavoice

I think the quality is not the same as the Python implementation though, as it's not an exact copy of it yet, so just something to keep in mind!

groovybits commented 7 months ago

Yes, this is great. It's coming along in Rust, and I already have it implemented in its current early state in my Rust application!

I can see the Metal speedups being worked on will be necessary to get closer to realtime. It's still really slow, and it seems to get slower with longer sentences (short sentences run at about 2x realtime right now, while longer ones take 4x or more; the Metal speedups I tested in the Candle PR roughly halved the runtime, at least on small inputs). I'm not 100% sure yet, since I've only just started trying the Metal optimizations for Candle in Rust, and they seem "almost done". It has the wavy sound, but it sounds like it's in good hands and being worked on. Definitely very cool to have it this far already; I hadn't expected to get it into my Rust program except via a Python-based API server :D

groovybits commented 7 months ago

Oh, also, I see the previous Python method doesn't work on MPS anymore (sample.py is missing; the new script isn't the same?).

Is there a branch with the MPS fixes for Python currently? I'd like to compare, see the differences, and explore a bit :).

groovybits commented 7 months ago

Update: I took the older MPS changes and merged the newest main into them, but I get this...

chris@earth metavoice-src-groovybits % time PYTORCH_ENABLE_MPS_FALLBACK=1 python3.11 fam/llm/fast_inference.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/bria.mp3" --text "hi how are you doing? I am going to the store to buy some AI food. Please let me know if you use GPT-4 and how much Anime you like to watch."
objc[74688]: Class AVFFrameReceiver is implemented in both /opt/homebrew/lib/python3.11/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x1067b4760) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x2d8880370). One of the two will be used. Which one is undefined.
objc[74688]: Class AVFAudioReceiver is implemented in both /opt/homebrew/lib/python3.11/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x1067b47b0) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x2d88803c0). One of the two will be used. Which one is undefined.
/opt/homebrew/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/opt/homebrew/lib/python3.11/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
using dtype=float16
using dtype=float16
Fetching 6 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 70492.50it/s]
number of parameters: 14.07M
2024-03-06 17:37:42 | INFO     | DF | Running on torch 2.1.0
2024-03-06 17:37:42 | INFO     | DF | Running on host earth
2024-03-06 17:37:42 | INFO     | DF | Git commit: c6d959218, branch: stable
2024-03-06 17:37:42 | INFO     | DF | Loading model settings of DeepFilterNet3
2024-03-06 17:37:42 | INFO     | DF | Using DeepFilterNet3 model at /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3
2024-03-06 17:37:42 | INFO     | DF | Initializing model `deepfilternet3`
2024-03-06 17:37:42 | INFO     | DF | Found checkpoint /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2024-03-06 17:37:42 | INFO     | DF | Running on device cpu
2024-03-06 17:37:42 | INFO     | DF | Model loaded
Using device=cpu
Loading model ...
using dtype=float16
Time to load model: 0.88 seconds
Compiling...Can take up to 2 mins.
[2024-03-06 17:37:47,865] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_24
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_23
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_22
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_21
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_20
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_19
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_18
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_17
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_16
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_15
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_14
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_13
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_12
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_11
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_10
[2024-03-06 17:37:47,866] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_9
[2024-03-06 17:37:47,867] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_8
[2024-03-06 17:37:47,867] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_7
[2024-03-06 17:37:47,867] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_6
[2024-03-06 17:37:47,867] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_5
[2024-03-06 17:37:47,867] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_4
[2024-03-06 17:37:47,867] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_3
[2024-03-06 17:37:47,867] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_2
[2024-03-06 17:37:47,867] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split_1
[2024-03-06 17:37:47,867] [0/0] torch._inductor.fx_passes.split_cat: [WARNING] example value absent for node: split
/opt/homebrew/lib/python3.11/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  torch.has_cuda,
/opt/homebrew/lib/python3.11/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
  torch.has_cudnn,
/opt/homebrew/lib/python3.11/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  torch.has_mps,
/opt/homebrew/lib/python3.11/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
  torch.has_mkldnn,
Traceback (most recent call last):
  File "/Users/chris/code/metavoice/metavoice-src-groovybits/fam/llm/fast_inference.py", line 143, in <module>
    tts = TTS()
          ^^^^^
  File "/Users/chris/code/metavoice/metavoice-src-groovybits/fam/llm/fast_inference.py", line 69, in __init__
    self.model, self.tokenizer, self.smodel, self.model_size = build_model(
                                                               ^^^^^^^^^^^^
  File "/Users/chris/code/metavoice/metavoice-src-groovybits/fam/llm/fast_inference_utils.py", line 362, in build_model
    y = generate(
        ^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/metavoice/metavoice-src-groovybits/fam/llm/fast_inference_utils.py", line 208, in generate
    next_token = prefill(model, prompt.view(1, -1).repeat(2, 1), spk_emb, input_pos, **sampling_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/metavoice/metavoice-src-groovybits/fam/llm/fast_inference_utils.py", line 120, in prefill
    def prefill(
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 3905, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1482, in g
    return f(*args)
           ^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 2533, in runtime_wrapper
    all_outs = call_func_with_args(
               ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1506, in call_func_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1594, in rng_functionalization_wrapper
    return compiled_fw(args)
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 374, in __call__
    return self.get_current_callable()(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 401, in _run_from_cache
    return compiled_graph.compiled_artifact(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/xf/xfgx0g516xj7h447j0rkp5yr0000gn/T/torchinductor_chris/ye/cyehh3z5rft6glubp27enf6sqnh2ugnnlhw4dwx4ta4ah5wo3jqi.py", line 11365, in call
    extern_kernels.mm(arg225_1, reinterpret_tensor(arg51_1, (256, 2048), (1, 256), 0), out=buf0)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
PYTORCH_ENABLE_MPS_FALLBACK=1 python3.11 fam/llm/fast_inference.py   --text   34.58s user 24.74s system 159% cpu 37.110 total

My branch with that merge is here; I'm checking whether it's just the newly added code that needs MPS changes, and that seems to be the case. RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' looks suspiciously like that kind of issue.

https://github.com/groovybits/metavoice-src/compare/main...groovybits:metavoice-src:mps
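
For anyone hitting the same wall: that error generally just means half-precision weights ended up on a backend whose matmul kernels have no Half implementation (the log above shows Using device=cpu together with dtype=float16). A minimal sketch of the kind of device/dtype guard that would avoid it; pick_device_and_dtype is a hypothetical helper for illustration, not something in the repo:

import torch
import torch.nn as nn

def pick_device_and_dtype():
    """Pick a backend plus a dtype whose matmul kernels actually exist there."""
    if torch.cuda.is_available():
        return torch.device("cuda"), torch.float16
    if torch.backends.mps.is_available():
        # fp16 op coverage on MPS is partial, so fall back to fp32 there too
        return torch.device("mps"), torch.float32
    # CPU addmm has no Half kernel, which is exactly the error above
    return torch.device("cpu"), torch.float32

device, dtype = pick_device_and_dtype()
layer = nn.Linear(256, 2048).to(device=device, dtype=dtype)
x = torch.randn(1, 256, device=device, dtype=dtype)
y = layer(x)  # succeeds instead of raising "addmm_impl_cpu_" not implemented for 'Half'

Presumably the equivalent change in the repo is to only pick float16 when the chosen device actually supports it, but I haven't traced exactly where that decision is made yet.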

groovybits commented 7 months ago

Also, when I use the previous MPS branch, it creates a 48 kHz wav that plays back really fast? Something seems strange there; I don't think it did that before :/ While generating it says the audio is at a different sample rate than the one set and that it is adjusting, and the result doesn't come out right at all.
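
Audio that plays back too fast usually means the samples were produced at one rate but the file header advertises a higher one. A minimal sketch of explicitly resampling before saving with torchaudio; the 24 kHz model rate here is an assumption for illustration, not taken from the codebase:

import torch
import torchaudio

model_sr = 24_000    # assumed rate the decoder actually produced
target_sr = 48_000   # rate the saved file is expected to advertise

wav = torch.randn(1, model_sr)  # stand-in for one second of generated audio
wav_48k = torchaudio.functional.resample(wav, orig_freq=model_sr, new_freq=target_sr)
torchaudio.save("out.wav", wav_48k, sample_rate=target_sr)

If the samples are written at one rate while the header claims the other, you get exactly this sped-up playback.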