huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

MetaVoice? #1713

Open groovybits opened 4 months ago

groovybits commented 4 months ago

MetaVoice seems amazing as a TTS model, allowing instant one-shot training on any voice (sounds too good to be true).

https://github.com/metavoiceio/metavoice-src/issues/1

It has some issues with MPS of course, and it would be nice to bring it into candle. Is this technically possible?

I haven't looked closely. I could try it myself, but I suspect it's a big job? Putting it on the radar if it isn't already, since I really need this :D and many others certainly do too: it's a missing piece in "good" TTS that is fully open/free, especially running in Rust like this!

LaurentMazare commented 4 months ago

We're certainly lacking a good TTS example at the moment, as pointed out in #1428 (we already cover speech to text with whisper, and both image to text and text to image). I started putting together a musicgen example but didn't finish it. It's based on encodec, which metavoice also uses, so I might well resume work on this. I think it's actually a bit of work as it's a new type of modality, but probably not impossible either.

groovybits commented 4 months ago

Nice, yes I love musicgen too so that sounds amazing!

tlightsky commented 4 months ago

https://github.com/RVC-Boss/GPT-SoVITS also seems very good at voice generation.

LaurentMazare commented 4 months ago

An initial version of metavoice is now available in #1717; you can give it a shot with this example. Please let us know how it goes. Note that speaker embeddings are not available at the moment, so there is no voice cloning, and the quality can probably be improved.

phudtran commented 4 months ago

Just tried it out on my M2 (CPU). It took about a minute, but it works!

groovybits commented 4 months ago

Exciting, amazing progress!

I seem to get a failure using either CPU or Metal. With CPU, as you can see here, it prints some information but then shows no CPU/GPU activity and sits there forever. With Metal, it prints an error about a missing function...

MacBook-Pro:candle christi$ cargo run --example metavoice --release -- --prompt "This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model." --out-file out.wav --tracing

    Finished release [optimized] target(s) in 0.32s
     Running `target/release/examples/metavoice --prompt 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.' --out-file out.wav --tracing`
avx: false, neon: false, simd128: false, f16c: false
Running on CPU, to run on GPU, build this example with `--features cuda`
prompt: 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.'
[2133, 2153, 2320, 2388, 2307, 2434, 2158, 2160, 2328, 2305, 2150, 2169, 2165, 2327, 2311, 2456, 2150, 2419, 2452, 2428, 2377, 2146, 2135, 2160, 2355, 2150, 2094, 2098, 2115, 2093, 2399, 2313, 2161, 2325, 2094, 2164, 2483, 2374, 2323, 2514, 2487, 2380, 2307, 2166, 2149, 2154, 2160, 2321, 2160, 2149, 2150, 2157, 2095, 2561]
^C
MacBook-Pro:candle christi$ cargo run --example metavoice --release --features metal -- --prompt "This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model."

    Finished release [optimized] target(s) in 0.92s
     Running `target/release/examples/metavoice --prompt 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.'`
avx: false, neon: false, simd128: false, f16c: false
prompt: 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.'
[2133, 2153, 2320, 2388, 2307, 2434, 2158, 2160, 2328, 2305, 2150, 2169, 2165, 2327, 2311, 2456, 2150, 2419, 2452, 2428, 2377, 2146, 2135, 2160, 2355, 2150, 2094, 2098, 2115, 2093, 2399, 2313, 2161, 2325, 2094, 2164, 2483, 2374, 2323, 2514, 2487, 2380, 2307, 2166, 2149, 2154, 2160, 2321, 2160, 2149, 2150, 2157, 2095, 2561]
Error: Metal error Error while loading function: "Function 'cast_bf16_f32' does not exist"

Caused by:
    Error while loading function: "Function 'cast_bf16_f32' does not exist"

Thank you!

phudtran commented 4 months ago

Metal isn't supported yet, but on CPU it also took a bit of time for me. You just have to let it run; it will finish eventually.

groovybits commented 4 months ago

Ah yes, I see that now after quite a while. Yes, it works here too, thank you!

Running on CPU, to run on GPU, build this example with `--features cuda`
prompt: 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.'
[2133, 2153, 2320, 2388, 2307, 2434, 2158, 2160, 2328, 2305, 2150, 2169, 2165, 2327, 2311, 2456, 2150, 2419, 2452, 2428, 2377, 2146, 2135, 2160, 2355, 2150, 2094, 2098, 2115, 2093, 2399, 2313, 2161, 2325, 2094, 2164, 2483, 2374, 2323, 2514, 2487, 2380, 2307, 2166, 2149, 2154, 2160, 2321, 2160, 2149, 2150, 2157, 2095, 2561]
text ids len: 55
sampling from logits...
codes: [[[1109, 1129, 1296, ...,  738,  408, 1024],
  [1024, 1024, 1024, ...,  913,  424, 1024],
  [1024, 1024, 1024, ...,  786,   36, 1024],
  ...
  [1024, 1024, 1024, ...,  881, 1011, 1024],
  [1024, 1024, 1024, ..., 1015,  853, 1024],
  [1024, 1024, 1024, ..., 1019,  948, 1024]]]
Tensor[[1, 8, 538], u32]
text_ids len: 54
audio_ids shape: [1, 8, 483]
output pcm shape: [1, 1, 154930]
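
The shapes in the log above appear to follow a simple pattern: the generated tensor has 8 rows of length 538, the prompt accounts for 54 positions, and one trailing position holds the value 1024 in every row, which leaves the 483 audio frames reported as `audio_ids`. The sketch below only reproduces that arithmetic and converts a PCM length into seconds. The helper names are hypothetical, and the 24 kHz sample rate is an assumption (the usual EnCodec 24 kHz setting); the log itself does not state it.

```rust
/// Number of audio frames left after removing the text prompt positions
/// and one trailing end marker from the generated sequence.
/// Returns None if the sequence is too short for that layout.
fn audio_frames(total_len: usize, text_len: usize) -> Option<usize> {
    total_len.checked_sub(text_len)?.checked_sub(1)
}

/// Duration in seconds of a mono PCM buffer.
/// The sample rate is supplied by the caller (24_000 is an assumption here).
fn pcm_seconds(num_samples: usize, sample_rate_hz: u32) -> f64 {
    num_samples as f64 / f64::from(sample_rate_hz)
}
```

With the values logged above, `audio_frames(538, 54)` gives `Some(483)`, and `pcm_seconds(154930, 24_000)` comes out to roughly 6.5 seconds if the 24 kHz assumption holds.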

LaurentMazare commented 4 months ago

Yeah, it takes a bit of time to get the generation back; maybe we should have a progress bar or some other way to know that the process is not stuck. I'm also looking at getting this to run on Metal, though currently it doesn't seem to bring much speedup on an M2.
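
As a rough illustration of the progress idea, here is a minimal standard-library sketch that redraws a single status line on stderr after each step. It is not taken from the candle example: `progress_line` and `run_with_progress` are hypothetical names, and the `work` closure stands in for one sampling step. Since generation may stop early, `total` would presumably be the maximum number of steps.

```rust
use std::io::{self, Write};

/// Formats a one-line progress message such as "step 3/10 (30%)".
fn progress_line(step: usize, total: usize) -> String {
    // Guard against division by zero when there is nothing to do.
    let pct = if total == 0 { 100 } else { step * 100 / total };
    format!("step {}/{} ({}%)", step, total, pct)
}

/// Runs `total` steps of `work`, redrawing a progress line on stderr after each.
/// `work` is a placeholder for one sampling step of the model (hypothetical).
fn run_with_progress<F: FnMut(usize)>(total: usize, mut work: F) -> io::Result<()> {
    let stderr = io::stderr();
    let mut err = stderr.lock();
    for step in 0..total {
        work(step);
        // '\r' moves back to the start of the line so the message is overwritten in place.
        write!(err, "\r{}", progress_line(step + 1, total))?;
        err.flush()?;
    }
    // Finish with a newline so later output starts on a fresh line.
    writeln!(err)?;
    Ok(())
}
```

Writing to stderr keeps the status line out of anything piped from stdout. A dedicated crate such as indicatif would be a more polished option than this hand-rolled version.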

groovybits commented 4 months ago

Very fast now on Metal on an M2 Ultra! Amazing job, thank you :)

chris@earth candle % time cargo run --example metavoice --release --features=metal -- --prompt "hi how are you today"
    Finished release [optimized] target(s) in 0.15s
     Running `target/release/examples/metavoice --prompt 'hi how are you today'`
avx: false, neon: true, simd128: false, f16c: false
prompt: 'hi how are you today'
[2153, 2154, 2337, 2352, 2476, 2371, 2327, 2149, 2376, 2561]
text ids len: 11
sampling from logits...
codes: [[[1129, 1130, 1313, 1328, 1452, 1347, 1303, 1125, 1352, 1537,  780,  537,  798,
    499,   91,   70,  112,  949,  949,  945,  945,  344,  561,  770,  182,  784,
    984,  793,  414,  793,  983,  890,   23,  598,  321,  224,  136,  432,  860,
    598,  224,  491,  835, 1019,   25,  619,   25,  904,  321,  224, 1019,  876,
   1019,  420,  751,  813,  368,  683,  683,  495,  402, 1022, 1022,  402, 1001,
    495,  967,  136,  651,  491,  136,  976,  491,  430,  855, 1019,  738,  855,
    106,  106,  738,  106,  106,  106, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  601,  228,  870,
    870, 1007,  945,  897,  242,  760,  264,  961,  559,  399,  438,  279,  561,
    441,  626,  269,  475,  211,  502,  726,  165,  962,  664,  673,  826,  519,
    588,  897,  265,  974,  928,  860,  144,   81,  460,  579,  259,  941,  765,
    544,  144,  947,   36,  679,  801,  549,  796,  549,  422,   36,  801,   36,
     36,  144,  792,  920,  510,  801,  519,  942,  687,  519,  404,  363,  404,
    942,  913,  518,  913,  424,  363, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  852,  508,  432,
    159,  728, 1022,  728,  956,  443,  845,  593,   77,  650,  166,  866,  812,
     96,  176,  644,  673,  647,  119,  587,   24,  818,  842,  518,  308,  915,
    675,  818,  653,  879,  710, 1000,  590,  601,  970,  204,  185,  426,  710,
    915,  907,  287,  636,  773,  946,  111,  564,  638,  564,  828,  564,  998,
    853,  775,  237,  518,   93,  859,  832,  406, 1000,  829,  879, 1007,   36,
     36,  710,   36,  982,  653, 1015, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  866,  830,  318,
    730,  601,  632,   26,  152,  730,   75,  236,  798,  537,  161,  267,  286,
    923,  575,  915,  914,  197,  993,  119,  190,  776,  614,  993,  558,  388,
    364,  255, 1016,   74,  734,  288,  522,  926,  278,   61,  529,  919,   74,
    859,  841,  471,  277,  605,  796,  970,  810,  272,  345,  353,  242,  901,
    589,  933,  878,  853,  557, 1016,  960,  443,  961,  793,  838,  962, 1022,
    866,  838,  741,  956,  673,  956, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  528, 1007,  944,
    617,  756,  676,  467,  971,  164,  502,  959,  446,  842,  452,  483,  846,
    246,  410,  493,  433,  335,  302,  317,  907, 1003,  838, 1003,  658,  154,
     39,  909,  446,  862,  804,  375,  667,  373,  616,  983,  113,  882,  736,
    454,    8,  163,  893,  899,  993,  872,  866,  551,  108,  615,   78,   63,
    822,  959,  969,  397,   90,  313, 1017,  111,  357,  413,  111,  528,  882,
    622,  606,  375,  882,  904,  528, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  701,  489,  907,
    975,   89,  960,   21,  112,  751,  905,  372,  634,  805,  112,  932,  868,
    100,  266,  501,  477,  602,   57,  253,  624,  519,  388,  611,  669,  918,
    505,   10,  238,  632,  640,  701,   96,  236,  982,  350,  704,  632,   10,
    851,  606,  880,  448,  147,  907,  658,  805,  278,  982,  621,  956,  690,
    466,  760,  757,  828,  958,  768,  314,  461,  238,  461,  995,  982, 1011,
    929,  435,   41,  986,  435, 1011, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1015,  380,  596,
    493,  685,  519,  728,  630,  581,  685,  770,  164,  152,  173,  786,  435,
    648,  720,  585,  845,  694,  647,  971,  243, 1008,  496,  579,  620,  764,
    444,  188,  994,  390,  786,  983,  632,  866,  365,  586,  928,  291,  782,
   1015,  586,  940,  718,  576,  399,  682,   16,  295,  877,  581,  402,   67,
    383,  820,  360,   28,  416,   45,  496,  675,  480,  887,  853,  291,  291,
    887,  772, 1002,  748,  900,  570, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  899,  117,  562,
    544,  766,  647,  339,   23,  125,  639,  758,  810,  636,  638,  191,  366,
    520,  288,  679,   65,  458,  968, 1019,  660,  160,  343,  701,  233,  615,
    204,  884,  562,  818,  835,  468,  529,  878,  429,  429,  472,  828,  475,
    947,  591,  777,  688,  650,  892,  458,  541,  799,  778,  791,  383,  505,
      2,  961,  737,  669,  416,  660,  401,  660,  835,  989, 1019, 1012, 1019,
    475,  975,  931,  383,  475,  975, 1024]]]
Tensor[[1, 8, 85], u32, metal:4294968481]
text_ids len: 10
audio_ids shape: [1, 8, 74]
output pcm shape: [1, 1, 24050]
cargo run --example metavoice --release --features=metal -- --prompt   2.12s user 3.63s system 75% cpu 7.648 total

metavoice.wav.gz

The audio is a bit wavy, but I understand that part is still in progress :)