lucidrains / soundstorm-pytorch

Implementation of SoundStorm, Efficient Parallel Audio Generation from Google Deepmind, in Pytorch
MIT License

Can you update the README for these 2? #9

Closed: FurkanGozukara closed this issue 1 year ago

FurkanGozukara commented 1 year ago

How do I generate speech? I don't see any examples.

How do I train on our own audio? I don't see any examples either.

Thank you

korakoe commented 1 year ago

I’m also curious about text-conditioned audio generation. Is there a rough estimate for when that will be implemented?

olup commented 1 year ago

Isn't SoundStorm supposed to be a semantic-to-audio generator, so that it needs to be used within a broader architecture like SPEAR-TTS or AudioLM for a complete audio generation pipeline? At least that's what I get from the abstract. Is this project aiming at end-to-end TTS, or (as one would expect) only at the SoundStorm element?

lucidrains commented 1 year ago

so i'm less familiar with the current literature for TTS, but i believe what they are doing is generating a conditional tensor aligned in the time dimension and directly summing it into the input embeddings of the conformer maskgit. can someone point me at the current best solution for phoneme alignment? is it related to the pull request over at naturalspeech2-pytorch? (https://arxiv.org/abs/2108.10447)
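
A minimal sketch of the idea described above, assuming a per-phoneme frame count is already available from some aligner: each phoneme embedding is repeated for the frames it covers, and the resulting time-aligned tensor is summed into the input embeddings. The function names, the toy embeddings, and the `durations` values are all hypothetical and are not taken from this repository.

```python
from typing import List

Vector = List[float]


def expand_by_duration(phoneme_embs: List[Vector], durations: List[int]) -> List[Vector]:
    """Repeat each phoneme embedding for as many frames as it is assumed to last."""
    if len(phoneme_embs) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[Vector] = []
    for emb, n_frames in zip(phoneme_embs, durations):
        frames.extend(list(emb) for _ in range(n_frames))
    return frames


def add_conditioning(token_embs: List[Vector], cond: List[Vector]) -> List[Vector]:
    """Sum the time-aligned conditioning into the token embeddings, frame by frame."""
    if len(token_embs) != len(cond):
        raise ValueError("conditioning must have the same number of frames as the input")
    return [[t + c for t, c in zip(tok, cnd)] for tok, cnd in zip(token_embs, cond)]


# Toy data: 2 phonemes, embedding dimension 2, 5 audio frames.
phoneme_embs = [[1.0, 0.0], [0.0, 1.0]]
durations = [2, 3]  # hypothetical aligner output: frames per phoneme
token_embs = [[0.5, 0.5] for _ in range(5)]

cond = expand_by_duration(phoneme_embs, durations)
conditioned = add_conditioning(token_embs, cond)
```

The hard part raised in the comment is where `durations` comes from: a non-autoregressive model needs that alignment up front, which is why the choice of phoneme aligner matters.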

lucidrains commented 1 year ago

will be going offline soon for travel, but hope to wrap up all TTS related open source projects by next month's end

lucidrains commented 1 year ago

it uses the soundstream from audiolm

as it stands now, it should be good for unconditional synthesis. for conditional, i had originally planned to just cross attend to text with this repository, but now i realize alignment is an issue for non-autoregressive solutions, and that this is still an active research topic

FurkanGozukara commented 1 year ago

Does unconditional synthesis mean random voices?

But your showcases are text-to-speech. How come?

olup commented 1 year ago

What showcases?

Unconditional synthesis means random generation reflecting the training distribution (it will show that the model works, but it will not be a prompted text-to-speech tool on its own).
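
As a toy illustration of that distinction, and not of how SoundStorm itself works, the sketch below "trains" on token sequences by counting token frequencies and then samples new tokens from those frequencies. The sampling function takes no text or prompt argument, so the output only reflects the training data. All names and token values are hypothetical.

```python
import random
from collections import Counter
from typing import List


def fit_token_frequencies(training_sequences: List[List[int]]) -> Counter:
    """Count how often each token appears in the training data."""
    counts: Counter = Counter()
    for seq in training_sequences:
        counts.update(seq)
    return counts


def sample_unconditionally(counts: Counter, length: int, rng: random.Random) -> List[int]:
    """Draw tokens in proportion to their training frequency; no prompt is involved."""
    tokens = sorted(counts)
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=length)


training_sequences = [[3, 3, 7], [3, 7, 7, 9]]
counts = fit_token_frequencies(training_sequences)
sample = sample_unconditionally(counts, length=8, rng=random.Random(0))
```

A conditional generator would take an extra input (text, semantic tokens, a voice prompt) and shape the output around it; nothing here does that.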

FurkanGozukara commented 1 year ago

Seriously, how did you find this repo?

Here: https://google-research.github.io/seanet/soundstorm/examples/

lucidrains commented 1 year ago

right.. it is not done yet. watch for the 'work in progress' flag to be removed

olup commented 1 year ago

@FurkanGozukara the page you linked shows samples from Google, not from the author or this project. This project aims to reproduce those results, but that is still only an aim.

Also if you read the abstract:

SoundStorm, coupled with the text-to-semantic modeling stage of SPEAR-TTS (Kharitonov et al., 2023), can synthesize high quality, natural dialogues, allowing one to control the spoken content (via transcripts), speaker voices (via short voice prompts) and speaker turns (via transcript annotations).

I am glad the author seems to be considering an end-to-end pipeline, but going by the paper you would need to couple this with another body of work (like @collabora spear-tts-pytorch) to have a complete TTS system.
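
A rough sketch of how the stages named in the abstract could be chained, assuming three separately trained components: a text-to-semantic model (the SPEAR-TTS part), a semantic-to-acoustic model (the SoundStorm part), and a codec decoder that turns acoustic tokens into a waveform. The function names and the stand-in stages below are hypothetical placeholders so the sketch runs; they are not the API of this repository or of spear-tts-pytorch.

```python
from typing import Callable, List

TextToSemantic = Callable[[str], List[int]]
SemanticToAcoustic = Callable[[List[int]], List[List[int]]]
CodecDecoder = Callable[[List[List[int]]], List[float]]


def synthesize(
    text: str,
    text_to_semantic: TextToSemantic,
    semantic_to_acoustic: SemanticToAcoustic,
    codec_decoder: CodecDecoder,
) -> List[float]:
    """Chain the three stages: text -> semantic tokens -> acoustic tokens -> waveform."""
    semantic_tokens = text_to_semantic(text)
    acoustic_tokens = semantic_to_acoustic(semantic_tokens)
    return codec_decoder(acoustic_tokens)


# Stand-in stages; real ones would be trained models.
def fake_text_to_semantic(text: str) -> List[int]:
    return [ord(ch) % 16 for ch in text]


def fake_semantic_to_acoustic(semantic: List[int]) -> List[List[int]]:
    # Two made-up codebook levels per frame.
    return [[tok, tok + 1] for tok in semantic]


def fake_codec_decoder(acoustic: List[List[int]]) -> List[float]:
    # One made-up sample per frame.
    return [sum(frame) / 32.0 for frame in acoustic]


waveform = synthesize(
    "hi", fake_text_to_semantic, fake_semantic_to_acoustic, fake_codec_decoder
)
```

Only the middle stage corresponds to SoundStorm; the first stage is the missing piece discussed in this thread.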

FurkanGozukara commented 1 year ago

So this author is not the author of those Google samples, but someone else?

I see.

lucidrains commented 1 year ago

@FurkanGozukara that's correct, i'm not one of the paper authors. i believe google will no longer be open sourcing, but evidently still publishing. not sure how they will retain researchers otherwise

@olup oh i see, they use a component of spear-tts! related to https://github.com/lucidrains/audiolm-pytorch/issues/84 i'll see what i can do by next month's end. all we need here is the text-to-semantic module

bharani-y commented 1 year ago

@lucidrains does this SoundStorm implementation support voice imitation via voice prompts, or does that need to be handled by the external text-to-semantic module used in SPEAR-TTS?

Thanks