suno-ai / bark

🔊 Text-Prompted Generative Audio Model
MIT License

Arbitrarily long text #21

Closed · notnil closed this issue 1 year ago

notnil commented 1 year ago

Is there a way to run on arbitrarily long text for example breaking up by max token (not splitting words)?

mcamac commented 1 year ago

The best way right now is probably to split text up into sentences or chunks, and generate separately, passing in the same speaker prompt for consistency. We could consider a better API that does this under the hood as well
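A minimal sketch of that workaround. The `generate_audio(text, history_prompt=...)` call matches Bark's public API, but the sentence-splitting helper and the speaker choice here are my own illustration, not anything the library provides:

```python
import re

import numpy as np


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter (illustrative helper, not part of bark)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def generate_long(text: str, speaker: str = "v2/en_speaker_6") -> np.ndarray:
    # Imported here so the splitter above works even without bark installed.
    from bark import generate_audio

    pieces = [
        # Pass the same history_prompt every time for voice consistency.
        generate_audio(sentence, history_prompt=speaker)
        for sentence in split_sentences(text)
    ]
    return np.concatenate(pieces)
```

Concatenating the per-sentence waveforms is the simplest join; you may want to insert a short silence between chunks if the seams sound abrupt.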

notnil commented 1 year ago

Yeah it would be great to point this to larger documents.

JonathanFly commented 1 year ago

> Is there a way to run on arbitrarily long text for example breaking up by max token (not splitting words)?

For now you can just eyeball what would be audio chunks less than 14 seconds long, and use the same history_prompt in all the generations.

I think it's a little trickier than just chunking on tokens, because it seems the tokens that are input into generate_text_semantic are only loosely correlated with audio length. They are just regular transformer encoded tokens. A typical couple of sentences is probably 30 something tokens, which is about 14 seconds of audio. Or it's more because the speaker is a slow talker. Or it's less because it's a fast rap song. Or half the tokens are just describing the audio and aren't actually things to be said aloud, like WOMAN: or [whispering].

generate_text_semantic() in generation.py will happily accept even multiple paragraphs - up to 256 tokens. I think the model is like, "Okay, so what 14 seconds of audio best represents these two paragraphs of words based on all the 14 second transcripts I was given during training..." But it wasn't given much text like that. So it's like, "I guess I can say anything? Maybe yell out a few words from the text prompt in the middle? Sounds good to me, based on all the clips I saw in my training."

This is the best, by the way. Sometimes it sounds like you caught the actors who were reading your text prompt between takes, or doing vocal warmups with it.

https://user-images.githubusercontent.com/163408/233495734-ee6dbc72-670d-4dfb-8764-340ec35fa899.mp4

I love this so much I'd be all over this model even if every prompt ended up like that.

10% of the time it abridges it perfectly, so there must be some segments like that in training.

If you change that 256 to 64, long outputs generally get properly abridged. But it's not super useful because (unless I'm missing something?) you can't directly decode the semantic token outputs to easily check, "Oh, it stopped at this word, so I'll pick up there next time."

```python
if len(encoded_text) > 256:
    p = round((len(encoded_text) - 256) / len(encoded_text) * 100, 1)
    logger.warning(f"warning, text too long, lopping off last {p}%")
    encoded_text = encoded_text[:256]
```

And even if you could check, output quality seems worse if you pack it like that. Probably best to keep it simple: feed Bark 10 to 14 seconds' worth of text at a time, using your judgment to estimate. Bark is pretty good about stopping early for text shorter than 14 seconds, so the final output is pretty seamless.
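One way to do that estimation in code: assume an average speaking rate (the ~2.5 words/second constant below is a rough guess, not anything measured from Bark) and greedily pack whole words into chunks whose estimated duration stays under ~13 seconds. The function name and constant are mine:

```python
WORDS_PER_SECOND = 2.5  # rough average speaking rate; tune per speaker


def chunk_by_estimated_duration(text: str, max_seconds: float = 13.0) -> list[str]:
    """Greedily pack whole words into chunks estimated to fit in max_seconds."""
    max_words = int(max_seconds * WORDS_PER_SECOND)
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```

This never splits a word (per the original question), though it can split mid-sentence; combining it with a sentence splitter would give cleaner chunk boundaries at the cost of less even chunk sizes.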

gkucsko commented 1 year ago

gonna close to consolidate conversations, see here: https://github.com/suno-ai/bark/issues/79