suno-ai / bark

🔊 Text-Prompted Generative Audio Model
MIT License
35.46k stars 4.17k forks

Timestamp audio generated #530

Open NicoCaldo opened 7 months ago

NicoCaldo commented 7 months ago

This is more of a broad question about the model/project than a real issue.

Is it possible with the current project to timestamp the generated audio, word by word? Something like generating .srt subtitle files?
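For reference, assuming word-level timestamps were available from bark (they are not today; the tuples below are a hypothetical input shape), turning them into an .srt file is straightforward. A minimal sketch:

```python
def srt_timestamp(sec: float) -> str:
    """Format seconds as the SRT timecode HH:MM:SS,mmm."""
    ms = int(round(sec * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(word_timestamps):
    """word_timestamps: list of (word, start_sec, end_sec) tuples.

    Emits one numbered SRT block per word.
    """
    blocks = []
    for i, (word, start, end) in enumerate(word_timestamps, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{word}\n"
        )
    return "\n".join(blocks)
```

In practice you would probably group several words per subtitle block, but the timecode format is the same.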

newsve commented 6 days ago

This is the single most important feature that would make bark a pro-grade product. AWS's and Azure's APIs offer this out of the box. With bark we'd need to build a cumbersome pipeline where we first run STT on bark's output and then word-match that against the og text, which is anything but trivial.
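A rough sketch of the word-matching half of that pipeline, assuming you already ran some STT pass (e.g. a Whisper-style model) over bark's audio and got `(word, start_sec, end_sec)` tuples (a hypothetical input shape, not a bark API). It uses `difflib.SequenceMatcher` to align recognized words back to the source text, which illustrates why this is brittle: misrecognized words simply drop out of the alignment.

```python
import difflib

def align_words(source_text, stt_words):
    """Align STT-recognized words (with timestamps) to the source text.

    stt_words: list of (word, start_sec, end_sec) tuples.
    Returns a list of (source_index, source_word, start, end) for the
    words that matched; misrecognized words are silently skipped.
    """
    src = source_text.split()
    # Crude normalization so "Hello," still matches "hello"
    norm = lambda w: w.lower().strip(".,!?;:")
    norm_src = [norm(w) for w in src]
    norm_hyp = [norm(w) for w, _, _ in stt_words]
    matcher = difflib.SequenceMatcher(a=norm_src, b=norm_hyp, autojunk=False)
    aligned = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i, j = block.a + k, block.b + k
            _, start, end = stt_words[j]
            aligned.append((i, src[i], start, end))
    return aligned
```

Any word the STT model gets wrong loses its timestamp entirely, which is exactly why native timestamps from the generator would be so much better.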

It would be even better if bark output not just the word and the timecode (begin and end, as with AWS) but also the og index of that word in the og source file. Then matching source and audio would be a breeze. @gkucsko what do you think?
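Concretely, a per-word record in the proposed format might look like this (a hypothetical schema, not an existing bark output; field names are made up for illustration):

```python
import json

# AWS-style begin/end times plus the extra source-index field proposed
# above, which makes matching audio back to the source file trivial.
record = {
    "word": "breeze",
    "start_sec": 12.34,       # when the word starts in the audio
    "end_sec": 12.71,         # when it ends
    "source_index": 57,       # index of the word in the og input file
}
print(json.dumps(record))
```

With `source_index` present, a consumer never has to re-tokenize or fuzzy-match the source text at all.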

newsve commented 3 days ago

@gkucsko FYI, here why ElevenLabs' timestamp feature is extremely good:

It really does output the og input file, byte by byte, including newlines, so not some slightly altered string or something reconstructed from a post-STT process. This is absolutely amazing and saves so much headache.

Before we go all-in on ElevenLabs: is something similar already in development, or lying around somewhere in your backlog?

FWIW, ElevenLabs detects emotion and tonality from the dialogue attribution, e.g. "Hi," he said softly, so the softly tells ElevenLabs to speak that Hi softly; no need for special tokens, which is also kind of neat.