Open NicoCaldo opened 7 months ago
This is the single most important feature that would make bark a pro product. AWS's and Azure's APIs offer this out of the box. With bark we'd need to build a cumbersome pipeline where we first run STT on bark's output and then word-match that with the og text, which is anything but trivial.
It would be even better if bark output not just the word and the timecode (begin and end, like with AWS) but also the og index of that word in the og source file. Then matching source and audio would be a breeze. @gkucsko what do you think?
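For what it's worth, the word-matching half of that pipeline could be sketched roughly like this, assuming an STT model (e.g. Whisper) has already produced `(word, begin, end)` tuples from bark's audio. `align_words` and the output shape are hypothetical, not anything bark or Whisper provide:

```python
import difflib

def align_words(source_text, stt_words):
    """Align STT-recognized words (with timestamps) back to the indices
    of the corresponding words in the original source text.

    stt_words: list of (word, begin_sec, end_sec) tuples, e.g. from a
    hypothetical Whisper pass over bark's audio output.
    """
    src = source_text.split()
    # Compare case-insensitively and ignore punctuation when matching.
    norm = lambda w: "".join(c for c in w.lower() if c.isalnum())
    sm = difflib.SequenceMatcher(
        a=[norm(w) for w in src],
        b=[norm(w) for (w, _, _) in stt_words],
    )
    aligned = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            for i, j in zip(range(i1, i2), range(j1, j2)):
                _, begin, end = stt_words[j]
                # Keep the og word and its og index, as requested above.
                aligned.append({"word": src[i], "index": i,
                                "begin": begin, "end": end})
    return aligned
```

Even this sketch only handles the happy path; STT misrecognitions, numbers read out as words, etc. are exactly why a native timestamp output from bark would be so much nicer.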
@gkucsko FYI, here's why ElevenLabs' timestamp feature is extremely good:
It really does output the og input file, byte by byte, including newlines: not some slightly altered string or something reconstructed from a post-STT process. This is absolutely amazing and saves so much headache.
Before we go all-in on ElevenLabs: is something similar already in development, or lying around somewhere in your backlog?
FWIW, ElevenLabs detects emotions and tonality from the dialogue attribution: e.g. in `"Hi", he said softly`, the `softly` tells ElevenLabs to speak that `Hi` softly, so there's no need for special tokens, which is also kind of neat.
This is more of a broad question about the model/project than a real issue.
Is it possible with the current project to timestamp the generated audio, word by word? Something like generating .srt subtitle files?
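If per-word timestamps ever become available (from bark itself or from an STT pass), turning them into .srt would be the easy part. A minimal sketch, assuming hypothetical `(word, begin, end)` tuples in seconds; `to_srt` and `max_gap` are made-up names, not bark API:

```python
def to_srt(words, max_gap=0.6):
    """Group (word, begin_sec, end_sec) tuples into .srt cues, starting
    a new cue whenever the pause between words exceeds max_gap seconds."""
    def ts(t):
        # SubRip timestamps use the HH:MM:SS,mmm format.
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues, cur = [], []
    for w in words:
        if cur and w[1] - cur[-1][2] > max_gap:
            cues.append(cur)
            cur = []
        cur.append(w)
    if cur:
        cues.append(cur)

    lines = []
    for n, cue in enumerate(cues, 1):
        lines += [str(n),
                  f"{ts(cue[0][1])} --> {ts(cue[-1][2])}",
                  " ".join(w[0] for w in cue),
                  ""]
    return "\n".join(lines)
```

So the hard part really is getting the timestamps out of the model, not the subtitle file format.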