snakers4 / silero-models

Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple

Period token mostly missing from text enhancer model output #106

Closed. abhinavkulkarni closed this issue 2 years ago

abhinavkulkarni commented 2 years ago

Hi,

I fed this short audio clip to an STT engine and obtained the following transcription:

good afternoon everyone my name is preet bharara and i'm united states attorney for the southern district of new york that was me a few months ago at a press conference when i still had the best job i'll ever have i oversaw prosecutions against every type of criminal you can imagine mobsters murderers and corrupt politicians for decades bernie madoff was able to launder billions of dollars and ponzi proceeds of criminal charges against general motors company related to today accused arms dealer viktor boot begins to face american justice on march eleventh of this year though i lost that job actually that's a euphemism i was fired by president donald trump himself since then a lot has happened fbi director james comey was fired robert muller was appointed to find out whether anyone in the white house colluded with russia the attorney general gets maligned by the president on a regular basis and i'm on the sidelines now so i figure that's a perfect place to launch a podcast we're going to talk to prosecutors to judges to justice department officials the investigative reporters who break these stories even some politicians and i'll be bringing them on the show for conversations that  you won't get to hear anywhere else wnyc studios and cafe are presenting our show produced by pineapple street media so head to a apple podcasts or wherever you get your podcasts and subscribe right now to stay tuned with preet pre bahar out here opera rather greens lara crate barrel high profile us attorney for manhattan preet bharara preet bharara bharara 

Feeding this as-is to the text enhancer model in example.ipynb produces the following output:

Good afternoon Everyone My name is Preet Bharara and I'm United States attorney for the Southern District of New York that was me a few months ago at a press conference when I still had the best job I'll ever have I oversaw prosecutions against every type of criminal, you can imagine mobsters murderers and corrupt politicians for decades Bernie Madoff was able to launder billions of dollars and ponzi proceeds of criminal charges against general Motors company related to today accused arms dealer Viktor Boot begins to face American justice on March eleventh of this year, though I lost that job actually that's a euphemism I was fired by President Donald Trump himself since then a lot has happened FBi Director James Comey was fired Robert Muller was appointed to find out whether anyone in the White House colluded with Russia the attorney general gets maligned by the president on a regular basis and I'm on the sidelines now so I figure that's a perfect place to launch a podcast we're going to talk to prosecutors to judges to Justice Department officials the investigative reporters who break these stories even some politicians and I'll be bringing them on the show for conversations that you won't get to hear anywhere else WnyC Studios and Cafe are presenting our show produced by Pineapple Street Media So head to a Apple podcasts, or Wherever you get your podcasts and subscribe right now to stay tuned with Preet pre Bahar out here Opera Rather Greens Lara Crate Barrel High profile Us Attorney for Manhattan Preet Bharara Preet Bharara Bharara.

You can see it misses almost all the periods.

Thanks!

snakers4 commented 2 years ago

Hi,

We described these limitations in the accompanying article - https://habr.com/ru/post/581960/:

We had to put a full stop somewhere (pun intended), so the following ideas were left for future work:

Support inputs consisting of several sentences;
Try model factorization and pruning (i.e. attention head pruning);
Add some relevant meta-data from the spoken utterances, i.e. pauses or intonations (or any other embedding);

Support for paragraphs consisting of several sentences will be added in the next version.
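
Until multi-sentence input is supported, one possible workaround is to feed the enhancer shorter pieces of the transcript and join the results. The sketch below is only an illustration of that idea, not part of silero-models: `chunk_words`, `enhance_in_chunks`, and the `max_words` window are hypothetical names, and `enhance` stands for whatever callable wraps the text enhancer from example.ipynb. A trivial stand-in is used here so the snippet runs without the model.

```python
from typing import Callable, Iterator


def chunk_words(text: str, max_words: int = 20) -> Iterator[str]:
    """Yield consecutive windows of at most `max_words` whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words])


def enhance_in_chunks(
    text: str,
    enhance: Callable[[str], str],
    max_words: int = 20,
) -> str:
    """Apply `enhance` to each window separately and join the enhanced pieces.

    `enhance` is an assumption: any function that takes a lowercase,
    unpunctuated string and returns the punctuated, capitalized version.
    """
    return " ".join(enhance(chunk) for chunk in chunk_words(text, max_words))


def fake_enhance(chunk: str) -> str:
    """Stand-in for the real model: capitalize the first letter, append a period."""
    return chunk[:1].upper() + chunk[1:] + "."


if __name__ == "__main__":
    raw = "good afternoon everyone my name is preet bharara"
    print(enhance_in_chunks(raw, fake_enhance, max_words=3))
```

Fixed-size windows will not line up with real sentence boundaries, so this can trade missing periods for misplaced ones. If the STT engine exposes word timings, splitting on long pauses would likely be a better choice of boundary, in line with the "pauses or intonations" idea listed above.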