lowerquality / gentle

gentle forced aligner
https://lowerquality.com/gentle/
MIT License

Help with matching results to original transcript #215

Open hoodji opened 5 years ago

hoodji commented 5 years ago

Hi, I am trying to match the Gentle results to the original transcript word by word. In doing so I note that Gentle does some parsing of the transcript. So far I have noticed that:

- Punctuation such as ? ! , ; : . is removed
- Non-spoken characters such as - are removed
- Words with hyphens in them are split into two words

I am addressing each barrier to matching on a trial-and-error basis, and am now wondering if there is a way to find the complete set of rules Gentle uses when it creates its list of words to find?

Any help appreciated

strob commented 5 years ago

You need to use the startOffset and endOffset fields to recover punctuation from the original transcript.
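A minimal sketch of that idea in Python, for anyone following along. It assumes each aligned word is a dict carrying `startOffset` and `endOffset`, and that those offsets are character indices into exactly the transcript string that was submitted; the function name `with_punctuation` and the sample data are made up for illustration.

```python
def with_punctuation(transcript, words):
    """Pair each aligned token with the raw transcript text that follows it."""
    pieces = []
    for i, w in enumerate(words):
        # The gap runs from the end of this token to the start of the next one,
        # or to the end of the transcript for the last token.
        if i + 1 < len(words):
            next_start = words[i + 1]["startOffset"]
        else:
            next_start = len(transcript)
        token = transcript[w["startOffset"]:w["endOffset"]]
        gap = transcript[w["endOffset"]:next_start]
        pieces.append((token, gap))
    return pieces


# Hand-written sample data (assumed shape, not real aligner output).
transcript = "Hello, world! Is it a dog-leg?"
words = [
    {"word": "Hello", "startOffset": 0, "endOffset": 5},
    {"word": "world", "startOffset": 7, "endOffset": 12},
    {"word": "Is", "startOffset": 14, "endOffset": 16},
    {"word": "it", "startOffset": 17, "endOffset": 19},
    {"word": "a", "startOffset": 20, "endOffset": 21},
    {"word": "dog", "startOffset": 22, "endOffset": 25},
    {"word": "leg", "startOffset": 26, "endOffset": 29},
]

pieces = with_punctuation(transcript, words)
rebuilt = "".join(token + gap for token, gap in pieces)
```

Any text before the first token (an opening quote mark, say) is not captured by this sketch and would need to be prepended separately.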


hoodji commented 5 years ago

Thanks for responding. Offsets work for the simple case of punctuation at the end of a word, but not so well for hyphenated words such as dog-leg, which Gentle splits into two words. My sense is that I can handle most situations using logic; what I need to know is all the situations that could occur, i.e. Gentle's logic for deciding what a spoken word is when it parses the transcript.
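One way to handle the hyphenated case without knowing every rule is to work from the transcript side: split the original text on whitespace and collect whichever aligned tokens fall inside each chunk. The sketch below assumes the same word dicts with character-based `startOffset` and `endOffset`; `group_by_original_word` is a hypothetical helper name and the sample data is hand-written.

```python
import re


def group_by_original_word(transcript, words):
    """Attach each aligned token to the whitespace-delimited word containing it."""
    groups = []
    for m in re.finditer(r"\S+", transcript):
        # A token belongs to this chunk if its offsets lie fully inside it.
        inside = [
            w for w in words
            if m.start() <= w["startOffset"] and w["endOffset"] <= m.end()
        ]
        groups.append((m.group(), inside))
    return groups


# Hand-written sample data (assumed shape, not real aligner output).
transcript = "a dog-leg."
words = [
    {"word": "a", "startOffset": 0, "endOffset": 1},
    {"word": "dog", "startOffset": 2, "endOffset": 5},
    {"word": "leg", "startOffset": 6, "endOffset": 9},
]

groups = group_by_original_word(transcript, words)
tokens_per_word = [(text, [w["word"] for w in ws]) for text, ws in groups]
```

Here "dog-leg." ends up with both the "dog" and "leg" tokens attached, so the original word keeps its hyphen and full stop. A chunk with no tokens at all (for example a lone dash) simply gets an empty list.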

hoodji commented 5 years ago

For the benefit of anyone else struggling with this, the 'tokenisation' takes place in metasentence.py and the regular expression used is there.
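To illustrate how a regex-based tokeniser produces the behaviour described in this thread, here is a small sketch. The pattern below is a hypothetical stand-in written for this example, not the one in metasentence.py, so check the source for the exact rules.

```python
import re

# Hypothetical pattern for illustration only; NOT copied from metasentence.py.
# It matches runs of word characters, allowing internal apostrophes.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")


def tokenize(transcript):
    """Return tokens with their character offsets into the transcript."""
    return [
        {"word": m.group(), "startOffset": m.start(), "endOffset": m.end()}
        for m in TOKEN_RE.finditer(transcript)
    ]


tokens = tokenize("It's a dog-leg, isn't it?")
token_words = [t["word"] for t in tokens]
```

With a pattern like this, punctuation is never matched so it drops out, and a hyphen ends one match and starts another, which is why dog-leg comes back as two words.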