Open hoodji opened 5 years ago
You need to use the startOffset and endOffset fields to recover punctuation from the original transcript.
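A minimal sketch of that approach. It assumes the alignment result is a dict with a `transcript` string and a `words` list whose entries carry `startOffset` and `endOffset`, and that those offsets are Python string positions in `transcript` (if they turn out to be byte offsets, slice `transcript.encode("utf-8")` instead). The function name and the sample data are made up for illustration.

```python
def with_trailing_text(result):
    """Pair each aligned word with the transcript text that follows it."""
    transcript = result["transcript"]
    words = result["words"]
    pairs = []
    for i, w in enumerate(words):
        # The gap runs from the end of this word to the start of the next one
        # (or to the end of the transcript for the last word).
        if i + 1 < len(words):
            next_start = words[i + 1]["startOffset"]
        else:
            next_start = len(transcript)
        original = transcript[w["startOffset"]:w["endOffset"]]
        gap = transcript[w["endOffset"]:next_start]
        pairs.append((original, gap))
    return pairs


# Hypothetical sample shaped like the assumed result, not real Gentle output.
sample = {
    "transcript": "Hello, world! How are you?",
    "words": [
        {"word": "Hello", "startOffset": 0, "endOffset": 5},
        {"word": "world", "startOffset": 7, "endOffset": 12},
        {"word": "How", "startOffset": 14, "endOffset": 17},
        {"word": "are", "startOffset": 18, "endOffset": 21},
        {"word": "you", "startOffset": 22, "endOffset": 25},
    ],
}
```

Each pair holds the word exactly as written in the transcript plus whatever punctuation and whitespace came after it, so nothing from the original text is lost.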
On Feb 9, 2019, at 02:24, hoodji notifications@github.com wrote:
Hi, I am trying to match the Gentle results to the original transcript word by word. In doing so I have noticed that Gentle does some parsing of the transcript. So far I have seen that:
- Punctuation such as ? ! , ; : . is removed
- Non-spoken characters such as - are removed
- Words with hyphens in them are split into two words
I am addressing each barrier to matching on a trial-and-error basis, and am now wondering whether there is a way to find the complete set of rules Gentle uses when it creates its list of words to find.
Any help appreciated
Thanks for responding. Offsets work for the simple case of punctuation at the end of a word, but not so well for hyphenated words such as dog-leg, which Gentle splits into two words. My sense is that I can handle most situations using logic; what I need to know is all the situations that could occur, i.e. Gentle's logic for deciding what a spoken word is when it parses the transcript.
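One possible way to handle the hyphen case with logic alone is sketched below. It assumes each word entry carries `startOffset` and `endOffset` as string positions in the transcript, and it treats any two consecutive words with no whitespace between them as pieces of one original token. `regroup_tokens` is a hypothetical helper, not part of Gentle.

```python
def regroup_tokens(transcript, words):
    """Group aligned words back into whitespace-delimited transcript tokens."""
    groups = []
    for w in words:
        if groups:
            prev_end = groups[-1][-1]["endOffset"]
            gap = transcript[prev_end:w["startOffset"]]
            # No whitespace between two words means they came from one token.
            if not any(ch.isspace() for ch in gap):
                groups[-1].append(w)
                continue
        groups.append([w])
    # Return the original token text together with the words it was split into.
    return [
        (transcript[g[0]["startOffset"]:g[-1]["endOffset"]], g)
        for g in groups
    ]


# Hypothetical sample: "dog-leg" arrives as two separate words.
text = "The dog-leg hole."
aligned = [
    {"word": "The", "startOffset": 0, "endOffset": 3},
    {"word": "dog", "startOffset": 4, "endOffset": 7},
    {"word": "leg", "startOffset": 8, "endOffset": 11},
    {"word": "hole", "startOffset": 12, "endOffset": 16},
]
tokens = regroup_tokens(text, aligned)
```

Because the grouping depends only on the gap text, it does not need to know which characters caused the split; it would also rejoin words separated by a slash or an underscore-free symbol, which may or may not be what you want.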
For the benefit of anyone else struggling with this: the tokenisation takes place in metasentence.py, and the regular expression used is there.
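To experiment with tokenisation outside Gentle, something like the following may help. The pattern is a stand-in chosen to reproduce the behaviours listed in this thread (punctuation dropped, hyphenated words split); it is not copied from metasentence.py, so check the real expression there before relying on it. The `tokenize` helper and its output keys are illustrative assumptions.

```python
import re

# Stand-in pattern, NOT copied from metasentence.py: runs of word characters,
# optionally joined by an apostrophe, so "Don't" stays whole and "dog-leg" splits.
WORD_RE = re.compile(r"\w+(?:['’]\w+)*")


def tokenize(transcript):
    """Return each matched word with its character offsets in the transcript."""
    return [
        {"word": m.group(), "startOffset": m.start(), "endOffset": m.end()}
        for m in WORD_RE.finditer(transcript)
    ]


tokens = tokenize("Don't stop, dog-leg!")
```

Comparing the output of a tokeniser like this with the word list Gentle returns for the same transcript is a quick way to spot cases where the two disagree.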