Closed arirangz closed 10 years ago
On May 1, 2014, at 11:12 PM, arirangz notifications@github.com wrote:
First of all, thanks for this great script!
Thanks!
I noticed some issues and I was wondering if you have an easy way to fix them.
Hi there, unfortunately neither of these is easy to fix…I don’t consider them bugs. These problems fall under the rubric of “text normalization” in speech/language technology. The solutions tend to be somewhat complex and domain-specific, so it’s not obvious what to do here.
I have transcriptions that contain isolated numbers (e.g., 6) and composed numbers (e.g., 2008). Do you have a way to improve the script to recognize them?
You should replace digit sequences with alphabetic names thereof, e.g., “2008” to “two thousand and eight”. In fact, that example is interesting because there are (at least) two ways to say it: “two thousand eight” (if it’s a year) or “two thousand and eight” (elsewhere); I think it best to let the user resolve those offline. You can do this with a finite-state automaton or a context-free grammar (I assign this as homework in my NLP class), but incorporating it into the current system would perhaps double the complexity of said system.
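For anyone who wants to do this offline before running the script, here is a minimal sketch of the digit-to-words step. It is a plain recursive function rather than the finite-state automaton or grammar mentioned above, it is not part of this project, and the names `number_to_words` and `normalize_digits` are hypothetical. It assumes English, integers below one million, and unhyphenated output ("forty five") so that each word can be looked up separately; the `use_and` flag leaves the "two thousand eight" versus "two thousand and eight" choice to the user.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n, use_and=False):
    """Spell out an integer 0 <= n < 1,000,000 (limit chosen for this sketch)."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only handles 0..999999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        words = ONES[hundreds] + " hundred"
    else:
        thousands, rest = divmod(n, 1000)
        words = number_to_words(thousands, use_and) + " thousand"
    if rest == 0:
        return words
    # Optionally insert "and" before a final part smaller than one hundred.
    joiner = " and " if use_and and rest < 100 else " "
    return words + joiner + number_to_words(rest, use_and)


def normalize_digits(text, use_and=False):
    """Replace each run of ASCII digits in text with its spelled-out form."""
    def spell(match):
        digits = match.group()
        if len(digits) > 6:
            return digits  # too long for this sketch; leave it for the user
        return number_to_words(int(digits), use_and)
    return re.sub(r"[0-9]+", spell, text)


print(normalize_digits("I saw 6 cats in 2008"))
print(normalize_digits("It cost 2008 dollars", use_and=True))
```

Decimals, ordinals, dates, and years read as pairs ("nineteen ninety") are deliberately out of scope here; those are the domain-specific cases that make the general problem hard.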
Also it doesn't recognize initials (eg. TNT).
Here’s a question: should it be “tee enn tee” or perhaps “tuh-nuht” or something else? What about the problematic “WinNT” (pronounced “win en tee”), which is a mix of abbreviation and initialism? Once again, the solution depends on your domain. See:
http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/Papers/sproatetal.pdf
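If your domain is one where all-caps tokens really are read letter by letter, one possible offline preprocessing step is sketched below. The function name `expand_initialisms` is hypothetical, and the sketch assumes that your pronunciation dictionary already has entries for single letters, which may not be true of every dictionary.

```python
import re


def expand_initialisms(text):
    """Split each standalone run of 2+ ASCII capitals into separate letters."""
    return re.sub(r"\b[A-Z]{2,}\b", lambda m: " ".join(m.group()), text)


print(expand_initialisms("TNT exploded"))   # all-caps token is split
print(expand_initialisms("WinNT crashed"))  # mixed-case token is left alone
```

Note that this would also spell out acronyms that are pronounced as words (e.g., "NASA"), and it leaves mixed forms such as "WinNT" untouched, so the output still needs a manual check.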
One more thing is that it doesn't recognize composed words (eg. open-minded).
Now that would be slightly easier to handle. Though, why not just remove each dash? “open minded” and “open-minded” are pronounced the same as far as I can tell.
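The dash suggestion above can be sketched as a one-line preprocessing step. The function name `split_hyphenated` is hypothetical, and the sketch replaces the dash with a space (rather than deleting it outright) so that "open-minded" becomes the two dictionary words "open minded".

```python
import re


def split_hyphenated(text):
    """Replace a hyphen between two word characters with a space."""
    return re.sub(r"(?<=\w)-(?=\w)", " ", text)


print(split_hyphenated("an open-minded person"))
```

Dashes surrounded by spaces are left alone, since those are usually punctuation rather than part of a compound.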
Closing this issue. The request is interesting but difficult and outside of the scope of this project. Any party interested in working on this is welcome to fork.
Hi!
First of all, thanks for this great script!
I noticed some issues and I was wondering if you have an easy way to fix them.
I have transcriptions that contain isolated numbers (e.g., 6) and composed numbers (e.g., 2008). Do you have a way to improve the script to recognize them?
Also it doesn't recognize initials (eg. TNT).
One more thing is that it doesn't recognize composed words (eg. open-minded).
Of course, this can be done by adding all of these to the dictionary, but that would have to be repeated for each transcript with initials or composed words.
Thank you very much!