prosodylab / Prosodylab-Aligner

Python interface for forced audio alignment using HTK and SoX
http://prosodylab.org/tools/aligner/
MIT License

Dictionary issue #68

Closed MysteryPancake closed 6 years ago

MysteryPancake commented 6 years ago

Hi! I just downloaded your aligner to test it out, and I've come across a few dictionary issues.

The first problem I've encountered is that it appears many basic words are missing from the default dictionary, such as AIN'T, THAT'S, and YOU'LL.

I guess the default dictionary is for basic tests, so I tried the dictionary from the dictionary repository (https://github.com/prosodylab/prosodylab.dictionaries/blob/master/en.dict). Sadly these words are still missing.

I tried another dictionary, this time from CMUSphinx (https://github.com/cmusphinx/cmudict/blob/master/cmudict.dict). This one has a lot more words, but it gave me a few errors because of some # comments in the file.

I removed these comments, but then it gave me another error of the word s being out of order. I tried to sort it with sort.py, but this made no difference.
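
For reference, here is a rough sketch of the cleanup described above. It assumes that comments in the CMUSphinx file start with `#` (either on their own line or after a pronunciation), that alternate pronunciations are marked with a suffix such as `(2)`, and that the aligner wants uppercase headwords in plain code-point order. The name `clean_dictionary` is made up for illustration; it is not part of the aligner or its sort.py.

```python
import re

# Matches a trailing variant marker such as "(2)" on a headword (assumed format).
VARIANT = re.compile(r"\(\d+\)$")


def clean_dictionary(lines):
    """Strip comments, normalize headwords, and sort entries by code point."""
    entries = set()
    for line in lines:
        # Drop everything from the first '#' onward (assumed comment marker).
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        word, *phones = line.split()
        if not phones:
            continue  # skip malformed lines that have no pronunciation
        word = VARIANT.sub("", word).upper()
        entries.add((word, " ".join(phones)))
    # Python compares str values by code point, so "'S" lands before "A".
    return [f"{word} {pron}" for word, pron in sorted(entries)]


if __name__ == "__main__":
    sample = [
        "# header comment",
        "that's DH AE1 T S",
        "ain't EY1 N T # informal",
        "'s Z",
    ]
    print("\n".join(clean_dictionary(sample)))
```

Whether this ordering matches what the aligner checks is an assumption; if the "out of order" error persists, the first thing to compare is how the apostrophe and letter case are ranked.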

I'd really appreciate finding a working dictionary with more words in it, including acronyms such as the ones mentioned above. Trying to fix the CMUSphinx dictionary would also be fantastic.

Thank you for this software and any help with this!

kylebgorman commented 6 years ago

The dictionary distributed here is just a trivial fork of the classic CMU one. We're not really in the pronunciation dictionary business. Why don't you just add those words to the dictionary locally?

IMO 'S should be tokenized separately, as a single word, since it can attach to literally any category of word in English:

[Peter]'s mother
[The Queen of England]'s corgis
[The woman I saw yesterday]'s new hat
[The man you like]'s best friend

etc.

If you don't do it this way you will (slowly, inefficiently) have to add an "'S" form of the entire lexicon.
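
A minimal sketch of that tokenization, assuming transcripts are whitespace-separated and the lexicon is available as a set of uppercase headwords. The function `split_clitic` is a hypothetical helper, not something the aligner provides; tokens that already have their own entry (say, a locally added THAT'S) are left alone.

```python
def split_clitic(tokens, lexicon):
    """Split a trailing 'S into its own token unless the full form is listed."""
    out = []
    for token in tokens:
        token = token.upper()
        if token not in lexicon and len(token) > 2 and token.endswith("'S"):
            # PETER'S -> PETER + 'S
            out.extend([token[:-2], "'S"])
        else:
            out.append(token)
    return out


if __name__ == "__main__":
    lexicon = {"PETER", "MOTHER", "'S"}
    print(split_clitic("Peter's mother".split(), lexicon))
```

The dictionary would then need an entry for 'S itself, presumably with more than one pronunciation.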

You may want to make sure you don't have "smart" apostrophes---I'm not sure the underlying HTK library takes kindly to those with respect to sort order.
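
One way to rule that out, sketched in Python: map the curly single quotes (U+2018 and U+2019) to the plain ASCII apostrophe before the text reaches the aligner. The helper name `normalize_apostrophes` is illustrative only.

```python
# Translation table from curly single quotes to the ASCII apostrophe.
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'"})


def normalize_apostrophes(text):
    """Replace "smart" single quotes with a plain apostrophe."""
    return text.translate(SMART_QUOTES)


if __name__ == "__main__":
    print(normalize_apostrophes("THAT\u2019S"))
```

For what it's worth, U+2019 has a much higher code point than any ASCII letter, so in a code-point sort a word containing it lands in a different place than the same word with a plain apostrophe.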

kylebgorman commented 6 years ago

Acronyms are a whole open class, so they're not usually done in the pronunciation dictionary.

They can be read as words ("NATO"), letter sequences ("USA"), specialized versions thereof ("NAACP") and occasionally as the full form (if I see "SRY" I might read it as "sorry"). There is a bunch of literature on how to determine which, but the state of the art gives us relatively low recall, and it feels out of scope for this project.

MysteryPancake commented 6 years ago

Ok, thanks for the information. I didn't want to add the words myself because those are only a few examples out of the many, many words I would like to recognize. It'd probably take me longer than my lifetime to add the rest of the English language to the dictionary without using an existing model. The apostrophe advice should solve most of my problems, thanks!

One last question though, do you know why the CMUSphinx English Dictionary appears to be in the wrong order, even when sorted correctly?