Closed josepvalls closed 12 years ago
Thanks for the feedback. I agree that a/an should be assigned based on phonetics, and I think that is what the module does. If you have any other cases where it appears to be incorrect, please let me know.
In the case of "herb", the problem is the pronunciation. From what I can gather, the 'h' is sounded in all English-speaking countries except the United States. In the US, people say "urb", not "hurb". The module follows the Oxford English Dictionary, which sounds the 'h', and so "a herb" is correct. If more US exceptions arise, I could consider a "US" mode for departures from the Oxford.
http://www.wordreference.com/definition/herb https://secure.wikimedia.org/wikipedia/en/wiki/Herb#Pronunciation
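A "US" mode of the kind mentioned above could be sketched as a small per-dialect exception table that is consulted before a default rule. This is only an illustration and not how the inflect module is implemented: `choose_article`, `US_SILENT_H`, `COMMON_SILENT_H`, and the naive first-letter fallback are hypothetical names and assumptions.

```python
# Hypothetical sketch: per-dialect exceptions consulted before a naive default.
# None of these names come from the inflect module.

# Words whose leading 'h' is silent in US English but sounded elsewhere
# (illustrative, not exhaustive).
US_SILENT_H = {"herb"}

# Words whose leading 'h' is silent in both dialects (illustrative).
COMMON_SILENT_H = {"hour", "honest", "honor", "honour", "heir"}

VOWEL_LETTERS = set("aeiou")


def choose_article(word, us=False):
    """Return 'a' or 'an' for word, using exception sets and a spelling fallback."""
    w = word.lower()
    if w in COMMON_SILENT_H:
        return "an"
    if us and w in US_SILENT_H:
        return "an"
    # Naive fallback (assumption): decide by the first letter only.
    return "an" if w[:1] in VOWEL_LETTERS else "a"
```

With these tables, `choose_article("herb")` gives "a" and `choose_article("herb", us=True)` gives "an", while "herpes" stays "a" in both modes.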
Thanks for your quick and thorough response. I speak English as a second language, and I know too little about linguistics or Python to be of much help. I would like to share the code I've developed, in case it is of any use. I've been using this database for my phonetics lookups: http://www.keithv.com/software/giga/ According to its description, it is indeed a mainly American English corpus. I found 3224 discrepancies among 69705 phonetic transcriptions (the herb example came up by chance).
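import fileinput

import inflect

inflector = inflect.engine()
dictionary_path = 'lm_giga_64k_nvp.sphinx.dic'
# First phonemes that count as vowel sounds and therefore take "an"
vowel_phonemes = ['aa','ae','ah','ao','aw','ay','eh','er','ey','ow','oy','uh','uw']
total = 0
mismatches = 0
for line in fileinput.input(dictionary_path):
    total += 1
    # Each line is "word<TAB>space-separated phonemes"
    word, phonemas = line.rstrip('\n').split('\t')
    # Article expected from the first phoneme
    det = 'an' if phonemas.split(' ')[0] in vowel_phonemes else 'a'
    # Article chosen by inflect ("a word" or "an word")
    inflected = inflector.a(word).split(' ')[0]
    if inflected != det:
        mismatches += 1
        print("%s: \t%s\t%s\t%s" % (word, inflected, det, phonemas))
# Entries read, mismatches found, mismatch percentage
print(total, mismatches, float(mismatches) / total * 100)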
I can dump the output and send it to you or commit it somewhere if it could be of any help.
Thanks. Send me the list and I'll have a look at it. I'll send you my email address.
Thanks for the list. I've used it to fix a bug and add some more exceptions.
You are welcome! I'm not sure whether I can be of any further help, but just let me know. Some sort of American/British switch would definitely be great!
A/An should be assigned based on phonetics and not on orthographic representation. There should be either a short list of known exceptions or a phonetic transcription feature (with its own exceptions). The former would be easier.
An herb
A herpes
http://owl.english.purdue.edu/owl/resource/591/01/
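The second option, a phonetic lookup, might look roughly like the sketch below. The tiny `PRONUNCIATIONS` table, the ARPAbet-style phone names, the US pronunciations, and the function name are all assumptions made for illustration; a real version would load a full pronouncing dictionary and decide what to do with unknown words.

```python
# Hypothetical sketch: choose the article from the first phoneme of a word.
# The table uses ARPAbet-style phones and US pronunciations (assumption).

VOWEL_PHONES = {
    "aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
    "ey", "ih", "iy", "ow", "oy", "uh", "uw",
}

# Toy pronunciation table; a real one would come from a dictionary file.
PRONUNCIATIONS = {
    "herb": ["er", "b"],
    "herpes": ["hh", "er", "p", "iy", "z"],
    "hour": ["aw", "er"],
    "university": ["y", "uw", "n", "ah", "v", "er", "s", "ah", "t", "iy"],
}


def article_from_phones(word):
    """Return 'a' or 'an' based on the first phone, or None if the word is unknown."""
    phones = PRONUNCIATIONS.get(word.lower())
    if not phones:
        return None
    return "an" if phones[0] in VOWEL_PHONES else "a"
```

Here "herb" and "hour" get "an", "herpes" and "university" get "a", and a word missing from the table returns `None` so the caller can fall back to a spelling rule or an exception list.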