jaraco / inflect

Correctly generate plurals, ordinals, indefinite articles; convert numbers to words
https://pypi.org/project/inflect
MIT License

A/An exceptions #6

Closed josepvalls closed 12 years ago

josepvalls commented 13 years ago

A/An should be assigned based on phonetics and not orthographic representation. There should be either a short list of known exceptions or a phonetic transcription facility (with its own exceptions). The former would be easier.

An herb
A herpes

http://owl.english.purdue.edu/owl/resource/591/01/
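The "little list of known exceptions" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `indefinite_article` and both exception sets are hypothetical and are not taken from inflect, and the fallback is a plain first-letter rule rather than anything the module actually does.

```python
# Hypothetical exception lists; the entries are illustrative, not from inflect.
AN_EXCEPTIONS = {"hour", "honest", "heir", "honor"}  # silent "h"
A_EXCEPTIONS = {"unicorn", "university", "one", "european"}  # consonant sound

VOWEL_LETTERS = ("a", "e", "i", "o", "u")


def indefinite_article(word: str) -> str:
    """Pick "a" or "an": exceptions first, then a first-letter fallback."""
    key = word.lower()
    if key in AN_EXCEPTIONS:
        return "an"
    if key in A_EXCEPTIONS:
        return "a"
    # Orthographic fallback: only right when spelling matches the sound.
    return "an" if key.startswith(VOWEL_LETTERS) else "a"


for w in ("hour", "unicorn", "apple", "herpes"):
    print(indefinite_article(w), w)
```

The lists carry all of the phonetic knowledge here, so the approach is only as good as their coverage.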

pwdyson commented 13 years ago

Thanks for the feedback. I agree that a/an should be assigned based on phonetics, and I think that is what the module does. If you have any other cases where it appears to be incorrect, please let me know.

In the case of "herb" the problem is with the pronunciation. From what I can gather, the 'h' is sounded in all English-speaking countries except the United States. In the US, people say "urb", not "hurb". The module follows the Oxford English Dictionary, which sounds the 'h', and so "a herb" is correct. If more US exceptions arise I could consider a "US" mode for departures from the Oxford.

http://www.wordreference.com/definition/herb
https://secure.wikimedia.org/wikipedia/en/wiki/Herb#Pronunciation
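A "US" mode of the kind mentioned above might look something like this sketch. Everything in it is an assumption for illustration: the `dialect` parameter, the `US_OVERRIDES` table, and the stand-in `spelling_rule` are hypothetical and do not describe inflect's API. The only override listed is the "herb" case from this thread.

```python
from typing import Callable

# Hypothetical table of US departures from the default (Oxford) behaviour.
US_OVERRIDES = {"herb": "an"}  # "h" is silent in US pronunciation


def spelling_rule(word: str) -> str:
    """Stand-in default rule; a real default would be phonetic."""
    vowels = ("a", "e", "i", "o", "u")
    return "an" if word.lower().startswith(vowels) else "a"


def article(
    word: str,
    dialect: str = "oxford",
    base: Callable[[str], str] = spelling_rule,
) -> str:
    """Return the article, consulting US overrides only in "us" mode."""
    if dialect == "us":
        override = US_OVERRIDES.get(word.lower())
        if override is not None:
            return override
    return base(word)


print(article("herb"), "herb")                # default (Oxford) mode
print(article("herb", dialect="us"), "herb")  # US mode
```

Keeping the overrides in a separate table means the default behaviour is untouched unless the caller opts in.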

josepvalls commented 13 years ago

Thanks for your quick and thorough response. I speak English as a second language, and I know little about linguistics and not enough Python to be able to help. I would like to share the code I've developed in case it is of any use. I've been using this database for my phonetic lookups: http://www.keithv.com/software/giga/ According to its description, it is a mainly American English corpus. I found 3224 discrepancies among 69705 phonetic transcriptions (the herb example was by chance).

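import fileinput

import inflect

inflector = inflect.engine()
path = 'lm_giga_64k_nvp.sphinx.dic'
# Phonemes treated as vowel sounds: a word starting with one takes "an".
vowels = ['aa','ae','ah','ao','aw','ay','eh','er','ey','ow','oy','uh','uw']
total = 0
mismatches = 0
for line in fileinput.input(path):
    total += 1
    # Each line is "word<TAB>space-separated phonemes".
    word, phonemas = line.rstrip().split("\t")
    det = 'an' if phonemas.split(' ')[0] in vowels else 'a'
    # inflect returns "a word" or "an word"; keep only the article.
    inflected = inflector.a(word).split(' ')[0]
    if inflected != det:
        mismatches += 1
        print("%s: \t%s\t%s\t%s" % (word, inflected, det, phonemas))
# Entries read, mismatches found, and the mismatch percentage.
print(total, mismatches, float(mismatches) / total * 100)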

I can dump the output and send it to you or commit it somewhere if it could be of any help.
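The phoneme check in the script above can also be isolated into a small function that is testable without the dictionary file. This is a sketch under assumptions: the name `article_from_entry` is hypothetical, the entry format (word, tab, space-separated lowercase ARPAbet-style phonemes) is inferred from the script, and the vowel set here additionally includes 'ih' and 'iy', which are vowels in ARPAbet but are not in the script's list. Whether this particular dictionary uses those symbols is not confirmed here.

```python
# ARPAbet-style vowel symbols. "ih" and "iy" are additions relative to
# the list in the script above (an assumption about the phoneme set).
ARPABET_VOWELS = frozenset({
    "aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
    "ey", "ih", "iy", "ow", "oy", "uh", "uw",
})


def article_from_entry(line):
    """Return (word, article) for one "word<TAB>phonemes" dictionary line."""
    word, phonemes = line.rstrip().split("\t", 1)
    # Lowercase and drop any stress digits (e.g. "AH0") before the lookup.
    first = phonemes.split()[0].lower().rstrip("012")
    return word, ("an" if first in ARPABET_VOWELS else "a")


print(article_from_entry("herb\ter b\n"))
print(article_from_entry("herpes\thh er p iy z\n"))
```

With the American transcription "er b", the function picks "an" for "herb", which is exactly the kind of disagreement with the Oxford-based behaviour that the script reports.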

pwdyson commented 13 years ago

Thanks. Send me the list and I'll have a look at it. I'll send you my email address.

pwdyson commented 12 years ago

Thanks for the list. I've used it to fix a bug and add some more exceptions.

josepvalls commented 12 years ago

You are welcome! I'm not sure whether I can be of any further help, but just let me know. Some sort of American/British switch would definitely be great!