globalwordnet / OMW

The Open Multilingual Wordnet
http://compling.hss.ntu.edu.sg/omw/
MIT License

PWN does not distinguish scientific names and formulae from normal words #44

Open fcbond opened 5 years ago

fcbond commented 5 years ago

Lemmas like genus Rhynia or Canis familiaris should be marked as special forms, using the tag category sform:

<Lemma writtenForm='Canis familiaris' partOfSpeech='n'>
  <Tag category = 'sform'>scientific name</Tag>
</Lemma> 
<Lemma writtenForm='H₂O' partOfSpeech='n'>
  <Tag category = 'sform'>chemical formula</Tag>
  <Form = 'H2O'>
    <Tag category = 'sform'>chemical formula</Tag>
  </Form>
</Lemma>

fcbond commented 5 years ago

You can identify them like this:

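from nltk.corpus import wordnet as pwn

# Taxonomic rank words that can prefix a scientific name, e.g. 'genus_Canis'.
things = "domain kingdom phylum class order family genus".split()

# Collect genus names from noun lemmas of the form 'genus_X';
# title() capitalizes the part after the prefix.
genus = set()
for l in pwn.all_lemma_names('n'):
    if l.startswith('genus_'):
        genus.add(l[len('genus_'):].title())

for ss in pwn.all_synsets('n'):
    # Only look at synsets in the animal and plant lexicographer files.
    if ss.lexname() not in ['noun.animal', 'noun.plant']:
        continue
    for l in ss.lemma_names():
        ll = l.strip().split('_')
        if len(ll) >= 2:
            # print(ll, ss.definition())  # uncomment to inspect every candidate
            if ll[0] in genus and ll[1] not in ['bug']:
                # First word is a known genus: print synset ID and full name.
                print("\t".join([pwn.ss2of(ss), l.replace('_', ' ')]))
            elif ll[0] in things:
                # Rank-prefixed lemma: print synset ID and the bare name.
                print("\t".join([pwn.ss2of(ss), ll[1].title()]))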
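
To try the matching rules without the WordNet data installed, here is a minimal sketch that applies the same two checks to a hand-written list of lemma names. The function name `scientific_names` and the sample lemmas are assumptions made for illustration; they are not part of NLTK or OMW.

```python
RANKS = {"domain", "kingdom", "phylum", "class", "order", "family", "genus"}

def scientific_names(lemmas):
    """Return the sorted scientific names found in underscore-joined lemma names."""
    # Pass 1: 'genus_X' lemmas tell us which first words are genus names.
    genera = {l[len("genus_"):].title() for l in lemmas
              if l.lower().startswith("genus_")}
    found = set()
    for l in lemmas:
        parts = l.strip().split("_")
        if len(parts) < 2:
            continue
        if parts[0] in genera and parts[1] != "bug":
            found.add(l.replace("_", " "))   # e.g. 'Canis familiaris'
        elif parts[0] in RANKS:
            found.add(parts[1].title())      # e.g. 'genus_Canis' gives 'Canis'
    return sorted(found)

sample = ["genus_Canis", "Canis_familiaris", "dog", "domestic_dog", "genus_Rhynia"]
print(scientific_names(sample))
```

Ordinary multiword lemmas such as `domestic_dog` are skipped because their first word is neither a known genus nor a rank word.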

Or from SQL, something like:

sqlite> select substr(lemma,7) from f where lemma glob 'genus *' limit 5;
substr(lemma,7)
---------------
Heliobacter
Aerobacter
Rhizobium
Agrobacterium
Bacillus
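
The same query can also be run from Python with the standard `sqlite3` module. The sketch below uses a throwaway in-memory database: the table name `f` and column `lemma` follow the query above, but the single-column schema and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f (lemma TEXT)")  # assumed minimal schema
conn.executemany(
    "INSERT INTO f (lemma) VALUES (?)",
    [("genus Rhizobium",), ("genus Bacillus",), ("dog",), ("Canis familiaris",)],
)

# GLOB is case-sensitive and uses * as its wildcard; substr() is 1-indexed,
# so position 7 is the first character after 'genus '.
rows = conn.execute(
    "SELECT substr(lemma, 7) FROM f WHERE lemma GLOB 'genus *' ORDER BY lemma"
).fetchall()
genera = [name for (name,) in rows]
print(genera)
conn.close()
```

Only the rows that start with `genus ` match, and the prefix is stripped from each result.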

arademaker commented 5 years ago

Is the goal to have a special way to present such terms in the interface, or something deeper?

fcbond commented 5 years ago

They behave differently from other NPs (non-headed, don't inflect) so I would like to distinguish them on those grounds. It is also useful to distinguish them for linking to other resources.

jmccrae commented 5 years ago

Hi, please see the DTD. The correct XML would be:

<Lemma writtenForm='Canis familiaris' partOfSpeech='n'>
  <Tag category = 'sform'>scientific name</Tag>
</Lemma> 
<Lemma writtenForm='H₂O' partOfSpeech='n'>
  <Tag category = 'sform'>chemical formula</Tag>
</Lemma>
<Form writtenForm= 'H2O'>
  <Tag category = 'sform'>chemical formula</Tag>
</Form>

fcbond commented 5 years ago

Thanks, silly mistake on my part. I was hoping for comments on the content (e.g. 'sform', 'scientific name' and 'chemical formula'). I think we also discussed 'atomic symbol'.

I will go ahead with these for now.
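
As a rough sketch of how entries tagged this way might be generated, the following uses Python's standard `xml.etree.ElementTree` and follows the corrected layout, with `Form` as a sibling of `Lemma`. The helper name `tagged_entry`, the `LexicalEntry` wrapper element, and the `id` value are assumptions made for illustration; check them against the DTD before relying on them.

```python
import xml.etree.ElementTree as ET

def tagged_entry(entry_id, lemma, tag_text, variants=()):
    """Build an entry whose lemma and variant forms carry an 'sform' tag."""
    # The wrapper element and its id are assumed, not taken from the thread.
    entry = ET.Element("LexicalEntry", id=entry_id)
    lemma_el = ET.SubElement(entry, "Lemma", writtenForm=lemma, partOfSpeech="n")
    ET.SubElement(lemma_el, "Tag", category="sform").text = tag_text
    for variant in variants:
        # Variant forms are siblings of Lemma, each with its own Tag.
        form_el = ET.SubElement(entry, "Form", writtenForm=variant)
        ET.SubElement(form_el, "Tag", category="sform").text = tag_text
    return entry

entry = tagged_entry("example-water-n", "H₂O", "chemical formula", variants=["H2O"])
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping the tag text as a plain argument makes it easy to swap in 'scientific name' or 'atomic symbol' once the set of values is settled.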

jmccrae commented 5 years ago

Hi Francis,

It would be good to have a fixed list of names here, but maybe this is hard at this point. Currently, for example in LexInfo we have the following list of values for term types:

fcbond commented 5 years ago

Thanks. Not quite a fixed set yet then :-).

I will go ahead with what we have.

On Tue, Feb 26, 2019 at 5:33 PM John McCrae wrote:

Hi Francis,

It would be good to have a fixed list of names here, but maybe this is hard at this point. Currently, for example in LexInfo we have the following list of values for term types:

  • 'abbreviated form'
  • 'clipped term'
  • 'common name'
  • 'compound(cjkv)'
  • 'entry term'
  • 'full form'
  • 'international scientific term'
  • 'logical expression'
  • 'part number'
  • 'phraseological unit'
  • 'set phrase'
  • 'short form'
  • 'standard text'
  • 'transcribed form'
  • abbreviation
  • acronym
  • appellation
  • compound
  • contraction
  • equation
  • expression
  • formula
  • idiom
  • initialism
  • internationalism
  • nucleus
  • productName
  • proverb
  • sku
  • string
  • stringCategory
  • symbol


-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University