Open fcbond opened 5 years ago
You can identify them like this:
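from nltk.corpus import wordnet as pwn

# Taxonomic rank words that can prefix a lemma, e.g. 'genus_Canis'
things = "domain kingdom phylum class order family genus".split()

# Collect every genus name from lemmas of the form 'genus_X'
genus = set()
for l in pwn.all_lemma_names('n'):
    if l.startswith('genus_'):
        #print(l[6:])
        genus.add(l[6:].title())

for ss in pwn.all_synsets('n'):
    # Only animal and plant synsets are checked
    if ss.lexname() not in ['noun.animal', 'noun.plant']:
        continue
    for l in ss.lemma_names():
        ll = l.strip().split('_')
        if len(ll) >= 2:
            print(ll, ss.definition())  # debugging output
            if ll[0] in genus and ll[1] not in ['bug']:
                # Lemma starts with a known genus: print synset ID and full name
                print("\t".join([pwn.ss2of(ss), l.replace('_', ' ')]))
            elif ll[0] in things:
                # Lemma starts with a rank word: print synset ID and the name after it
                print("\t".join([pwn.ss2of(ss), ll[1].title()]))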
Or from SQL, something like:
sqlite> select substr(lemma,7) from f where lemma glob 'genus *' limit 5;
substr(lemma,7)
---------------
Heliobacter
Aerobacter
Rhizobium
Agrobacterium
Bacillus
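The same lookup can be run from Python with the standard `sqlite3` module. This is only a minimal sketch: the table `f` and column `lemma` are taken from the query above, while the in-memory database, the sample rows, and the helper name `genus_names` are made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for the real database: one table `f` with a `lemma` column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f (lemma TEXT)")
conn.executemany(
    "INSERT INTO f (lemma) VALUES (?)",
    [
        ("genus Bacillus",),
        ("genus Rhizobium",),
        ("dog",),
        ("genus",),
        ("Canis familiaris",),
    ],
)


def genus_names(connection, limit=5):
    """Return the text after the 'genus ' prefix for each matching lemma."""
    rows = connection.execute(
        # 'genus ' is six characters long, so the name starts at position 7.
        "SELECT substr(lemma, 7) FROM f WHERE lemma GLOB 'genus *' LIMIT ?",
        (limit,),
    )
    return [name for (name,) in rows]


names = genus_names(conn)
print(names)
```

Note that `GLOB` is case-sensitive, so a lemma written as "Genus ..." would not be matched by this pattern.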
Is the goal to have special ways to present such terms in the interface, or something deeper?
They behave differently from other NPs (non-headed, don't inflect) so I would like to distinguish them on those grounds. It is also useful to distinguish them for linking to other resources.
Hi, please see the DTD. The correct XML would be:
<Lemma writtenForm='Canis familiaris' partOfSpeech='n'>
  <Tag category='sform'>scientific name</Tag>
</Lemma>
<Lemma writtenForm='H₂O' partOfSpeech='n'>
  <Tag category='sform'>chemical formula</Tag>
</Lemma>
<Form writtenForm='H2O'>
  <Tag category='sform'>chemical formula</Tag>
</Form>
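If these entries are generated by script, something along these lines could produce the tagged elements. It is a rough sketch using Python's standard `xml.etree.ElementTree`; the helper name `tagged_element` is hypothetical, and the element and attribute names are simply copied from the XML above, not checked against the DTD.

```python
import xml.etree.ElementTree as ET


def tagged_element(kind, written_form, tag_text, part_of_speech=None):
    """Build a <Lemma> or <Form> element carrying one 'sform' <Tag>."""
    attrs = {"writtenForm": written_form}
    if part_of_speech is not None:
        attrs["partOfSpeech"] = part_of_speech
    element = ET.Element(kind, attrs)
    # The category name 'sform' is the one proposed in this thread.
    tag = ET.SubElement(element, "Tag", {"category": "sform"})
    tag.text = tag_text
    return element


lemma = tagged_element("Lemma", "Canis familiaris", "scientific name", "n")
form = tagged_element("Form", "H2O", "chemical formula")

xml_text = ET.tostring(lemma, encoding="unicode")
print(xml_text)
print(ET.tostring(form, encoding="unicode"))
```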
Thanks, silly mistake on my part. I was hoping for comments on the content (e.g. 'sform', 'scientific name' and 'chemical formula'). I think we also discussed 'atomic symbol'.
I will go ahead with these for now.
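Until a fixed list is agreed, a small check like the one below could at least keep the values consistent. This is only a sketch: the three values are the ones mentioned in this thread, not an agreed or exhaustive set, and the names `SFORM_VALUES` and `check_sform` are hypothetical.

```python
# Values taken from this thread; not an agreed or exhaustive list.
SFORM_VALUES = {"scientific name", "chemical formula", "atomic symbol"}


def check_sform(value):
    """Return the normalised value if it is a known 'sform' label, else raise."""
    # Collapse runs of whitespace and lower-case before comparing.
    normalised = " ".join(value.split()).lower()
    if normalised not in SFORM_VALUES:
        raise ValueError(f"unknown sform value: {value!r}")
    return normalised


print(check_sform("Scientific  name"))
```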
Hi Francis,
It would be good to have a fixed list of names here, but maybe this is hard at this point. Currently, for example in LexInfo we have the following list of values for term types:
Thanks. Not quite a fixed set yet then :-).
I will go ahead with what we have.
On Tue, Feb 26, 2019 at 5:33 PM John McCrae <notifications@github.com> wrote:
Hi Francis,
It would be good to have a fixed list of names here, but maybe this is hard at this point. Currently, for example in LexInfo we have the following list of values for term types:
- 'abbreviated form'
- 'clipped term'
- 'common name'
- 'compound(cjkv)'
- 'entry term'
- 'full form'
- 'international scientific term'
- 'logical expression'
- 'part number'
- 'phraseological unit'
- 'set phrase'
- 'short form'
- 'standard text'
- 'transcribed form'
- abbreviation
- acronym
- appellation
- compound
- contraction
- equation
- expression
- formula
- idiom
- initialism
- internationalism
- nucleus
- productName
- proverb
- sku
- string
- stringCategory
- symbol
-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University
Lemmas like genus Rhynia or Canis familiaris should be marked as special forms.
sform