own-pt / openWordnet-PT

OpenWordnet-PT: an open access wordnet for Portuguese
http://openwordnet-pt.org

check consistency #112

Open arademaker opened 8 years ago

arademaker commented 8 years ago
  1. Check the PWN data against http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html to make sure we did not lose anything.
  2. Repeat the consistency check in the RDF.
  3. Apply Francis Bond's patch below and update wn30.ttl.
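For the first item, a minimal sketch of a count comparison, assuming synset ids of the hypothetical form `<offset>-<pos>` (for example `00001740-n`). The expected figures are the per-POS synset counts as I recall them from the WordNet 3.0 wnstats page; they should be verified against the linked page. `count_by_pos` and `compare_counts` are illustrative names, not part of any existing tool.

```python
from collections import Counter

# Synset counts per part of speech for WordNet 3.0 (assumed from the wnstats
# page; verify against the linked page before relying on them).
EXPECTED_SYNSETS = {"n": 82115, "v": 13767, "a": 18156, "r": 3621}


def count_by_pos(synset_ids):
    """Count synsets per POS from ids of the (hypothetical) form '<offset>-<pos>'."""
    counts = Counter()
    for sid in synset_ids:
        pos = sid.rsplit("-", 1)[1]
        # Fold satellite adjectives ('s') into adjectives ('a'), since the
        # wnstats adjective figure covers both.
        counts["a" if pos == "s" else pos] += 1
    return counts


def compare_counts(actual, expected=EXPECTED_SYNSETS):
    """Return {pos: (expected, actual)} for every POS whose count differs."""
    return {pos: (n, actual.get(pos, 0))
            for pos, n in expected.items()
            if actual.get(pos, 0) != n}
```

An empty result from `compare_counts` would mean that no synsets were lost for any part of speech; the same idea extends to word and sense counts.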

Just to follow up on this: currently, if you exclude domains, there are 5 entries in PWN 3.0 where two relations link the same pair of synsets, all arguably unnecessary, and one a known bug. These are all fixed in PWN 3.1. We will add a test for this in the Open Multilingual Wordnet.

In three cases there is both an 'also_see' and a 'similar_to', and we should just keep the 'similar_to'.

Synset('inattentive.a.01'): forgetful.s.03 also_sees forgetful.s.03 similar_tos

Synset('chromatic.a.03'): chestnut.s.01 also_sees chestnut.s.01 similar_tos

Synset('fertile.a.01'): conceptive.s.01 also_sees conceptive.s.01 similar_tos

In one case we have both an 'entailment' and a 'hypernym', and we should just keep the 'hypernym'.

Synset('breathe.v.01'): inhale.v.02 entailments inhale.v.02 hyponyms

And the bug: 'inhibit' is both a 'hypernym' and a 'hyponym' of 'restrain'.

Synset('restrain.v.01'): inhibit.v.04 hypernyms inhibit.v.04 hyponyms
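The five cases above suggest a small rule table. The following is only a sketch of how such a patch could be applied to a list of links, not the actual patch: links are assumed to be `(source, target, relation)` tuples using the relation names printed by the script, so the hypernym link between inhale.v.02 and breathe.v.01 shows up as `hyponyms` from the breathe.v.01 side. `PREFER` and `resolve_duplicates` are hypothetical names.

```python
from collections import defaultdict

# Preference rules read off the cases above: when both relations link the
# same (source, target) pair, keep the first and drop the second.
PREFER = [("similar_tos", "also_sees"), ("hyponyms", "entailments")]


def resolve_duplicates(links):
    """Apply PREFER to an iterable of (source, target, relation) triples.

    Returns (kept, unresolved). `kept` is the cleaned list of links;
    `unresolved` maps each pair that still has several relations (such as
    the restrain/inhibit bug) to those relations, for manual review.
    """
    by_pair = defaultdict(set)
    for source, target, relation in links:
        by_pair[(source, target)].add(relation)

    kept, unresolved = [], {}
    for (source, target), rels in sorted(by_pair.items()):
        for keep, drop in PREFER:
            if keep in rels and drop in rels:
                rels.discard(drop)
        if len(rels) > 1:
            unresolved[(source, target)] = sorted(rels)
        kept.extend((source, target, r) for r in sorted(rels))
    return kept, unresolved
```

Pairs left in `unresolved` are not modified, so a case like restrain.v.01 / inhibit.v.04 is only reported and still has to be fixed by hand.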

If you also allow domains, then there are quite a few more (61), e.g.

Synset('knock_on.n.01'): play.n.03 hypernyms rugby.n.01 part_holonyms rugby.n.01 topic_domains

Synset('ball_game.n.01'): baseball.n.01 hyponyms baseball.n.01 topic_domains field_game.n.01 hypernyms

Synset('bioterrorism.n.01'): terrorism.n.01 hypernyms terrorism.n.01 topic_domains

I attach the full list of synsets with duplicates (including domains).

P.S. Here is the script used to detect these:

from nltk.corpus import wordnet as pwn

# relations with domains (commented out; swap with the list below to include them)
# relations = ['also_sees', 'attributes', 'causes', 'entailments',
#              'hypernyms', 'hyponyms', 'instance_hypernyms', 'instance_hyponyms',
#              'member_holonyms', 'member_meronyms', 'part_holonyms',
#              'part_meronyms', 'region_domains', 'similar_tos',
#              'substance_holonyms', 'substance_meronyms', 'topic_domains',
#              'usage_domains']

# relations without domains
relations = ['also_sees', 'attributes', 'causes', 'entailments',
'hypernyms', 'hyponyms', 'instance_hypernyms', 'instance_hyponyms',
'member_holonyms', 'member_meronyms', 'part_holonyms',
'part_meronyms', 'similar_tos', 'substance_holonyms',
'substance_meronyms']

for s in pwn.all_synsets():
    # collect every (target synset, relation name) pair linked from s
    ttt = []
    for r in relations:
        ttt += [(t, r) for t in getattr(s, r)()]
    # report s if some target is reached through more than one relation
    justt = [t for (t, r) in ttt]
    if len(justt) > len(set(justt)):
        print("{}:\n{}\n\n".format(
            str(s),
            "\n".join("{}\t{}".format(t.name(), r) for (t, r) in sorted(ttt))))

arademaker commented 8 years ago

dupl-rel-pwn30.txt

More at https://lists.princeton.edu/cgi-bin/wa?A2=ind1603&L=wn-users&P=R86&1=wn-users&9=A&J=on&d=No+Match%3BMatch%3BMatches&z=4

arademaker commented 7 years ago

http://www.swi-prolog.org/pldoc/man?section=SyntaxAndSemantics

Can we use this to check the consistency of the RDF? (@fcbr's suggestion)
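Whichever tool ends up being used, one property worth checking on the RDF is the restrain/inhibit bug reported above: no two synsets should be each other's hypernym. A minimal sketch over plain `(subject, predicate, object)` tuples follows; the function name and the predicate label passed to it are hypothetical and do not reflect the vocabulary actually used in wn30.ttl.

```python
def find_mutual_links(triples, predicate):
    """Find pairs linked by `predicate` in both directions.

    Returns a sorted list of (a, b) with a <= b such that both
    a -predicate-> b and b -predicate-> a occur in `triples`.
    A pair with a == b marks a self-loop.
    """
    edges = {(s, o) for s, p, o in triples if p == predicate}
    return sorted({(min(s, o), max(s, o)) for s, o in edges if (o, s) in edges})
```

For a hierarchy relation such as hypernymy, any pair returned here indicates an inconsistency; longer cycles would need a proper graph traversal, which this sketch does not attempt.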