hdaSprachtechnologie / odenet

Open German WordNet
Creative Commons Attribution Share Alike 4.0 International

Mismatched part-of-speech on hypernyms #31

Closed goodmami closed 2 years ago

goodmami commented 3 years ago

Generally we expect the hypernyms of a synset to have the same part-of-speech (technically the "synset type") as the synset itself. The Open English WordNet, for example, checks for such errors in its validation script. OdeNet has 1261 synset-hypernym pairs where the part-of-speech does not agree:

>>> import wn
>>> odenet = wn.Wordnet('odenet')
>>> hyp_mismatches = [
...   (ss, hyp)
...   for pos in 'nvar'  # OdeNet does not have synsets with pos='s'
...   for ss in odenet.synsets(pos=pos)
...   for hyp in ss.hypernyms()
...   if hyp.pos != ss.pos
... ]
>>> len(hyp_mismatches)
1261

Here's a sample of the first 10:

>>> for ss, hyp in hyp_mismatches[:10]:
...   print(ss, ss.lemmas())
...   print(hyp, hyp.lemmas())
...   print('-' * 20)
... 
Synset('odenet-10026-n') ['Kunststoff', 'Plastik', 'Plaste', 'Plast', 'organisches Polymer']
Synset('odenet-26274-a') ['stichhaltig', 'sicher', 'wasserdicht', 'gesichert', 'sicher wie das Amen in der Kirche', 'handfest', 'hieb- und stichfest', 'belastbar']
--------------------
Synset('odenet-10046-n') ['Perversion', 'Abartigkeit', 'Perversität']
Synset('odenet-28696-v') ['sich krümmen', 'sich beugen', 'sich biegen']
--------------------
Synset('odenet-10069-n') ['Beauftragter', 'Bote', 'Bevollmächtigter', 'Delegierter', 'Emissär', 'Abgesandter', 'Kurier', 'Parlamentär', 'Delegat', 'Ordonnanz']
Synset('odenet-35108-a') ['bezeichnend', '(jemandem/einer Sache) eigen', '(jemandem/einer Sache) eigentümlich', 'charakteristisch für', 'gekennzeichnet', 'kennzeichnend', 'symptomatisch für', 'spezifisch für', 'charakterisiert', 'typisch für']
--------------------
Synset('odenet-10088-n') ['Prüfungsteilnehmer', 'Proband', 'Prüfling', 'Prüfungskandidat']
Synset('odenet-7482-a') ['jemand', 'irgendjemand', 'jeder beliebige']
--------------------
Synset('odenet-10107-n') ['Waffengang', 'Duell', 'Zweikampf']
Synset('odenet-2505-v') ['strampeln', 'ringen']
--------------------
Synset('odenet-10125-n') ['Hermaphrodit', 'Intersex', 'Zwitter', 'Gynander']
Synset('odenet-5953-a') ['hybrid', 'zwitterhaft']
--------------------
Synset('odenet-10271-n') ['Biegung', 'Beugung', 'Flexion']
Synset('odenet-28696-v') ['sich krümmen', 'sich beugen', 'sich biegen']
--------------------
Synset('odenet-1039-n') ['Ernte', 'Ernteertrag', 'Lese', 'Auslese']
Synset('odenet-479-v') ['billigen', 'Zustimmung geben', 'grünes Licht geben', 'verabschieden', 'einwilligen', 'Okay geben', 'Segen geben', 'abnicken', 'Placet geben', 'seine Zustimmung erteilen', 'erlauben', 'absegnen', 'in Kraft setzen', 'genehmigen', 'Erlaubnis erteilen', 'zustimmen']
--------------------
Synset('odenet-10475-n') ['Scheck', 'Bankanweisung']
Synset('odenet-4253-v') ['abkommandieren', 'detachieren', 'auswählen']
--------------------
Synset('odenet-10558-n') ['abzappeln', 'tanzen', 'abtanzen', 'schwofen', 'das Tanzbein schwingen', 'abhotten']
Synset('odenet-12904-v') ['bewegen', 'in Bewegung setzen']
--------------------
hdaSprachtechnologie commented 2 years ago

I currently have no good idea how to fix these automatically. Maybe just delete all those cases?

goodmami commented 2 years ago

Probably. But see below for a different view on things, where I count synsets by the total number of hypernyms and how many of them have the wrong POS. For those where all hypernyms have the wrong POS, I also count how many might have a good hypernym by mirroring the Open English WordNet.

>>> from collections import defaultdict
>>> import wn
>>> de = wn.Wordnet('odenet')
>>> en = wn.Wordnet('ewn:2021')  # prerelease
>>> ewn_alternatives = 0  # if only mismatched hypernyms, is there a good hypernym in OEWN?
>>> mismatch_counts = defaultdict(int)  # {(num_hyps, num_mismatched): count}
>>> for ss in de.synsets():
...   hyps = ss.hypernyms()
...   miscnt = (len(hyps), sum(1 for hyp in hyps if hyp.pos != ss.pos))
...   mismatch_counts[miscnt] += 1
...   # if all hypernyms have mismatched POS and the synset has an ILI, maybe we can get a hypernym from OEWN
...   if miscnt[1] > 0 and miscnt[0] == miscnt[1] and ss.ili is not None:
...     for enss in en.synsets(ili=ss.ili.id):
...       if any(enhyp.translate(lexicon='odenet') for enhyp in enss.hypernyms()):
...         ewn_alternatives += 1
...         break
... 
>>> for (num_hyps, num_mismatch), count in mismatch_counts.items():
...   print(f'{num_hyps} hypernyms, of which {num_mismatch} have a mismatched POS : {count}')
... 
1 hypernyms, of which 0 have a mismatched POS : 7967
0 hypernyms, of which 0 have a mismatched POS : 26310
1 hypernyms, of which 1 have a mismatched POS : 1199
2 hypernyms, of which 0 have a mismatched POS : 685
2 hypernyms, of which 1 have a mismatched POS : 58
3 hypernyms, of which 0 have a mismatched POS : 42
3 hypernyms, of which 1 have a mismatched POS : 5
4 hypernyms, of which 1 have a mismatched POS : 1
4 hypernyms, of which 0 have a mismatched POS : 1
>>> ewn_alternatives
271

Good news: 64 synsets with a mismatched-POS hypernym also have at least one other hypernym with the correct POS. Of the 1199 synsets whose only hypernym is mismatched, 271 have a corresponding synset in the Open English WordNet such that (1) it has a hypernym, and (2) that hypernym has a corresponding synset in OdeNet. These are easy to fix automatically, and the number is manageable for hand-verification.

The remaining 928 should probably just have the hypernym links removed.
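
A minimal sketch of how replacement candidates for those 271 synsets might be collected, assuming the same `wn` calls used in the session above (`synsets(ili=...)`, `hypernyms()`, `translate(lexicon=...)`). The function name `suggest_hypernyms` and the extra same-POS filter are illustrative assumptions, not something already in OdeNet or `wn`:

```python
def suggest_hypernyms(ss, en_lexicon, target_lexicon="odenet"):
    """Return target-lexicon synsets that might replace ss's mismatched hypernyms.

    Hypothetical helper: candidates are the translations of the hypernyms of
    any English synset sharing ss's ILI, kept only if their POS equals ss.pos.
    """
    if ss.ili is None:
        return []  # without an ILI there is nothing to mirror
    candidates = []
    for en_ss in en_lexicon.synsets(ili=ss.ili.id):
        for en_hyp in en_ss.hypernyms():
            for cand in en_hyp.translate(lexicon=target_lexicon):
                # Skip candidates that would recreate a POS mismatch,
                # self-links, and duplicates.
                if cand.pos == ss.pos and cand != ss and cand not in candidates:
                    candidates.append(cand)
    return candidates
```

With the objects from the session above, this would be called as `suggest_hypernyms(ss, en)` for each synset whose hypernyms are all mismatched. An empty result would leave the synset among those whose links are simply removed, and any non-empty result would still need hand-verification.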

I also noticed that the total here (1199 + 58 + 5 + 1 == 1263) is slightly higher than my earlier count of 1261. I believe this is because I only looked at the POS values 'n', 'v', 'a', and 'r', but there are others in OdeNet, and no 'r':

>>> {ss.pos for ss in de.synsets()}
{'x', 'p', 'a', 'v', 'n'}
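
One way to check that explanation would be to tally the mismatched pairs by POS without restricting the loop to a fixed list of POS values. This is only a sketch: `mismatch_tally` is a hypothetical helper name, and it assumes nothing beyond the `pos` attribute and `hypernyms()` method already used above.

```python
from collections import Counter

def mismatch_tally(synsets):
    """Count (synset POS, hypernym POS) pairs where the two differ."""
    tally = Counter()
    for ss in synsets:
        for hyp in ss.hypernyms():
            if hyp.pos != ss.pos:
                tally[(ss.pos, hyp.pos)] += 1
    return tally
```

Calling `mismatch_tally(de.synsets())` would give the breakdown. If the explanation above holds, the entries whose first element is 'x' or 'p' should account for the difference between the two counts.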

rwingerter55 commented 2 years ago

FWIW, here is the PoS count I get in my RDF repository.

partOfSpeech  count
noun          75823
verb          22973
adjective     20873
adposition      397
other_pos        41

other_pos contains phrases. adposition contains nouns, verbs and adjectives.

goodmami commented 2 years ago

@rwingerter55 thanks, but for this issue I'm looking at synsets and not lexical entries.

By the way, I wondered whether the work to reduce redundant lexical entries had pruned some empty synsets, which might in turn have reduced the number of synsets with mismatched-POS hypernyms, but it appears to have had little effect. Here are the numbers (same code as above, but sorted) from the latest version of OdeNet:

0 hypernyms, of which 0 have a mismatched POS : 26300
1 hypernyms, of which 0 have a mismatched POS : 7972
1 hypernyms, of which 1 have a mismatched POS : 1197
2 hypernyms, of which 0 have a mismatched POS : 684
2 hypernyms, of which 1 have a mismatched POS : 58
3 hypernyms, of which 0 have a mismatched POS : 42
3 hypernyms, of which 1 have a mismatched POS : 5
4 hypernyms, of which 0 have a mismatched POS : 1
4 hypernyms, of which 1 have a mismatched POS : 1
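
The sorted listing can be reproduced by sorting the dictionary items before printing. A small sketch, assuming `mismatch_counts` is the `{(num_hyps, num_mismatched): count}` dictionary built earlier; `format_counts` is a hypothetical helper name:

```python
def format_counts(mismatch_counts):
    """Return one report line per (num_hyps, num_mismatched) key, in sorted order."""
    return [
        f'{num_hyps} hypernyms, of which {num_mismatch} have a mismatched POS : {count}'
        for (num_hyps, num_mismatch), count in sorted(mismatch_counts.items())
    ]
```

Printing the result with `print('\n'.join(format_counts(mismatch_counts)))` orders the lines first by number of hypernyms and then by number of mismatches.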

hdaSprachtechnologie commented 2 years ago

In the process of reducing redundant lexical entries, I deleted the (few) synsets that were empty at that point.

hdaSprachtechnologie commented 2 years ago

I have now deleted all relations with mismatched POS.
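
For readers who want to apply the same cleanup to their own copy, here is a rough sketch of one way it could be done on a WN-LMF XML file. It is not necessarily how the deletion was actually carried out. It assumes the usual WN-LMF layout (`Synset` elements with `id` and `partOfSpeech` attributes, and `SynsetRelation` children with `relType` and `target`), and it assumes only `hypernym` and `hyponym` links are checked; `drop_mismatched_relations` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumption: only the hypernym link and its hyponym inverse are checked.
HIERARCHY_RELS = {"hypernym", "hyponym"}

def drop_mismatched_relations(root):
    """Remove hierarchy relations whose target has a different POS.

    Returns the number of SynsetRelation elements removed.
    """
    pos_by_id = {ss.get("id"): ss.get("partOfSpeech") for ss in root.iter("Synset")}
    removed = 0
    for ss in root.iter("Synset"):
        # Iterate over a copy because elements are removed inside the loop.
        for rel in list(ss.findall("SynsetRelation")):
            if rel.get("relType") not in HIERARCHY_RELS:
                continue
            target_pos = pos_by_id.get(rel.get("target"))
            # Relations pointing at unknown targets are left untouched.
            if target_pos is not None and target_pos != ss.get("partOfSpeech"):
                ss.remove(rel)
                removed += 1
    return removed
```

A possible usage would be `tree = ET.parse(path)`, then `drop_mismatched_relations(tree.getroot())`, then `tree.write(out_path, encoding="utf-8", xml_declaration=True)`, where `path` and `out_path` are placeholders. Note that `ElementTree` does not preserve the DOCTYPE declaration, so that would need to be restored separately if required.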