kuhumcst / DanNet

The Danish WordNet as an RDF graph.
https://wordnet.dk
MIT License

Add Supersenses to CoNLL-U file #144

Closed simongray closed 2 months ago

simongray commented 2 months ago

(the final part of #141, split out into its own task)

Also: need to write a bit of documentation about how this result was achieved.

simongray commented 2 months ago

One issue I have run into: since I have split some senses that were appearing in multiple synsets, these no longer resolve using the old sense IDs. For example, Aserbajdsjan is both a country and the people of that country.

The only way to resolve this is to compare the definition too.
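A minimal sketch of what such a lookup could look like, assuming an index from each old sense ID to its candidate synsets and their definitions. The index layout, the IDs, and the definitions below are all hypothetical placeholders, not the actual DanNet data or code:

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical index: old sense ID -> candidate (synset ID, definition) pairs.
# All IDs and definitions here are placeholders for illustration only.
CANDIDATES: Dict[str, List[Tuple[str, str]]] = {
    "sense-1": [
        ("synset-a", "a country"),
        ("synset-b", "the people of that country"),
    ],
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def resolve(sense_id: str, definition: str) -> Optional[str]:
    """Return the single synset whose definition matches, or None if ambiguous or unknown."""
    wanted = normalize(definition)
    matches = [
        synset_id
        for synset_id, candidate_definition in CANDIDATES.get(sense_id, [])
        if normalize(candidate_definition) == wanted
    ]
    return matches[0] if len(matches) == 1 else None
```

Returning `None` for both "no match" and "more than one match" keeps the ambiguous cases out of the automatic mapping so they can be reviewed by hand.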

simongray commented 2 months ago

Another issue: many synsets do not have supersenses assigned since the mapping only had e.g. a noun supersense, while the group of synsets also included verbs. In such cases no supersenses can be assigned. In the Elexis dataset, this amounts to ~700 synsets.
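To illustrate the mismatch, here is a rough sketch that assigns a supersense only when its part-of-speech prefix agrees with the synset. It assumes WordNet-style labels such as `noun.location`, and the group and synset structures are hypothetical, not the format of the actual mapping:

```python
from typing import Dict, List, Tuple


def assign_supersenses(
    group_supersense: Dict[str, str],
    synsets: Dict[str, Tuple[str, str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Assign a group's supersense to each synset only if the POS agrees.

    group_supersense: hypothetical mapping of group ID -> supersense label.
    synsets: hypothetical mapping of synset ID -> (group ID, part of speech).
    Returns the assignments plus the list of synsets left without a supersense.
    """
    assigned: Dict[str, str] = {}
    unassigned: List[str] = []
    for synset_id, (group, pos) in synsets.items():
        supersense = group_supersense.get(group)
        # The part before the first "." is taken to be the part of speech.
        if supersense is not None and supersense.split(".", 1)[0] == pos:
            assigned[synset_id] = supersense
        else:
            unassigned.append(synset_id)
    return assigned, unassigned
```

With a group mapped only to `noun.location`, a verb synset in that same group ends up in the unassigned list, which is the situation described above.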

simongray commented 2 months ago

These remaining synsets have now been added in bca42f8b3f1dd4669e815b517e68d2bd05e43f46, apart from 30 synsets that have sense IDs but do not exist in the DanNet dataset and whose descriptions are all of the form {hyponymOf someLabel}. Since they reference their hypernyms only by label and not by ID, mapping them programmatically is no easy task and should probably be done manually.

simongray commented 2 months ago

Talked to Bolette. The remaining missing supersenses should not be added directly in the index; instead, a list should be produced based on what actually appears in the Elexis corpus. So the next task is to run through this corpus, collect every ID in use, and then compare that to the sense IDs in DanNet.
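A minimal sketch of that pass over the corpus. CoNLL-U token lines have 10 tab-separated columns, with `key=value` pairs in the final MISC column separated by `|`. That the IDs live in MISC under a key named `SenseID` is purely an assumption made for this example, as are the function names:

```python
from typing import Iterable, List, Set


def sense_ids_in_conllu(lines: Iterable[str], key: str = "SenseID") -> Set[str]:
    """Collect every value stored under `key` in the MISC column of a CoNLL-U file.

    The key name "SenseID" is an assumption; adjust it to the real annotation.
    """
    ids: Set[str] = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators (blank lines) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        misc = cols[9]
        if misc == "_":
            continue  # empty MISC column
        for pair in misc.split("|"):
            name, sep, value = pair.partition("=")
            if sep and name == key:
                ids.add(value)
    return ids


def missing_from_dannet(corpus_ids: Iterable[str], dannet_ids: Iterable[str]) -> List[str]:
    """Return, sorted, the IDs used in the corpus that are not known DanNet sense IDs."""
    return sorted(set(corpus_ids) - set(dannet_ids))
```

For a file on disk, the lines could be passed in with `open(path, encoding="utf-8")`. The sorted output is the list to review by hand.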