hdaSprachtechnologie / odenet

Open German WordNet
Creative Commons Attribution Share Alike 4.0 International

Redundant lexical entries #36

Closed · goodmami closed this issue 2 years ago

goodmami commented 2 years ago

The following 45 lines represent redundant lexical entries. That is, for each synset below, there is more than one lexical entry with the given word form. The duplicated entries should be removed or merged.

odenet-7244-a : neu
odenet-31047-a : voll
odenet-11054-n : Mittelpunkt
odenet-35105-a : typisch
odenet-4125-v : kaum glauben wollen
odenet-4728-v : anbändeln
odenet-31496-a : eventuell
odenet-5211-n : eingearbeitet
odenet-5444-n : Getratsche
odenet-6259-a : ohne Partner
odenet-7556-n : Auftritt
odenet-8265-a : nervlich angespannt
odenet-8265-a : unter Spannung
odenet-9697-v : weinen
odenet-3567-a : hinterfotzig
odenet-15272-n : Frankfurt
odenet-12650-v : firmieren
odenet-14096-n : Taiwan
odenet-14357-v : es gibt keine andere Möglichkeit
odenet-16451-a : es reicht
odenet-18016-a : suizidal
odenet-18549-v : Schnäppchen machen
odenet-20351-v : nicht stimmen
odenet-20686-n : Verwaltungsstab
odenet-22952-v : gering achten
odenet-25645-n : über Wochen
odenet-25810-v : gut bezahlt werden
odenet-26278-v : zusammen sein
odenet-26344-n : Grättimann
odenet-26944-v : Abgang machen
odenet-26984-a : es wird nichts
odenet-27234-n : Jüngste
odenet-27234-n : Jüngste im Bunde
odenet-27265-v : anblaffen
odenet-2171-a : schwer wiegend
odenet-27699-n : tolles Treiben
odenet-28324-n : Einsatzgruppe
odenet-28483-v : richtige Größe
odenet-29286-n : Klappladen
odenet-30780-a : du bist mir
odenet-33494-p : nicht
odenet-34399-n : irgendwelche andere
odenet-35051-a : erfreut sein
odenet-36070-v : emporschießen
odenet-16744-a : erschlagen

For example, here are the lexical entries for the first one, where two entries with the form neu point to the synset odenet-7244-a. In this case the second entry (w30220) should be deleted, since its only sense already appears on w1647.

$ sed -rn '/id="(w1647|w30220)"/,/<\/LexicalEntry>/p' odenet-1.4/deWordNet.xml 
<LexicalEntry id="w1647" dc:type="Basiswortschatz" confidenceScore="1.0">
    <Lemma writtenForm="neu" partOfSpeech="a"/>
    <Sense id="w1647_336-a" synset="odenet-336-a"/>
    <Sense id="w1647_12672-a" synset="odenet-12672-a"/>
    <Sense id="w1647_7244-a" synset="odenet-7244-a"/>
    <Sense id="w1647_28091-a" synset="odenet-28091-a"/>
</LexicalEntry>
<LexicalEntry id="w30220">
    <Lemma writtenForm="neu" partOfSpeech="a"/>
    <Sense id="w30220_7244-a" synset="odenet-7244-a"/>
</LexicalEntry>
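A report like the list above can be generated mechanically. Here is a minimal sketch using only the Python standard library; the element and attribute names follow the WN-LMF excerpt above, and the embedded sample is a simplified version of the two `neu` entries (on the real file one would use `ET.parse("deWordNet.xml").getroot()` instead):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def find_redundant_entries(root):
    """Map (synset id, written form) to the LexicalEntry ids that share it."""
    seen = defaultdict(list)
    for entry in root.iter("LexicalEntry"):
        form = entry.find("Lemma").get("writtenForm")
        for sense in entry.iter("Sense"):
            seen[(sense.get("synset"), form)].append(entry.get("id"))
    # Keep only (synset, form) pairs covered by more than one lexical entry.
    return {key: ids for key, ids in seen.items() if len(ids) > 1}

# Simplified sample modeled on the w1647 / w30220 excerpt above.
sample = """<Lexicon>
  <LexicalEntry id="w1647">
    <Lemma writtenForm="neu" partOfSpeech="a"/>
    <Sense id="w1647_336-a" synset="odenet-336-a"/>
    <Sense id="w1647_7244-a" synset="odenet-7244-a"/>
  </LexicalEntry>
  <LexicalEntry id="w30220">
    <Lemma writtenForm="neu" partOfSpeech="a"/>
    <Sense id="w30220_7244-a" synset="odenet-7244-a"/>
  </LexicalEntry>
</Lexicon>"""

duplicates = find_redundant_entries(ET.fromstring(sample))
```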
rwingerter55 commented 2 years ago

Thank you for spotting this. I found more duplicate entries, see Excel file.

OdeNet Redundant lexical entries.xlsx

The file contains the additional duplicate entries.

goodmami commented 2 years ago

Thanks! I didn't report entries that merely share a lemma and part of speech because, although there are many (as you can see), it's not clear that they are problematic. For instance, the 2021 version of the Open English WordNet also has many apparent duplicates by this measure, but they are now distinguished by pronunciation.

I don't speak German so I cannot verify whether any of these duplicates in OdeNet are problematic. I think it would require looking at definitions, if present, or ILI correspondences, to determine if they are in fact different words.

rwingerter55 commented 2 years ago

From what I see, most if not all of them should be deleted. Maybe the less frequent entries can be marked as duplicates for the time being? I think we should reverse the burden of proof and ask why a duplicate should be kept.

fcbond commented 2 years ago

I think if there is no distinguishing information (like different pronunciation), then it would probably be good to merge them. In this case it would make sense to mark the entry with the least information as the duplicate (although I guess if you list them as pairs, then it doesn't matter).


rwingerter55 commented 2 years ago

@fcbond, I will see what I can do, but it won't be until next week.

hdaSprachtechnologie commented 2 years ago

The validation script already contains a test for duplicate lexical entries: find_duplicate_lexentries. In principle, duplicates shouldn't be kept, but it is often unclear which entry to keep, so manual inspection is needed. I have done some work on these, but it takes a lot of time...

We have worked on some entries with different spelling; that work is described in this paper: Declerck, Thierry, Bajcetic, Lenka, and Siegel, Melanie (2020). Adding Pronunciation Information to Wordnets. In: Proceedings of the Workshop on Multimodal Wordnets (MMWN-2020), pp. 39–44.

hdaSprachtechnologie commented 2 years ago

I have resolved the 45 cases above. The reason they were there is that OpenThesaurus had comments in brackets as part of the lexicon entries, and I deleted those comments automatically, which left identical forms behind.

goodmami commented 2 years ago

In the latest version it looks like one was missed:

$ sed -rn '/id="(w13313|w117091)"/,/<\/LexicalEntry>/p' ../odenet/odenet/wordnet/deWordNet.xml 
<LexicalEntry id="w13313">
    <Lemma writtenForm="typisch" partOfSpeech="a"/>
    <Sense id="w13313_2862-a" synset="odenet-2862-a"/>
    <Sense id="w13313_4782-a" synset="odenet-4782-a"/>
    <Sense id="w13313_7616-a" synset="odenet-7616-a"/>
    <Sense id="w13313_35105-a" synset="odenet-35105-a">
    <SenseRelation relType="pertainym" target="odenet-7616-a"/>
    </Sense>
</LexicalEntry>
<LexicalEntry id="w117091">
    <Lemma writtenForm="typisch" partOfSpeech="a"/>
    <Sense id="w117091_35105-a" synset="odenet-35105-a">
    <SenseRelation relType="pertainym" target="odenet-7616-a"/>
    </Sense>
</LexicalEntry>

The second is wholly subsumed by the first, so it can be removed instead of merged.
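Whether an entry can simply be deleted rather than merged can also be checked mechanically: if every synset the second entry covers is already covered by the first, nothing is lost by dropping it. A rough sketch (sense relations are not compared here, so a real check would need to include them; the sample is a simplified version of the w13313 / w117091 pair above):

```python
import xml.etree.ElementTree as ET

def synsets_of(entry):
    return {sense.get("synset") for sense in entry.iter("Sense")}

def is_subsumed(keeper, candidate):
    """True if every synset of `candidate` is already covered by `keeper`."""
    return synsets_of(candidate) <= synsets_of(keeper)

sample = """<Lexicon>
  <LexicalEntry id="w13313">
    <Lemma writtenForm="typisch" partOfSpeech="a"/>
    <Sense id="w13313_7616-a" synset="odenet-7616-a"/>
    <Sense id="w13313_35105-a" synset="odenet-35105-a"/>
  </LexicalEntry>
  <LexicalEntry id="w117091">
    <Lemma writtenForm="typisch" partOfSpeech="a"/>
    <Sense id="w117091_35105-a" synset="odenet-35105-a"/>
  </LexicalEntry>
</Lexicon>"""

keeper, candidate = ET.fromstring(sample).findall("LexicalEntry")
removable = is_subsumed(keeper, candidate)
```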

rwingerter55 commented 2 years ago

The Excel file below contains a table with a proposal for which entries to keep and which to merge (columns "EntryID" and "merge with").

Taking up the suggestion by @fcbond, I computed a score for each lexical entry (counting properties such as partOfSpeech, confidenceScore, and number of senses). The idea is to keep the LexicalEntry with the highest score and add information from the duplicate entries to it. If more than one entry has the highest score, we keep the one with the lowest EntryID.

Merging redundant lexical entries.xlsx

See Excel file for more details.
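The scoring idea above can be sketched as follows. The exact properties counted are an assumption on my part (the spreadsheet is authoritative), and the tie-break assumes ids always look like `w1647`, as in the excerpts above:

```python
import xml.etree.ElementTree as ET

def entry_score(entry):
    """Count recorded properties: extra entry attributes (e.g. confidenceScore),
    Lemma attributes, and the number of senses."""
    score = len(entry.attrib) - 1          # don't count the mandatory id
    score += len(entry.find("Lemma").attrib)
    score += len(entry.findall("Sense"))
    return score

def pick_preferred(entries):
    """Highest score wins; ties go to the lowest numeric EntryID."""
    return max(entries, key=lambda e: (entry_score(e), -int(e.get("id")[1:])))

sample = """<Lexicon>
  <LexicalEntry id="w1647" confidenceScore="1.0">
    <Lemma writtenForm="neu" partOfSpeech="a"/>
    <Sense id="w1647_336-a" synset="odenet-336-a"/>
    <Sense id="w1647_7244-a" synset="odenet-7244-a"/>
  </LexicalEntry>
  <LexicalEntry id="w30220">
    <Lemma writtenForm="neu" partOfSpeech="a"/>
    <Sense id="w30220_7244-a" synset="odenet-7244-a"/>
  </LexicalEntry>
</Lexicon>"""

preferred = pick_preferred(ET.fromstring(sample).findall("LexicalEntry"))
```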

rwingerter55 commented 2 years ago

Please do not merge yet, I will check the proposed merges manually.

rwingerter55 commented 2 years ago

An updated version of the Excel file is attached below. It now includes dc:description (an attribute of LexicalEntry) and how to deal with it when merging duplicate lexical entries.

The table of duplicate entries in the Excel file contains the columns "EntryID" and "merge with". They tell us which entries to merge.

Merging an entry (EntryID) X with a preferred entry (prefEntryID) Y means:

Column "Keep description" (0|1) tells us whether to keep the description of the prefEntry (1) or not (0). Descriptions (if any) of entry X are not transferred to the prefEntry.
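Put together, a merge step along these lines might look as follows. This is a guess at the procedure from the discussion, not the actual script: the attribute is dc:description in the real file (shortened to `description` in the sample so it parses without the namespace declaration), and the synset `odenet-99999-n` is invented for illustration.

```python
import xml.etree.ElementTree as ET

def merge_into(lexicon, preferred, duplicate, keep_description=True):
    """Move senses of `duplicate` whose synsets `preferred` lacks, drop the
    duplicate, and honor the "Keep description" flag on the preferred entry."""
    covered = {s.get("synset") for s in preferred.findall("Sense")}
    for sense in duplicate.findall("Sense"):
        if sense.get("synset") not in covered:
            preferred.append(sense)
    if not keep_description:
        preferred.attrib.pop("description", None)
    lexicon.remove(duplicate)   # descriptions of the duplicate are never transferred

sample = """<Lexicon>
  <LexicalEntry id="w100" description="preferred entry">
    <Lemma writtenForm="Jüngste" partOfSpeech="n"/>
    <Sense id="w100_27234-n" synset="odenet-27234-n"/>
  </LexicalEntry>
  <LexicalEntry id="w200">
    <Lemma writtenForm="Jüngste" partOfSpeech="n"/>
    <Sense id="w200_27234-n" synset="odenet-27234-n"/>
    <Sense id="w200_99999-n" synset="odenet-99999-n"/>
  </LexicalEntry>
</Lexicon>"""

lexicon = ET.fromstring(sample)
preferred, duplicate = lexicon.findall("LexicalEntry")
merge_into(lexicon, preferred, duplicate, keep_description=True)
```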

Merging redundant lexical entries v2.xlsx

@hdaSprachtechnologie, do you agree with the proposal?

hdaSprachtechnologie commented 2 years ago

Is the idea to delete all entries that have anything written in column "merge with"?

rwingerter55 commented 2 years ago

That's right. And note column "Keep description", see my remarks in the Excel file.

hdaSprachtechnologie commented 2 years ago

I have now deleted these. There are 600 duplicate lexical entries left. duplicated_lexentries.txt

rwingerter55 commented 2 years ago

The remaining entries have different parts of speech assigned. This cannot be resolved without further analysis. I will see what I can do.
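One way to surface such cases is to group entries by written form and report forms whose entries disagree on part of speech; a sketch under the same assumed WN-LMF layout as the excerpts above (erschlagen appears in the list above as an adjective; the verb entry in the sample, with synset `odenet-9999-v`, is invented for illustration):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def pos_conflicts(root):
    """Map written form -> set of partOfSpeech values, for forms with > 1 value."""
    by_form = defaultdict(set)
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        by_form[lemma.get("writtenForm")].add(lemma.get("partOfSpeech"))
    return {form: pos for form, pos in by_form.items() if len(pos) > 1}

sample = """<Lexicon>
  <LexicalEntry id="w1">
    <Lemma writtenForm="erschlagen" partOfSpeech="a"/>
    <Sense id="w1_16744-a" synset="odenet-16744-a"/>
  </LexicalEntry>
  <LexicalEntry id="w2">
    <Lemma writtenForm="erschlagen" partOfSpeech="v"/>
    <Sense id="w2_9999-v" synset="odenet-9999-v"/>
  </LexicalEntry>
</Lexicon>"""

conflicts = pos_conflicts(ET.fromstring(sample))
```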

rwingerter55 commented 2 years ago

The solution required checking the part of speech and sense of each entry for correctness. Since there is no guideline on part of speech for multiword entries, I only dealt with single-term entries.

The attachment contains a table indicating which entries to remove (column "Remove", 0|1).

Duplicates with different PoS.xlsx

hdaSprachtechnologie commented 2 years ago

solved