Closed: goodmami closed this issue 2 years ago
Thank you for spotting this. I found more duplicate entries, see Excel file.
OdeNet Redundant lexical entries.xlsx
The file contains:
Thanks! I didn't report entries that merely share the same lemma and part of speech, because it's not clear that those are problematic, although, as you can see, there are many of them. For instance, the 2021 version of the Open English WordNet also has many apparent duplicates by this criterion, but they are now distinguished by pronunciation, such as:
I don't speak German so I cannot verify whether any of these duplicates in OdeNet are problematic. I think it would require looking at definitions, if present, or ILI correspondences, to determine if they are in fact different words.
From what I see, most if not all of them should be deleted. Maybe the less frequent entries can be marked as duplicates for the time being? I think we should reverse the burden of proof and ask why a duplicate should be kept.
I think if there is no distinguishing information (like different pronunciation), then it would probably be good to merge them. In this case it would make sense to mark the entry with the least information as the duplicate (although I guess if you list them as pairs, then it doesn't matter).
On Fri, Oct 15, 2021 at 4:56 AM rwingerter55 @.***> wrote:
From what I see, most if not all of them should be deleted. Maybe the less frequent entries can be marked as duplicates for the time being?
-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University
@fcbond, I will see what I can do, but it won't be until next week.
The validation script already contains a test for duplicate lexical entries: find_duplicate_lexentries. In principle, these shouldn't be kept. However, it is often not clear which entry should be kept, so manual inspection is needed. I have done some work on these, but it takes a lot of time...
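For readers unfamiliar with the check: the actual find_duplicate_lexentries lives in the project's validation script, but the idea can be sketched as grouping entries by (writtenForm, partOfSpeech) over the WN-LMF XML. This is only an illustration; the element and attribute names mirror the deWordNet.xml excerpt quoted later in this thread, and the function body is mine, not the script's.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def find_duplicate_lexentries(path):
    """Group LexicalEntry ids by (writtenForm, partOfSpeech) and
    return only the groups that contain more than one entry."""
    groups = defaultdict(list)
    for entry in ET.parse(path).getroot().iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        key = (lemma.get("writtenForm"), lemma.get("partOfSpeech"))
        groups[key].append(entry.get("id"))
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Run against the full lexicon this flags every lemma/POS pair carried by two or more LexicalEntry elements.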
We have worked on some entries with different spelling, the work is described in this paper: Declerck, Thierry, Bajcetic, Lenka and Siegel, Melanie (2020). Adding Pronunciation Information to Wordnets. In: Proceedings of the Workshop on Multimodal Wordnets (MMWN-2020), pp. 39–44.
I have solved the 45 issues above. The reason they existed is that openThesaurus had comments in brackets as part of the lexicon entries, and I deleted the comments automatically.
In the latest version it looks like one was missed:
$ sed -rn '/id="(w13313|w117091)"/,/<\/LexicalEntry>/p' ../odenet/odenet/wordnet/deWordNet.xml
<LexicalEntry id="w13313">
<Lemma writtenForm="typisch" partOfSpeech="a"/>
<Sense id="w13313_2862-a" synset="odenet-2862-a"/>
<Sense id="w13313_4782-a" synset="odenet-4782-a"/>
<Sense id="w13313_7616-a" synset="odenet-7616-a"/>
<Sense id="w13313_35105-a" synset="odenet-35105-a">
<SenseRelation relType="pertainym" target="odenet-7616-a"/>
</Sense>
</LexicalEntry>
<LexicalEntry id="w117091">
<Lemma writtenForm="typisch" partOfSpeech="a"/>
<Sense id="w117091_35105-a" synset="odenet-35105-a">
<SenseRelation relType="pertainym" target="odenet-7616-a"/>
</Sense>
</LexicalEntry>
The second is wholly subsumed by the first, so it can be removed instead of merged.
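Such subsumption can also be tested mechanically. The sketch below (function name and approach are mine, not from the validation script) treats entry B as subsumed by entry A when every synset covered by B's senses is also covered by A's senses; it deliberately ignores SenseRelations, which in the excerpt above are identical anyway.

```python
def is_subsumed(entry_a, entry_b):
    """True if every synset of entry_b's senses is also covered by
    entry_a, i.e. entry_b adds no new sense and can simply be removed.
    Note: compares synsets only; SenseRelations are not checked."""
    def synsets(entry):
        return {sense.get("synset") for sense in entry.iter("Sense")}
    return synsets(entry_b) <= synsets(entry_a)
```

With the two "typisch" entries above, is_subsumed(w13313, w117091) holds but not the reverse, so w117091 is the one to drop.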
The Excel file below contains a table proposing which entries to keep and which ones to merge (columns "EntryID" and "merge with").
Taking up the suggestion by @fcbond, I computed a score for each lexical entry (counting properties such as partOfSpeech, confidenceScore, and the number of senses). The idea is to keep the LexicalEntry with the highest score, and add information from duplicate entries to it. If more than one entry has the highest score, we keep the entry with the lowest EntryID.
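A minimal sketch of such a score, for concreteness. The exact property weighting used for the spreadsheet isn't spelled out here, so this version simply counts attributes and senses on an ElementTree LexicalEntry; the tie-break assumes EntryIDs look like "w13313" and compares them numerically.

```python
def entry_score(entry):
    """Score an entry by counting its properties: attributes on the
    entry itself besides the id (e.g. confidenceScore, dc:description),
    attributes on its Lemma, and the number of senses."""
    score = len(entry.attrib) - 1             # don't count the id itself
    score += len(entry.find("Lemma").attrib)  # writtenForm, partOfSpeech
    score += len(entry.findall("Sense"))
    return score

def pick_preferred(entries):
    """Among duplicates, keep the highest score; on a tie, keep the
    lowest EntryID (assumed to be 'w' plus a number)."""
    best = max(entry_score(e) for e in entries)
    tied = [e for e in entries if entry_score(e) == best]
    return min(tied, key=lambda e: int(e.get("id").lstrip("w")))
```

On the two "typisch" entries shown earlier, w13313 (four senses) outscores w117091 (one sense) and is kept.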
Merging redundant lexical entries.xlsx
See Excel file for more details.
Please do not merge yet, I will check the proposed merges manually.
A new version of the Excel file is attached below. It now also covers dc:description (an attribute of the lexical entry) and how to deal with it when merging duplicate lexical entries.
The table of duplicate entries in the Excel file contains the columns "EntryID" and "merge with". They tell us which entries to merge.
Merging an entry (EntryID) X into a preferred entry (prefEntryID) Y means keeping Y, transferring information (such as senses) from X to Y, and then deleting X.
Column "Keep description" (0|1) tells us whether to keep the description of the prefEntry (1) or not (0). Descriptions (if any) of entry X are not transferred to the prefEntry.
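Read together, the two columns describe a merge step that could be sketched as follows. This is my own illustration, not the script used for the spreadsheet: senses of X whose synset Y lacks are moved to Y, X's description is discarded, Y's description is kept or dropped per the flag, and X is removed. For simplicity the description is accessed by the literal attribute key "dc:description"; in a namespace-aware parse the key would be the expanded form.

```python
def merge_entries(lexicon, entry_id, pref_id, keep_description=True):
    """Merge LexicalEntry `entry_id` (X) into `pref_id` (Y) inside a
    Lexicon element, following the rules in the spreadsheet columns."""
    by_id = {e.get("id"): e for e in lexicon.iter("LexicalEntry")}
    x, y = by_id[entry_id], by_id[pref_id]
    covered = {s.get("synset") for s in y.findall("Sense")}
    for sense in x.findall("Sense"):
        if sense.get("synset") not in covered:
            y.append(sense)          # transfer senses Y doesn't have yet
    if not keep_description:
        y.attrib.pop("dc:description", None)
    # X's own description is intentionally not transferred
    lexicon.remove(x)                # assumes X is a direct child of Lexicon
```

After the call, Y carries the union of both entries' synsets and X is gone from the lexicon.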
Merging redundant lexical entries v2.xlsx
@hdaSprachtechnologie, do you agree with the proposal?
Is the idea to delete all entries that have anything written in column "merge with"?
That's right. And note column "Keep description", see my remarks in the Excel file.
I have now deleted these. There are 600 duplicated lexical entries left. duplicated_lexentries.txt
They have different parts of speech assigned. This cannot be resolved without further analysis. I will see what I can do.
The solution required checking the part of speech and sense of each entry for correctness. Since there is no guideline concerning part of speech for multiword entries, I only dealt with single-term entries.
The attachment contains a table indicating which entries to remove (column "Remove", 0|1).
solved
The following 45 lines represent redundant lexical entries. That is, for each synset below, there is more than one lexical entry with the given word form. The duplicated entries should be removed or merged.
For example, here are the lexical entries for the first one, where two entries with the form neu point to the synset odenet-7244-a. In this case the second should be deleted.