omwn / omw-data

This packages up data for the Open Multilingual Wordnet

Duplicates in Tab and LMF files #32

Open ekaf opened 1 year ago

ekaf commented 1 year ago

The tab-file side of this issue is described in nltk_data #194.

The simultaneous presence of both underscore and space variants also yields many duplicates in LMF:

<Synset id="omw-fr-01847978-n" ili="i45053" partOfSpeech="n" members="omw-fr-canard_noir-01847978-n omw-fr-anas_rubripes-01847978-n omw-fr-canard_noir-01847978-n" />
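This kind of duplication inside the `members` attribute can at least be detected and removed mechanically, since the member IDs are literally identical. A minimal sketch (the attribute value is taken from the example above):

```python
# Deduplicate the space-separated members attribute, keeping first occurrences.
members = ("omw-fr-canard_noir-01847978-n "
           "omw-fr-anas_rubripes-01847978-n "
           "omw-fr-canard_noir-01847978-n")

# dict.fromkeys preserves insertion order (Python 3.7+), so this is a
# stable, order-preserving deduplication.
deduped = " ".join(dict.fromkeys(members.split()))
print(deduped)
# → omw-fr-canard_noir-01847978-n omw-fr-anas_rubripes-01847978-n
```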

Additionally, there are many thousands of spurious uppercase forms, which are just errors coming from the upstream wordnets (and may never be remedied):

<Synset id="omw-fr-03625646-n" ili="i55432" partOfSpeech="n" members="omw-fr-aiguilles_à_tricoter-03625646-n omw-fr-Aiguille_à_tricoter-03625646-n omw-fr-aiguille_à_tricoter-03625646-n" />

<Synset id="omw-pt-00369399-n" ili="i37352" partOfSpeech="n" members="omw-pt-Contratura-00369399-n omw-pt-contratura-00369399-n" />

The latter problem is not easy to deal with, since some uppercase forms are allowable.

goodmami commented 1 year ago

Thanks, @ekaf! I wrote a quick script to detect duplicates by looking for multiple entries for the same offset+pos when the lemmas are the same after normalizing by case and replacing `_` with a space, and I was able to replicate your findings. You're right that it's not obvious which form to use when they differ by upper/lower case.
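The actual script is in the PR linked below; as an illustration, the detection idea can be sketched like this (the three-column row layout `offset-pos<TAB>lang:relation<TAB>form` follows the OMW .tab files):

```python
from collections import defaultdict

def normalize(form):
    """Normalize a lemma: case-fold and treat '_' and ' ' as equivalent."""
    return form.casefold().replace("_", " ")

def find_duplicates(rows):
    """Group forms by (offset-pos, normalized lemma); keep groups with > 1 form.

    `rows` are .tab lines of the form: offset-pos<TAB>lang:relation<TAB>form
    """
    groups = defaultdict(set)
    for row in rows:
        offset_pos, _relation, form = row.rstrip("\n").split("\t")
        groups[(offset_pos, normalize(form))].add(form)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}

rows = [
    "01847978-n\tfra:lemma\tcanard noir",
    "01847978-n\tfra:lemma\tcanard_noir",
    "01847978-n\tfra:lemma\tanas rubripes",
]
print(find_duplicates(rows))
# one duplicate group: ('01847978-n', 'canard noir')
```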

It seems there are two courses of action here for fixing the issue:

  1. Correct the original .tab files
  2. Detect and prevent duplicates during the conversion to WN-LMF

I would vote for option (1). @fcbond, what about you?

goodmami commented 1 year ago

I implemented a third check: stripping diacritics. There are many more duplicates of this type, and like the upper/lower case variants, these are hard to fix automatically. For example:

WARNING:tsv-duplicates:duplicate of 00021939-n: 'artefact', 'artéfact'
WARNING:tsv-duplicates:duplicate of 00029378-n: 'évènement', 'événement'
WARNING:tsv-duplicates:duplicate of 00064691-r: 'dument', 'dûment'
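One common way to strip diacritics (my script may differ; this is just an illustrative sketch) is to decompose to NFD, drop combining marks, and recompose:

```python
import unicodedata

def strip_diacritics(form):
    """Remove combining marks: decompose (NFD), drop category Mn, recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", form)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

# The pairs reported above collapse to the same normalized string:
print(strip_diacritics("artéfact"))   # → artefact
print(strip_diacritics("dûment"))     # → dument
print(strip_diacritics("évènement") == strip_diacritics("événement"))  # → True
```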

My method of counting differs from what you reported in https://github.com/nltk/nltk_data/issues/194. Perhaps this is because I count the number of synsets and lemmas affected by duplication (e.g., if a synset had both ad hoc and ad_hoc, there would be 2 affected lemmas). Also, I am counting from the .tab files and not the WN-LMF files; I don't recall if the conversion to WN-LMF already removes some kinds of duplicates.

The total duplicate counts are as follows for the OMW lexicons:

| normalization | synsets | lemmas |
|---------------|--------:|-------:|
| case folding  |   13018 |  26333 |
| underscores   |    2589 |   5249 |
| diacritics    |    6559 |  13305 |

The lemma totals are very close to 2x the synset totals, meaning that there are usually just 2 lemmas that look like duplicates for each category.

Also see https://github.com/omwn/omw-data/pull/33 which has the script I used.

ekaf commented 1 year ago

@goodmami, thanks for raising the interesting question of diacritics. These are less obvious to solve than the easy underscore/space alternation. Actually, the variants without diacritics are very frequent in social web corpora, because many users cannot type the accents on their phone keyboards. So, even though academically incorrect, many unaccented forms are quite well attested. But by analogy with conjugated forms and singular/plural variants, they arguably belong in a morphological component, not in a wordnet of lemmas. So there may also be a case for removing plural forms like the French "journaux":

    03822171-n fra:lemma journal
    03822171-n fra:lemma journal|journau
    03822171-n fra:lemma journaux

Please note that the "journal|journau" form is simply an aberration, like the hallucination that "journal" could be a French verb:

    01746604-v fra:lemma journal

At the easier end of the scale, we find spurious quotes:

    04539203-n fra:lemma Terrarium
    04539203-n fra:lemma terrarium
    04539203-n fra:lemma « terrarium »

All this leads to a broader discussion about how much an aggregator like OMW should meddle with the original wordnets. I would not hesitate to remove obvious mistakes like the underscore/space duplicates. But beyond that, there is room for diverging viewpoints.

Concerning the method for counting duplicates: for each synset, I just take the difference between the length of the original lemma list and the size of the corresponding set of normalized forms.
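That per-synset count can be sketched as follows (so a synset with both "ad hoc" and "ad_hoc" contributes 1 here, versus the 2 affected lemmas in the counting above):

```python
def count_duplicates(lemmas, normalize):
    """Per-synset duplicate count: list length minus normalized-set size."""
    return len(lemmas) - len({normalize(lemma) for lemma in lemmas})

lemmas = ["ad hoc", "ad_hoc", "improvisé"]
print(count_duplicates(lemmas, lambda s: s.casefold().replace("_", " ")))
# → 1  (three lemmas collapse to two normalized forms)
```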

goodmami commented 1 year ago

@ekaf Quick follow-up: thanks for these further examples. Some points:

I think we would need someone with good knowledge of the language and its morphology to fix many of these issues. We could improve the process by identifying potential issues and their solutions, so that a human only needs to select the desired course of action, but I don't think we can get around needing human annotators to fix the upper/lower case, diacritics, and plural problems.

ekaf commented 1 year ago

For words like "Pârvâti", which are transliterated from a foreign alphabet, the lists of possible forms tend to be arbitrary and incomplete. These alternations are orthographic variants and not "duplicates". Dealing with such issues would constitute a fork of the original wordnets, which seems ambitious, considering that OMW still needs to catch up with more timely wordnet data, like the MCR 2016 release (issue #25).

The more realistic approach would be to concentrate on solving the simpler errors in the TSV database, like underscore/spaces and spurious quotes, and then leave it to downstream libraries like Wn or NLTK to check for other duplicates caused by uppercase/lowercase variants or merged synsets (see NLTK #3125).

goodmami commented 1 year ago

In the "Pârvâti" case, I think it might be better to encode them as alternative forms instead of as entirely different lexical entries. What we have now is:

    <LexicalEntry id="fra-Pârvatî-n">
      <Lemma writtenForm="Pârvatî" partOfSpeech="n" />
      <Sense id="fra-Pârvatî-09527560-n" synset="fra-09527560-n" />
    </LexicalEntry>
    <LexicalEntry id="fra-Pārvatī-n">
      <Lemma writtenForm="Pārvatī" partOfSpeech="n" />
      <Sense id="fra-Pārvatī-09527560-n" synset="fra-09527560-n" />
    </LexicalEntry>
    <LexicalEntry id="fra-Pârvâti-n">
      <Lemma writtenForm="Pârvâti" partOfSpeech="n" />
      <Sense id="fra-Pârvâti-09527560-n" synset="fra-09527560-n" />
    </LexicalEntry>

And it seems like this would be an improvement:

    <LexicalEntry id="fra-Pârvatî-n">
      <Lemma writtenForm="Pârvatî" partOfSpeech="n" />
      <Form writtenForm="Pārvatī" />
      <Form writtenForm="Pârvâti" />
      <Sense id="fra-Pârvatî-09527560-n" synset="fra-09527560-n" />
    </LexicalEntry>
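Assuming the duplicate entries have already been grouped, the merge itself is mechanical; a sketch with the standard library (which entry keeps its `<Lemma>` is a human decision; here the first one wins):

```python
import xml.etree.ElementTree as ET

lmf = """<Lexicon>
  <LexicalEntry id="fra-P\u00e2rvat\u00ee-n">
    <Lemma writtenForm="P\u00e2rvat\u00ee" partOfSpeech="n" />
    <Sense id="fra-P\u00e2rvat\u00ee-09527560-n" synset="fra-09527560-n" />
  </LexicalEntry>
  <LexicalEntry id="fra-P\u0101rvat\u012b-n">
    <Lemma writtenForm="P\u0101rvat\u012b" partOfSpeech="n" />
    <Sense id="fra-P\u0101rvat\u012b-09527560-n" synset="fra-09527560-n" />
  </LexicalEntry>
</Lexicon>"""

root = ET.fromstring(lmf)
entries = root.findall("LexicalEntry")
primary = entries[0]
for entry in entries[1:]:
    # Demote the extra entry's lemma to a <Form> on the primary entry,
    # inserted after <Lemma> and before <Sense> to respect WN-LMF element order.
    form = ET.Element("Form", writtenForm=entry.find("Lemma").get("writtenForm"))
    primary.insert(1, form)
    root.remove(entry)

print(ET.tostring(root, encoding="unicode"))
```

A real merge would also need to decide what happens to the demoted entries' `<Sense>` elements when they cover synsets the primary entry lacks.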

But each group of word forms may need to be decided individually, so I think we're in agreement that only the simple duplicate cases can be handled automatically.