nltk / nltk_data

NLTK Data

Thousands of duplicate entries in OMW packages #194

Closed by ekaf 5 months ago

ekaf commented 1 year ago

Both omw and omw-1.4 include many multiword lemmas that appear twice in the same synset: once with a space and once with an underscore as the word separator. For example:

01847978-n  fra:lemma   canard noir
01847978-n  fra:lemma   canard_noir

In NLTK, the canonical word separator is the underscore, so spaces in OMW lemmas are translated to underscores at load time. Without a check for duplicates (https://github.com/nltk/nltk/issues/3125#issue-1599776906), this yields thousands of duplicate entries.

A check for duplicates is in fact performed when the OMW packages are produced, but it compares the raw strings. To catch these cases, the spaces would first need to be translated to underscores before the duplicate check is applied.
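A minimal sketch of what such a packaging-time check could look like. This is not the script actually used to build the OMW packages; the function name `dedupe_tab_lines` is hypothetical, and it assumes the three tab-separated columns of the `.tab` files (synset offset, type, value).

```python
def dedupe_tab_lines(lines):
    """Drop rows that become identical once spaces in the value are underscores.

    Hypothetical helper, assuming rows of the form
    "<offset-pos>\t<type>\t<value>". Comment lines, blank lines and rows
    with fewer than three columns are passed through unchanged.
    """
    seen = set()
    kept = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(parts) < 3:
            kept.append(line)
            continue
        offset_pos, kind, value = parts[0], parts[1], parts[2]
        # Normalize the separator first, then check for duplicates.
        key = (offset_pos, kind, value.strip().replace(" ", "_"))
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return kept


rows = [
    "01847978-n\tfra:lemma\tcanard noir\n",
    "01847978-n\tfra:lemma\tcanard_noir\n",
]
print(dedupe_tab_lines(rows))
```

The sketch keeps whichever variant comes first in the file. Since the spaces are translated to underscores at load time anyway, it should not matter which of the two is retained.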

ekaf commented 1 year ago

After loading omw-1.4 with the wordnet library, the following numbers of duplicates are present in memory, in various OMW dictionaries:

1 duplicates in bul offset
2 duplicates in bul exe
4 duplicates in ell offset
44 duplicates in fin offset
20 duplicates in fra lemma
3597 duplicates in fra offset
29 duplicates in hrv offset
1 duplicates in isl lemma
1 duplicates in isl offset
58 duplicates in ita offset
2 duplicates in ita exe
7 duplicates in ita_iwn def
21 duplicates in jpn offset
38 duplicates in jpn def
29 duplicates in jpn exe
2 duplicates in cat lemma
45 duplicates in cat offset
3 duplicates in eus offset
5 duplicates in glg offset
125 duplicates in spa offset
110 duplicates in ind offset
11 duplicates in zsm offset
197 duplicates in nld offset
42 duplicates in pol offset
6427 duplicates in por offset
2 duplicates in ron offset
11 duplicates in lit offset
35 duplicates in slk offset
4 duplicates in slv lemma
22 duplicates in slv offset

Total: 10895 duplicates
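For reference, counts of this kind can be obtained by tallying surplus entries per key. The sketch below is not the script that produced the numbers above; `count_duplicates` is a hypothetical helper, and it assumes each in-memory dictionary maps a key to a list of values.

```python
from collections import Counter


def count_duplicates(index):
    """Count surplus entries in a mapping of key -> list of values.

    A value that occurs n times under the same key contributes n - 1.
    """
    surplus = 0
    for values in index.values():
        counts = Counter(values)
        surplus += sum(n - 1 for n in counts.values())
    return surplus


# Toy data: one synset whose lemma list contains the same form twice.
toy_index = {"01847978-n": ["canard_noir", "canard_noir", "canard"]}
print(count_duplicates(toy_index), "duplicates in toy index")
```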

We see that the worst numbers are in French and Portuguese, and these are mostly due to spurious uppercase forms found in the upstream wordnets of these languages:

wn-data-por.tab:00369399-n  lemma   Contratura
wn-data-por.tab:00369399-n  lemma   contratura
wn-data-fra.tab:06304059-n  fra:lemma   Nomenclature
wn-data-fra.tab:06304059-n  fra:lemma   nomenclature
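A small sketch for listing such case-only variants, so they can be reviewed rather than removed automatically. The helper name `case_variant_groups` is hypothetical, and the input is assumed to be (offset, lemma) pairs already extracted from a `.tab` file.

```python
from collections import defaultdict


def case_variant_groups(rows):
    """Group lemmas of the same synset that differ only by letter case."""
    by_key = defaultdict(set)
    for offset_pos, lemma in rows:
        by_key[(offset_pos, lemma.casefold())].add(lemma)
    # Keep only the groups that contain more than one distinct spelling.
    return {key: sorted(forms) for key, forms in by_key.items() if len(forms) > 1}


rows = [
    ("00369399-n", "Contratura"),
    ("00369399-n", "contratura"),
    ("06304059-n", "nomenclature"),
]
print(case_variant_groups(rows))
```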

These cases are difficult to solve in the data package, because sometimes both casings could be legitimate. The proper solution would therefore need to come from the upstream wordnet lexicographers, which may well never happen. In the meantime, the more realistic approach is to check for duplicates at load time in the wordnet library, as in NLTK #3126. Such a check is needed anyway to remedy duplicates stemming from other causes, like merged synsets or underscore / space variants.
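To illustrate the idea of a load-time check, here is a simplified sketch. It is not the code from NLTK #3126; `build_lemma_index` is a hypothetical name and the input is assumed to be (offset, lemma) pairs read from a `.tab` file.

```python
from collections import defaultdict


def build_lemma_index(rows):
    """Build offset -> list of lemmas, skipping entries already present."""
    index = defaultdict(list)
    for offset_pos, lemma in rows:
        # Same normalization as at load time: spaces become underscores.
        lemma = lemma.strip().replace(" ", "_")
        if lemma not in index[offset_pos]:
            index[offset_pos].append(lemma)
    return index


rows = [
    ("01847978-n", "canard noir"),
    ("01847978-n", "canard_noir"),
    ("00369399-n", "Contratura"),
    ("00369399-n", "contratura"),
]
print(dict(build_lemma_index(rows)))
```

In this sketch the comparison is case-sensitive, so both casings of a lemma are kept and only exact repeats (after separator normalization) are dropped.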

On the other hand, it could eventually make sense to also remove the underscore / space duplicates from the OMW data packages, even though this would only solve a part of the problem.

ekaf commented 5 months ago

Fixed in NLTK #3126