moreymat / omw-graph

The Open Multilingual Wordnet in a graph database
MIT License

Redundancy in Wordnet #9

Status: Closed (zorgulle closed 10 years ago)

zorgulle commented 10 years ago

We have to check for and remove redundant (duplicate) entries in the wordnet files.

Example: in wn-data-fra, the following entry occurs twice:
09014850-n fre:lemma Chișinău

We also have to verify that the redundancy is not intentional.

rhin0cer0s commented 10 years ago

I fixed this issue in the quickCleaner function of run.sh:

sort -u "$file" | sponge "$file"
sed -i 1i"name:string:key\tname:string:key\ttype" "$file"
 ...
sort -u "$file" | sponge "$file"
sed -i 1i"name:string:key\tvalue" "$file"

This sorts the file, removes duplicate lines, and adds the header.

This adds a dependency, because sponge does not seem to be a standard tool. I can use another approach in bash to make it work.

There has to be another way to deal with duplicates (during the build process of the .csv ...), but I'm not sure it would be faster.

moreymat commented 10 years ago

When the wordnet file is read, you could maintain a set of already seen keys and write to the csv file only if the key is new. We already have quite a few dependencies, so I would prefer to avoid adding a new one (even if sponge seems nice).
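
A minimal sketch of that idea, assuming the reader yields rows as tuples and the output is tab-separated like the files above. The language (Python) and the name write_unique_rows are assumptions made for illustration, not code from the repository:

```python
import csv
import io


def write_unique_rows(rows, out, key=None):
    """Write rows to `out` as tab-separated lines, skipping already-seen keys.

    `key` maps a row to the value used for duplicate detection; by default
    the whole row is the key. Returns the number of rows written.
    """
    seen = set()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    written = 0
    for row in rows:
        k = tuple(row) if key is None else key(row)
        if k in seen:
            continue  # duplicate: already written once
        seen.add(k)
        writer.writerow(row)
        written += 1
    return written


if __name__ == "__main__":
    # Hypothetical input: the same entry read twice from the wordnet file.
    rows = [
        ("09014850-n", "fre:lemma", "Chișinău"),
        ("09014850-n", "fre:lemma", "Chișinău"),
    ]
    buf = io.StringIO()
    print(write_unique_rows(rows, buf))  # number of rows actually written
    print(buf.getvalue(), end="")
```

Unlike the sort-based cleanup, this keeps the original order of the rows and needs no external tool, at the cost of holding every key in memory.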

moreymat commented 10 years ago

I just checked your example on Chisinau. The two entries in the LMF and tab files are not duplicates, because the spelling differs slightly: "Chișinău" has an s with comma below ("s virgule souscrite"), whereas "Chişinău" has an s with cedilla ("s cédille"). More information (in French): http://fr.wikipedia.org/wiki/%C5%9E#Roumain
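
The difference is hard to see on screen but easy to check programmatically. The sketch below uses Python's standard unicodedata module to name the characters that differ; the helper differing_characters is a hypothetical name introduced only for this illustration:

```python
import unicodedata

# The two spellings, written with explicit escapes so the difference is visible.
COMMA_BELOW = "Chi\u0219in\u0103u"  # U+0219: s with comma below
CEDILLA = "Chi\u015Fin\u0103u"      # U+015F: s with cedilla


def differing_characters(a, b):
    """Return (index, name_in_a, name_in_b) for each position where a and b differ.

    Only compares up to the length of the shorter string.
    """
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


if __name__ == "__main__":
    print(COMMA_BELOW == CEDILLA)  # the strings are not equal
    for index, name_a, name_b in differing_characters(COMMA_BELOW, CEDILLA):
        print(index, name_a, "vs", name_b)
    # Unicode normalization does not merge the two letters either.
    nfkc_a = unicodedata.normalize("NFKC", COMMA_BELOW)
    nfkc_b = unicodedata.normalize("NFKC", CEDILLA)
    print(nfkc_a == nfkc_b)
```

Because the two strings differ at the code point level, an exact-match deduplication (such as sort followed by uniq, or a set of seen strings) treats them as distinct entries.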

Outcome:

I assign this issue to 0.2.

rhin0cer0s commented 10 years ago

We do not use tab files anymore. The duplicate problems are now fixed thanks to lexical ids, which are unique and easier to use.

moreymat commented 10 years ago

Great. Marked a posteriori as a bug in design (my fault), fixed by using lexical ids from the LMF files as keys.

Assigned back to 0.1 and closed.