globalwordnet / cili

The Global WordNet Association Collaborative Inter-Lingual Index

Bad mappings in tab file for PWN31 #13

Closed goodmami closed 2 years ago

goodmami commented 2 years ago

Some ILI links in the Turtle file for the PWN31 mapping use skos:closeMatch, but in the corresponding tab file they show up simply as ILI links. These should be removed or fixed somehow.

$ grep closeMatch ../cili/ili-map-wn31.ttl 
ili:i10601 skos:closeMatch pwn31:301940473-s . # objective, documentary
ili:i10765 skos:closeMatch pwn31:301972355-s . # unasked, unsolicited
ili:i13487 skos:closeMatch pwn31:302445314-s . # impossible, insufferable, unacceptable, unsufferable
ili:i14242 skos:closeMatch pwn31:302532138-s . # unintentional, unwilled
ili:i14242 skos:closeMatch pwn31:301342529-s . # unintentional, unwilled
ili:i34549 skos:closeMatch pwn31:202578494-v . # victimize, swindle, rook, goldbrick, nobble, diddle, bunco, defraud, scam, mulct, gyp, gip, hornswoggle, short-change, con
ili:i39442 skos:closeMatch pwn31:100767587-n . # crime, offense, criminal_offense, criminal_offence, offence, law-breaking
ili:i39502 skos:closeMatch pwn31:100781071-n . # bunco, bunco_game, bunko, bunko_game, con, confidence_trick, confidence_game, con_game, gyp, hustle, sting, flimflam
ili:i40396 skos:closeMatch pwn31:100951435-n . # technology, engineering
ili:i40396 skos:closeMatch pwn31:100951878-n . # technology, engineering
ili:i49794 skos:closeMatch pwn31:102675726-n . # accordion, piano_accordion, squeeze_box
ili:i50032 skos:closeMatch pwn31:102716628-n . # annulet
ili:i50032 skos:closeMatch pwn31:102716929-n . # annulet
ili:i50034 skos:closeMatch pwn31:102716929-n . # annulet, roundel
ili:i50034 skos:closeMatch pwn31:102716785-n . # annulet, roundel
ili:i50490 skos:closeMatch pwn31:102799187-n . # barrel_organ, grind_organ, hand_organ, hurdy_gurdy, hurdy-gurdy, street_organ
ili:i56414 skos:closeMatch pwn31:104455013-n . # mousse, hair_mousse, hair_gel
ili:i63228 skos:closeMatch pwn31:107059027-n . # forte, fortissimo
ili:i63228 skos:closeMatch pwn31:107059160-n . # forte, fortissimo
ili:i67468 skos:closeMatch pwn31:105828731-n . # technicality, trifle, triviality
ili:i72354 skos:closeMatch pwn31:106836790-n . # umlaut, dieresis, diaeresis
ili:i73625 skos:closeMatch pwn31:107048857-n . # plainsong, plainchant, Gregorian_chant
ili:i89444 skos:closeMatch pwn31:110020122-n . # deist, freethinker
ili:i89454 skos:closeMatch pwn31:110021663-n . # democrat, populist
ili:i89516 skos:closeMatch pwn31:110032289-n . # dietician, dietitian, nutritionist
ili:i90722 skos:closeMatch pwn31:110230422-n . # interior_designer, designer, interior_decorator, house_decorator, room_decorator, decorator
ili:i102993 skos:closeMatch pwn31:112599160-n . # mung, mung_bean, green_gram, golden_gram, Vigna_radiata, Phaseolus_aureus
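As a rough illustration of the distinction the tab file loses, here is a minimal Python sketch that groups mapping lines by predicate. It assumes every mapping line has the shape shown in the grep output above; the names `LINE_RE` and `links_by_predicate` are made up for this example and are not part of CILI.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the grep output in this issue:
#   ili:i10601 skos:closeMatch pwn31:301940473-s . # objective, documentary
# The first digit after "pwn31:" is dropped, as in the sed commands in this thread.
LINE_RE = re.compile(
    r"^ili:(i\d+) (owl:sameAs|skos:closeMatch) pwn31:\d(\d{8}-[a-z]) \."
)

def links_by_predicate(lines):
    """Group (ILI, offset-pos) pairs by their Turtle predicate."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            ili, predicate, synset = match.groups()
            grouped[predicate].append((ili, synset))
    return dict(grouped)

sample = [
    "ili:i7301 owl:sameAs pwn31:301342529-s . # unintentional, unplanned, unwitting",
    "ili:i14242 skos:closeMatch pwn31:301342529-s . # unintentional, unwilled",
    "ili:i39442 skos:closeMatch pwn31:100767587-n . # crime, offense",
]
grouped = links_by_predicate(sample)
print(grouped["owl:sameAs"])
print(grouped["skos:closeMatch"])
```

Keeping the predicate alongside each pair makes it possible to decide later whether closeMatch links are dropped or written out differently.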

It looks like the synsets with skos:closeMatch links usually have ili="in" in the OEWN:

$ grep 'id="oewn-00767587-n"' english-wordnet-2021.xml 
    <Synset id="oewn-00767587-n" ili="in" members="oewn-offense-n oewn-offence-n" partOfSpeech="n" dc:subject="noun.act">

I tried to determine if this is always the case:

$ # get offset-pos for ILI mappings with "closeMatch" (remove superfluous leading digit on IDs)
$ grep 'closeMatch' ../cili/ili-map-wn31.ttl | sed 's/.*pwn31:[0-9]\([^ ]*\) .*/\1/' | sort -n > closematches
$ # get offset-pos for synsets with ili="in"
$ grep 'ili="in"' english-wordnet-2021.xml | sed 's/.*id="oewn-\([^"]*\)".*/\1/' | sort -n > ins
$ # OEWN has many introduced ILIs not part of this set, so show those only in ILI mapping
$ comm -13 ins closematches
01342529-s
02532138-s
02716628-n
02716929-n
$ # Further inspect those synsets in CILI
$ grep '01342529-s\|02532138-s\|02716628-n\|02716929-n' ../cili/ili-map-wn31.ttl 
ili:i7301 owl:sameAs pwn31:301342529-s . # unintentional, unplanned, unwitting
ili:i13987 owl:sameAs pwn31:302532138-s . # unwilled
ili:i14242 skos:closeMatch pwn31:302532138-s . # unintentional, unwilled
ili:i14242 skos:closeMatch pwn31:301342529-s . # unintentional, unwilled
ili:i50032 skos:closeMatch pwn31:102716628-n . # annulet
ili:i50032 skos:closeMatch pwn31:102716929-n . # annulet
ili:i50033 owl:sameAs pwn31:102716628-n . # annulet, bandelet, bandelette, bandlet, square_and_rabbet
ili:i50034 skos:closeMatch pwn31:102716929-n . # annulet, roundel

So for 3 of these 4 synsets, the closeMatch link is parallel to a sameAs link from another ILI to the same synset. The remaining synset (02716929-n) has only closeMatch links, from two ILIs (i50032 and i50034), and no sameAs link. That synset does have ili="in"; it wasn't excluded above because closematches lists it twice, once per ILI, and comm only cancelled the first instance against ins. Digging into that one further:
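The duplicate-line behaviour of comm can be sketched in Python. This is only an illustration under the assumption that `comm -13` on sorted input behaves like a multiset difference; the function names `comm_13` and `only_in_second` are invented for the example.

```python
from collections import Counter

def comm_13(file1_lines, file2_lines):
    """Mimic `comm -13` on sorted input: each line of file 1 cancels
    at most one identical line of file 2 (multiset difference)."""
    leftover = Counter(file2_lines) - Counter(file1_lines)
    return sorted(leftover.elements())

def only_in_second(file1_lines, file2_lines):
    """Set difference: duplicates in file 2 cannot survive."""
    return sorted(set(file2_lines) - set(file1_lines))

# Both synsets below have ili="in" in OEWN according to this issue;
# 02716929-n appears twice because two ILIs link to it.
ins = ["00767587-n", "02716929-n"]
closematches = ["00767587-n", "02716929-n", "02716929-n"]

print(comm_13(ins, closematches))         # ['02716929-n']
print(only_in_second(ins, closematches))  # []
```

In shell, deduplicating closematches first (for example with `sort -u`) should have the same effect as the set-based version.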

$ # Only closeMatch links for i50032 and i50034 for PWN31
$ grep i5003[24] ../cili/ili-map-wn31.ttl 
ili:i50032 skos:closeMatch pwn31:102716628-n . # annulet
ili:i50032 skos:closeMatch pwn31:102716929-n . # annulet
ili:i50034 skos:closeMatch pwn31:102716929-n . # annulet, roundel
ili:i50034 skos:closeMatch pwn31:102716785-n . # annulet, roundel
$ # Only sameAs links for both in PWN30
$ grep i5003[24] ../cili/ili-map-wn30.ttl 
<i50032>    owl:sameAs  pwn30:02713769-n . # annulet
<i50034>    owl:sameAs  pwn30:02713992-n . # annulet, roundel

We don't have a precedent or a good way to add introduced ILIs to the mapping. Assigning ili="in" is done for a wordnet project, and here we should generate new IDs. So I suppose all of these closeMatch cases should simply be dropped from the .tab files?

fcbond commented 2 years ago

OK, I rebuilt the mapping (ili-map-pwn31.tab) as follows:

grep sameAs ../cili/ili-map-wn31.ttl | cut -d ' ' -f1,3 --output-delimiter=$' ' | sed s/pwn31:[12345]// | sed s/ili://| sort -nk 1.2 > ili-map-pwn31.tab

This keeps only the sameAs links.

goodmami commented 2 years ago

Somehow the file had null (\0) delimiters instead of tab delimiters between the fields. The following command produces tabs:

$ sed -rn '/owl:sameAs/{s/ili:([^ ]*) owl:sameAs pwn31:[1-5]([^ ]*) .*/\1\t\2/;p}' ili-map-wn31.ttl | sort -nk 1.2 > ili-map-pwn31.tab

The sort command is unnecessary; the same results are obtained without it as the Turtle file is already sorted in this order. But it's also good to be explicit.

I'll check in the new file.
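For comparison, a minimal Python sketch of the same conversion, which writes the tab delimiter explicitly. It assumes the line shape seen in ili-map-wn31.ttl above; `SAME_AS_RE` and `same_as_rows` are hypothetical names, and this is not the script used to build the checked-in file.

```python
import re

# Same idea as the sed expression: keep the ILI ID and the offset-pos,
# dropping the "ili:" prefix and the leading digit after "pwn31:".
SAME_AS_RE = re.compile(r"^ili:(i\d+) owl:sameAs pwn31:[1-5](\S+) ")

def same_as_rows(ttl_lines):
    """Return tab-separated 'ILI<TAB>offset-pos' rows for owl:sameAs links only."""
    rows = []
    for line in ttl_lines:
        match = SAME_AS_RE.match(line)
        if match:
            rows.append(match.groups())
    # Order by the numeric part of the ILI ID, like `sort -nk 1.2`.
    rows.sort(key=lambda row: int(row[0][1:]))
    return ["\t".join(row) for row in rows]

sample = [
    "ili:i13987 owl:sameAs pwn31:302532138-s . # unwilled",
    "ili:i14242 skos:closeMatch pwn31:302532138-s . # unintentional, unwilled",
    "ili:i7301 owl:sameAs pwn31:301342529-s . # unintentional, unplanned, unwitting",
]
for row in same_as_rows(sample):
    print(row)
```

The closeMatch line is skipped because the pattern only matches owl:sameAs.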

fcbond commented 2 years ago

I checked with ediff-buffers, don't know how that crept in.

Thanks for the fix.

(Sent by email in reply to Michael Wayne Goodman's comment of Thu, Nov 4, 2021, 7:45 AM.)



goodmami commented 2 years ago

Yeah, not sure. I tried out your commands and it seemed to work. Strange.
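Since the cause of the null delimiters is unclear, a quick check of the generated file might help catch a recurrence. The sketch below assumes each row should be exactly two non-empty tab-separated fields; `bad_rows` is a hypothetical helper, not something in the CILI repository.

```python
def bad_rows(lines):
    """Return (line_number, line) pairs that contain a NUL character or are
    not exactly two non-empty tab-separated fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if "\0" in line or len(fields) != 2 or not all(fields):
            problems.append((number, line))
    return problems

sample = [
    "i7301\t01342529-s\n",
    "i13987\x0002532138-s\n",  # NUL where the tab should be
]
print(bad_rows(sample))
```

To run it on a real file, something like `bad_rows(open("ili-map-pwn31.tab", encoding="utf-8"))` should work, assuming the file is UTF-8 text.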