omwn / omw-data

This packages up data for the Open Multilingual Wordnet

Duplicate sense IDs from upper/lower case distinctions #17

Closed goodmami closed 2 years ago

goodmami commented 2 years ago

In the Python wndb2lmf.py script, the sense keys of lexical entries distinguished only by upper/lower case letters are the same. Since the sense keys are now used as sense IDs, this is a problem. It happens for all letters a-z and a handful of other entries:

$ xmlstarlet val -e -d etc/WN-LMF-1.1.dtd build/omw-1.4/wn30/wn30.xml 
...
build/omw-1.4/wn30/wn30.xml:65535.0: ID wn30-a__1.10.00.. already defined
build/omw-1.4/wn30/wn30.xml:65535.0: ID wn30-b__1.10.00.. already defined
build/omw-1.4/wn30/wn30.xml:65535.0: ID wn30-baroque__3.01.00.. already defined
build/omw-1.4/wn30/wn30.xml:65535.0: ID wn30-c__1.10.00.. already defined
build/omw-1.4/wn30/wn30.xml:65535.0: ID wn30-d__1.10.00.. already defined

To be clear, there are now two <LexicalEntry> elements for each of these, one for each case distinction, and they point to the same synset with senses having the same ID, as shown above.
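The collision can be reproduced in a few lines. This is only a sketch: the helper names `sense_key` and `sense_id` are hypothetical, and the escaping rule (`%` to `__`, `:` to `.`) is inferred from the IDs in the validator output above rather than taken from wndb2lmf.py itself.

```python
from collections import Counter


def sense_key(lemma: str, ss_type: int, lex_filenum: int, lex_id: int) -> str:
    """Build a WNDB-style sense key (hypothetical helper).

    The lemma is lowercased, as in index.sense, so case is lost here.
    The head_word and head_id fields are left empty.
    """
    return f"{lemma.lower()}%{ss_type}:{lex_filenum:02d}:{lex_id:02d}::"


def sense_id(prefix: str, key: str) -> str:
    """Turn a sense key into an XML ID (escaping inferred from the output above)."""
    return f"{prefix}-" + key.replace("%", "__").replace(":", ".")


# Both words of synset 02974023 carry lex_id 0 in data.adj.
words = [("baroque", 0), ("Baroque", 0)]
ids = [sense_id("wn30", sense_key(word, 3, 1, lex_id)) for word, lex_id in words]

# Any ID that occurs more than once would fail DTD validation.
duplicates = [i for i, count in Counter(ids).items() if count > 1]
print(duplicates)
```

Because the case distinction is dropped before the key is built, the two entries cannot be told apart unless the lexid differs.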

In the Open English Wordnet 2021 (and maybe earlier, I didn't check), the "lexid" part of the sense key differs for these:

$ grep 'id="oewn-baroque__3\.01' english-wordnet-2021.xml 
      <Sense id="oewn-baroque__3.01.00.." synset="oewn-02985568-a">
      <Sense id="oewn-baroque__3.01.01.." synset="oewn-02985568-a">
$ #                                 ^^ these are the lexids

However these are not the lexids given by the WNDB files:

$ grep -P '^baroque%3' etc/WordNet-3.0/dict/index.sense 
baroque%3:01:00:: 02974023 2 0
$ grep -P '^02974023' etc/WordNet-3.0/dict/data.adj 
02974023 01 a 02 baroque 0 Baroque 0 002 \ 15259076 n 0201 \ 15259076 n 0101 | of or relating to or characteristic of the elaborately ornamented style of architecture, art, and music popular in Europe between 1600 and 1750
$ #                      ^         ^  the lexid of both is 0

Note: the situation is the same in WN 3.1, and also in the output from gwn-scala-api:

$ grep '"baroque%3:01' ../gwn-scala-api/wn30.xml 
      <Sense id="pwn30-baroque-a-02974023-01" synset="pwn30-02974023-a" dc:identifier="baroque%3:01:00::">
      <Sense id="pwn30-Baroque-a-02974023-02" synset="pwn30-02974023-a" dc:identifier="baroque%3:01:00::">
$ #                    ^ sense ID differs by case                              lexids are over here ^^

@jmccrae, do you recall when/where you changed the lexids for these? I wonder if there's some principled way to do it, or if I should just increment the lexid whenever I see a duplicate?
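For reference, the "just increment it" option might look something like the sketch below. The function name `dedupe_lex_ids` is hypothetical and this is not how OEWN is known to have assigned its lexids; it only shows the naive approach.

```python
def dedupe_lex_ids(keys):
    """Return sense keys with the lex_id bumped on any key already seen.

    Naive sketch: the result depends on input order, and a bumped key
    could clash with a genuine key that appears later in the input.
    """
    seen = set()
    result = []
    for key in keys:
        lemma, rest = key.split("%", 1)
        # Fields: ss_type, lex_filenum, lex_id, head_word, head_id
        fields = rest.split(":")
        while key in seen:
            fields[2] = f"{int(fields[2]) + 1:02d}"
            key = lemma + "%" + ":".join(fields)
        seen.add(key)
        result.append(key)
    return result


# One key per lexical entry ("baroque" and "Baroque"), both with lex_id 00.
print(dedupe_lex_ids(["baroque%3:01:00::", "baroque%3:01:00::"]))
```

The order dependence is the reason a principled rule would be preferable: which of the two entries keeps lexid 00 is decided only by which one is processed first.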

@fcbond once this is fixed, it's another thing to document as a difference from PWN.

fcbond commented 2 years ago

it is discussed a little here: https://github.com/globalwordnet/english-wordnet/issues/153

But I am not sure if the fix was done automatically or by hand (or documented).

I agree that this is a bug in PWN, and agree it is good to fix it, and in a compatible way with OEWN.

goodmami commented 2 years ago

Thanks @fcbond! I was digging through some OEWN issues but I must've missed that one.

> I agree that this is a bug in PWN

Actually I'm not sure it is a bug in PWN unless those lexids should in fact differ, since (I think) WNDB doesn't really have the same conception of lexical entry as WN-LMF does. I don't think we should try to emulate the case insensitivity though, and should instead just try to do what is correct for WN-LMF.

fcbond commented 2 years ago

Sense keys are meant to be stable unique identifiers, so I am pretty sure it is a bug if two senses have the same key, ...


-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami commented 2 years ago

Note: in a separate chat we decided it might be easiest to just revert to putting the sense key in the dc:identifier attribute on <Sense> elements. This way they wouldn't have to be perfectly unique.
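A rough sketch of what that could look like, assuming a sense ID scheme modeled on the gwn-scala-api output shown earlier (lemma with case preserved, part of speech, offset, rank). The `make_sense` helper and the exact ID format are illustrative assumptions, not what the script necessarily does.

```python
import xml.etree.ElementTree as ET


def make_sense(prefix, lemma, pos, offset, rank, key):
    """Build a <Sense> element whose ID does not depend on the sense key.

    Hypothetical ID scheme: prefix-lemma-pos-offset-rank. The sense key
    goes in dc:identifier, where repeated values do not break validation.
    A real script would also need to escape lemma characters that are not
    valid in XML IDs.
    """
    attributes = {
        "id": f"{prefix}-{lemma}-{pos}-{offset}-{rank:02d}",
        "synset": f"{prefix}-{offset}-{pos}",
        "dc:identifier": key,
    }
    return ET.Element("Sense", attributes)


senses = [
    make_sense("wn30", lemma, "a", "02974023", rank, "baroque%3:01:00::")
    for rank, lemma in enumerate(["baroque", "Baroque"], start=1)
]
ids = [sense.get("id") for sense in senses]
print(ids)
```

The two senses share a sense key but get distinct IDs, so the duplicate-ID errors reported by the validator would no longer arise.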

fcbond commented 2 years ago

And then we will follow NLTK's current practice of just returning a random sense when two senses have the same key.
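A lookup along those lines might be sketched as follows. This does not reproduce NLTK's code; the function names are hypothetical, and "random" is approximated here by returning whichever sense was indexed first.

```python
from collections import defaultdict


def index_by_key(senses):
    """Map each sense key to the list of sense IDs that carry it."""
    index = defaultdict(list)
    for sense_id, key in senses:
        index[key].append(sense_id)
    return index


def sense_for_key(index, key):
    """Return one sense ID for the key (the first indexed), or None."""
    matches = index.get(key)
    return matches[0] if matches else None


# Illustrative (sense ID, sense key) pairs; the IDs are hypothetical.
senses = [
    ("wn30-baroque-a-02974023-01", "baroque%3:01:00::"),
    ("wn30-Baroque-a-02974023-02", "baroque%3:01:00::"),
]
index = index_by_key(senses)
print(sense_for_key(index, "baroque%3:01:00::"))
```

Keeping the full list in the index means a caller that needs every sense for a key can still get them.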
