Closed goodmami closed 2 years ago
It is discussed a little here: https://github.com/globalwordnet/english-wordnet/issues/153
But I am not sure if the fix was done automatically or by hand (or documented).
I agree that this is a bug in PWN, and agree it is good to fix it in a way that is compatible with OEWN.
Thanks @fcbond! I was digging through some OEWN issues but I must've missed that one.
> I agree that this is a bug in PWN
Actually I'm not sure it is a bug in PWN unless those lexids should in fact differ, since (I think) WNDB doesn't really have the same conception of lexical entry as WN-LMF does. I don't think we should try to emulate the case insensitivity though, and should instead just try to do what is correct for WN-LMF.
Sense keys are meant to be stable unique identifiers, so I am pretty sure it is a bug if two senses have the same key, ...
Note: in a separate chat we decided it might be easiest to just revert to putting the sense key in the dc:identifier attribute on <Sense> elements. This way they wouldn't have to be perfectly unique.
And then we will follow NLTK's current practice of just returning a random sense when two senses have the same key.
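A minimal sketch of what that lookup could look like, assuming senses carry a unique ID plus a sense key that may repeat (standing in for dc:identifier). The `Sense` class, the function names, the IDs, and the key values are all invented for illustration; this is not the API of NLTK or any other library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    id: str         # unique sense ID (hypothetical format)
    sense_key: str  # stands in for dc:identifier; allowed to repeat


def build_key_index(senses):
    """Map each sense key to the list of senses that carry it."""
    index = defaultdict(list)
    for sense in senses:
        index[sense.sense_key].append(sense)
    return dict(index)


def sense_from_key(index, key):
    """Return one sense for the key, or None if the key is unknown.

    When several senses share a key, which one comes back is arbitrary
    (here, simply the first one indexed).
    """
    matches = index.get(key, [])
    return matches[0] if matches else None


# Invented data: two entries differing only by case share one sense key.
senses = [
    Sense("demo-X-n-01", "x%1:00:00::"),
    Sense("demo-x-n-01", "x%1:00:00::"),
]
index = build_key_index(senses)
found = sense_from_key(index, "x%1:00:00::")
```

Because the key lives outside the ID, duplicates no longer break ID uniqueness; they only make the key-based lookup ambiguous.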
In the Python wndb2lmf.py script, the sense keys of lexical entries distinguished only by upper/lower case letters are the same. Since the sense keys are now used as sense IDs, this is a problem. It happens for all letters a-z and a handful of other entries:
To be clear, there are now two <LexicalEntry> elements for each of these, one for each case distinction, and they point to the same synset with senses having the same ID, as shown above.
In the Open English Wordnet 2021 (and maybe earlier, I didn't check), the "lexid" part of the sense key differs for these:
However these are not the lexids given by the WNDB files:
Note: the situation is the same in WN 3.1, and also in the output from gwn-scala-api:
@jmccrae, do you recall when/where you changed the lexids for these? I wonder if there's some principled way to do it, or whether I should just increment it when I see a duplicate.
@fcbond once this is fixed, it's another thing to document as a difference from PWN.
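For reference, the "increment it when I see a duplicate" option could be sketched roughly as follows. This assumes the usual sense key layout (lemma%ss_type:lex_filenum:lex_id:head_word:head_id, with the lemma lowercased) and ignores the head fields for brevity. The function names and the example values are hypothetical; this is not the logic used by wndb2lmf.py or OEWN.

```python
def make_sense_key(lemma, ss_type, lex_filenum, lex_id):
    """Build a sense key with empty head_word/head_id fields.

    The lemma is lowercased, which is why entries that differ only by
    case end up with the same key.
    """
    return f"{lemma.lower()}%{ss_type}:{lex_filenum:02d}:{lex_id:02d}::"


def assign_unique_keys(entries):
    """Return one sense key per entry, bumping lex_id on collisions.

    entries: iterable of (lemma, ss_type, lex_filenum, lex_id) tuples.
    """
    seen = set()
    keys = []
    for lemma, ss_type, lex_filenum, lex_id in entries:
        key = make_sense_key(lemma, ss_type, lex_filenum, lex_id)
        while key in seen:  # duplicate: try the next lex_id
            lex_id += 1
            key = make_sense_key(lemma, ss_type, lex_filenum, lex_id)
        seen.add(key)
        keys.append(key)
    return keys


# Invented example: two entries differing only by case.
keys = assign_unique_keys([("A", 1, 10, 0), ("a", 1, 10, 0)])
```

One caveat with this approach is that the result depends on processing order: whichever entry is seen first keeps the original lexid, and a bumped lexid could in turn displace a later entry that legitimately had that value.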