omwn / omw-data

This packages up data for the Open Multilingual Wordnet

PWN 3.0 and 3.1 #5

Closed arademaker closed 3 years ago

arademaker commented 3 years ago

In https://github.com/globalwordnet/schemas/issues/49, I learned that the verb group and similarTo relations were merged in the GWA schemas. But does it make sense to keep calling the wordnets that result from this merge PWN? I would argue that the resulting data is not PWN anymore, because we changed it... @goodmami ? @fcbond ?

goodmami commented 3 years ago

I think you have a point, @arademaker. If we revert to verb group, this presents a challenge for me in Wn, as I don't have a good way to deal with relations that are no longer part of LMF, but I can figure something out. For the OMW data, there are some other ways the data may differ (now or in the future) from the WNDB version of PWN 3.0 and 3.1:

I wonder how much harmonization we should strive for?

fcbond commented 3 years ago

I think we should document any divergences in the description but I am ok with not being identical.

arademaker commented 3 years ago

Hi Francis, I did that in the past, and it created a lot of confusion. I know that I have nothing to do with this repository, but with all due respect, I would suggest creating a new name for the modified data.

On the other hand, some changes are only related to the concrete data encoding and not related to the data model or data instances: items 1 and 3 from the @goodmami list above?

Another one can be the replacement of underscore by space in the lexical forms. After all, the underscores are an encoding artifact of the data into text files where spaces separate fields.
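To make the point about underscores concrete, here is a minimal Python sketch of the conversion in both directions. The function names are hypothetical and are not taken from any existing converter; the sketch also assumes that no lexical form contains a literal underscore, since otherwise the round trip would not be exact.

```python
def wndb_lemma_to_form(lemma: str) -> str:
    """Turn a WNDB-style lemma (underscores) into a plain written form."""
    return lemma.replace("_", " ")


def form_to_wndb_lemma(form: str) -> str:
    """Inverse direction: spaces become underscores again.

    Assumption: the written form contains no literal underscores.
    """
    return form.replace(" ", "_")


# Round trip for a multiword entry.
form = wndb_lemma_to_form("hot_dog")
assert form == "hot dog"
assert form_to_wndb_lemma(form) == "hot_dog"
```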

I also believe that the LMF DTD must be capable of encoding the PWN 3.0 and 3.1 as they are. At least some version of the DTD. The justification is that those datasets are pretty important, and they are stable references for many other works.

arademaker commented 3 years ago

Thinking twice about it, I realized that the two approaches have problems. If we make a copy of the PWN and make changes, how many changes will justify calling it another resource?

One point of view is that calling it another thing can be dishonest if one changes nothing or almost nothing and seeks to take credit for work that others have done. On the other hand, calling it another thing can be a way to avoid attributing to the Princeton team any error introduced by the changes.

For instance, in our repository, since the structure of the OWN-PT relies on the PWN, we created our own copy of PWN in RDF and called it OWN-EN (https://github.com/own-pt/openWordnet-PT/issues/168). In our case, one extra reason for calling it OWN-EN is the plans we have to make changes in the English data and expand it.

The license page (https://wordnet.princeton.edu/license-and-commercial-use) does not say much to help us...

goodmami commented 3 years ago

One possibility is to change the LMF so instead of:

...
  <Lexicon id="pwn"
           label="Princeton WordNet 3.0"
           language="en"
           email="fellbaum@princeton.edu"
           license="https://wordnet.princeton.edu/license-and-commercial-use"
           citation="Christiane Fellbaum. (ed.) (1998) *WordNet: An Electronic Lexical Database*, MIT Press"
           version="3.0"
...

we have (changes on id, label, email, and version):

...
  <Lexicon id="pwn30"
           label="Princeton WordNet 3.0 (Open Multilingual Wordnet release)"
           language="en"
           email="bond@ieee.org"
           license="https://wordnet.princeton.edu/license-and-commercial-use"
           citation="Christiane Fellbaum. (ed.) (1998) *WordNet: An Electronic Lexical Database*, MIT Press"
           version="omw+1.3"
...

This would also help with a versioning problem I have in Wn when we make changes to the PWN, because currently the version wouldn't change.
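As a rough illustration of why the id and version matter, the sketch below reads them from a Lexicon element and joins them into the `id:version` form used elsewhere in this thread (e.g. `pwn:3.0`). It uses only Python's standard `xml.etree.ElementTree`; the `SAMPLE` document is a trimmed stand-in for a real LMF file (no DOCTYPE, no citation attribute, no entries), and `lexicon_specifiers` is a hypothetical helper, not a function from Wn.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for an LMF file; a real file carries a
# DOCTYPE, more attributes, and the lexical entries and synsets.
SAMPLE = """<LexicalResource>
  <Lexicon id="pwn30"
           label="Princeton WordNet 3.0 (Open Multilingual Wordnet release)"
           language="en"
           email="bond@ieee.org"
           license="https://wordnet.princeton.edu/license-and-commercial-use"
           version="omw+1.3"/>
</LexicalResource>"""


def lexicon_specifiers(xml_text: str) -> list[str]:
    """Return an 'id:version' string for every Lexicon element."""
    root = ET.fromstring(xml_text)
    return [f"{lex.get('id')}:{lex.get('version')}" for lex in root.iter("Lexicon")]


print(lexicon_specifiers(SAMPLE))
```

With the proposed attributes, a change made by OMW shows up as a new version string even though the upstream release number (3.0) stays the same.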

fcbond commented 3 years ago

For the OMW, we are assuming that we will make some changes --- part of what we do is harmonize different names.

We should clearly document what we have done, and the changes should ideally be reversible. In this case, if someone wants to call them verb groups, they can just find all similar relations between verbs.

@jmccrae, what do you think?

But a lot of the time I see people not using verb_groups, as they were added late and are not easy to find. I think calling them similar makes much more sense.

ChristianeFellbaum commented 3 years ago

Coming in late--indeed, any versions that incorporate changes from PWN should not and cannot be called PWN. People can do what they want with PWN but they cannot attribute their modifications to us. Nothing has been done in Princeton for quite some time (lack of funding), so PWN has been pretty stable, for better or worse.

fcbond commented 3 years ago

Hi,

for the OMW, in the process of linking we do make some changes: (i) we harmonize names between resources (e.g. calling troponym hypernym for verbs, which is common in interfaces like Perl's query wordnet or most of the Python interfaces), and (ii) we do not always make all the information in PWN available (for example, the subcategorization information). I have tried to make it clear on the web page and in the documentation that we are presenting harmonized versions (and always link back to the original).

I see these as cosmetic changes rather than transformative, so would prefer to still call the resulting resource PWN, but if you prefer I can call it something else, maybe something like OMW English wordnet based on PWN?

jmccrae commented 3 years ago

There is no change in the data other than the change of format. In the GWA standard we serialize Princeton's $ and & as the same relation (similar), but this is merely technical as you can easily go back to these relations by examining the part of speech.

Obviously, any change of format constitutes an alternative version to the one released by Princeton and if @ChristianeFellbaum wishes, we can refer to this format not as 'Princeton WordNet' but just as 'WordNet', but the underlying data is the same. You can generate this data from the Princeton release directly and then round-trip and regenerate the Princeton data with no loss or change of information (except for some extra metadata in the header of the XML).
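The part-of-speech check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and table names are hypothetical, the part-of-speech codes are the usual single letters (`v` for verbs, `a` and `s` for adjectives and adjective satellites), and `$` and `&` are the WNDB pointer symbols for verb group and similar to.

```python
# Forward direction: both WNDB pointer symbols map to one relation name.
WNDB_TO_LMF = {"$": "similar", "&": "similar"}


def wndb_pointer_for_similar(source_pos: str) -> str:
    """Recover the WNDB pointer symbol of a 'similar' relation from the
    part of speech of the synset it starts from."""
    if source_pos == "v":
        return "$"  # verb group
    if source_pos in ("a", "s"):
        return "&"  # similar to
    raise ValueError(f"'similar' is not expected for part of speech {source_pos!r}")


# The mapping loses nothing as long as the part of speech is kept.
for pos in ("v", "a", "s"):
    symbol = wndb_pointer_for_similar(pos)
    assert WNDB_TO_LMF[symbol] == "similar"
```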

goodmami commented 3 years ago

I think the conclusion here is that, unless we get explicit permission to do otherwise, we should rename the PWN resources packaged in OMW.

While calling them the "Princeton WordNet" 3.0 and 3.1 is nice because it's a claim that they are faithful reproductions of the original data in a new format, in reality there are numerous unintended differences due to bugs in the conversion process. Furthermore, even a perfect conversion could be considered a derivative.

I think we can do as @jmccrae suggests and just drop the "Princeton" and call it "WordNet" (in the XML, the lexicon IDs are wn30 and wn31 and the versions are local: 1.4+omw). Then I like @fcbond's suggestion to document the differences somewhere.

Regarding @jmccrae's claim that "You can generate this data from the Princeton release directly and then round-trip and regenerate the Princeton data with no loss or change of information (except for some extra metadata in the header of the XML)":

I haven't tested it, but this may be true because the existing known bugs (e.g., #8, #9, #10) didn't result in a loss of information. #11 is peculiar because LMF doesn't encode exceptions for words not in the wordnet anyway, so I guess we'd assume that the exception lists would be side-loaded when regenerating the WNDB data.

arademaker commented 3 years ago

Maybe it would still be relevant to have pwn30 and pwn31 in the Wn library as they come from Princeton, with no changes. We now have many alternative forks of them: the one from the OMW wordnets, the English Wordnet, our OWN-EN, the English version from the Polish team, etc.

goodmami commented 3 years ago

@arademaker the pwn:3.0 and pwn:3.1 used by the Wn library are exactly the ones provided here, bugs and all. One goal of mine is to fix bugs in the conversion so the data more closely resembles the original Princeton data, but I still think we should call it something different for reasons discussed above. As for the existing resources loaded by the library, I'm not yet sure what to do there. Possibly I will deprecate them and when someone tries to do wn.download("pwn:3.0") it instead loads the new, e.g., wn30:1.4+omw resource with some kind of warning message.
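One way such a redirect could look is sketched below. This is not code from Wn: `RENAMED` and `resolve_specifier` are hypothetical names, the target specifiers are the provisional ones mentioned in this thread, and the only library used is Python's standard `warnings` module.

```python
import warnings

# Hypothetical table of old specifiers and their replacements.
RENAMED = {
    "pwn:3.0": "wn30:1.4+omw",
    "pwn:3.1": "wn31:1.4+omw",
}


def resolve_specifier(spec: str) -> str:
    """Map a deprecated specifier to its replacement, warning the caller.

    Specifiers that are not in the table are returned unchanged.
    """
    new_spec = RENAMED.get(spec)
    if new_spec is None:
        return spec
    warnings.warn(
        f"{spec!r} is deprecated; loading {new_spec!r} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_spec
```

A download function could call `resolve_specifier` first and then proceed with the returned value, so existing scripts keep working while users are told about the new name.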

fcbond commented 3 years ago

Hi,

In fact, in the original OMW, I had made many fixes to the definitions, and still called it PWN, but going forward I agree we need to be clearer.

ChristianeFellbaum commented 3 years ago

Hi everyone,

the name "WordNet" is trademarked by Princeton and cannot and should not be used by other, similar databases. Some projects use "wordnet", which reflects that the design of PWN was copied.

I also want to reiterate that (the Princeton) WordNet is and will be the one that Princeton has released. We want to be responsible for all and only our own errors/shortcomings and not those of others. That was the intent behind trademarking the name while making the data freely available. Anybody can make changes, create new databases, etc., but the result cannot be called Princeton WordNet or anything else with WordNet in it.

Best, Christiane


goodmami commented 3 years ago

@ChristianeFellbaum thank you for the clarification. In that case, I think we can go with Francis's suggestion: the OMW English Wordnet Based on the Princeton WordNet 3.0, as it makes clear that this is a separate resource while also noting its provenance.

fcbond commented 3 years ago

We have documented the changes we made, and made it clear that this is not the original wordnet in the latest release.

some differences: