openWordnet-PT - Githubissues

goodmami / wn

A modern, interlingual wordnet interface for Python

https://wn.readthedocs.io/

MIT License

openWordnet-PT #97

Closed arademaker closed 2 years ago

arademaker commented 3 years ago

I would like to make OWN-PT data available to be directly indexed by this module instead of indexed as part of the OMW... Can you give me directions?

goodmami commented 3 years ago

That would be great. Here are the steps:

1. Format your wordnet in WN-LMF 1.0 or 1.1 (1.1 support is coming soon, but isn't in yet).
2. Ensure the id and version attributes on the WN-LMF file's <Lexicon> element are appropriate for the release.
3. Publish the file online somewhere. For OdeNet I helped Melanie set up a GitHub action that validates the file and uploads it as an asset to a GitHub release. If you have your own server or file host, that is fine; just ensure it has a stable URL.
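
As a quick check of step 2, the id and version attributes can be read back from the file before publishing. The following is a minimal sketch using only the Python standard library; the sample document is a hypothetical, heavily trimmed WN-LMF file (the email and license values are placeholders), and lexicon_specifiers is an illustrative helper, not part of Wn.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed WN-LMF document. A real file also carries a
# DOCTYPE declaration plus the lexical entries and synsets themselves.
SAMPLE = """\
<LexicalResource xmlns:dc="http://purl.org/dc/elements/1.1/">
  <Lexicon id="own-pt" label="OpenWordnet-PT" language="pt"
           email="maintainer@example.org"
           license="https://creativecommons.org/licenses/by/4.0/"
           version="1.0.0">
  </Lexicon>
</LexicalResource>
"""


def lexicon_specifiers(xml_text):
    """Return 'id:version' for every <Lexicon> element in the document."""
    root = ET.fromstring(xml_text)
    return [f"{lex.get('id')}:{lex.get('version')}" for lex in root.iter("Lexicon")]


print(lexicon_specifiers(SAMPLE))
```

The id:version form mirrors the specifiers that Wn prints when it adds a lexicon (for example own-pt:1.0.0 later in this thread).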

Once that's done I can add an index entry to Wn so it can be downloaded more easily. From the OMW, the project and lexicon identifier is porwn (Francis takes the ISO639-3 code and appends wn). If you want to use something else, like own-pt, then maybe we can change it for the OMW, too? What do you think about this, @fcbond?

Also, it's not required, but the more things that are linked (accurately) to ILI the better. This is especially true if you rely on, e.g., PWN for synset relations.

goodmami commented 3 years ago

Also if you want to include files like your project README, LICENSE, or citation.bib, you can distribute the wordnet as a package directory (we also do this with OdeNet). There's some documentation at https://wn.readthedocs.io/en/latest/guides/lexicons.html, and I'm happy to answer questions.

fcbond commented 3 years ago

I would encourage you to go the package directory route, as I think it is good to include the README.

I guess I can change the name in the OMW release. This makes it clear that the new lexicon is an upgraded version.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/goodmami/wn/issues/97#issuecomment-773807731, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIPZRTMRUN5H7H7JJBILU3S5OBMZANCNFSM4XC6SJHA .

-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami commented 3 years ago

I guess I can change the name in the OMW release. This makes it clear that the new lexicon is an upgraded version.

Yes, it is the choice of the wordnet maintainer what id and version they choose when self-publishing, but I guess the question is whether the OMW will stick with its current ID/version or adopt the upstream one. The trouble with having different IDs is that they might look like different resources rather than just different versions. The trouble with harmonizing the IDs is what to do with the old one in Wn's index. I could "deprecate" the old one in some way with a warning, then proceed as if the user had used the new ID. But currently there is no functionality like that.
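
To make the "deprecate with a warning, then proceed with the new ID" idea concrete, here is a minimal sketch. It is not Wn functionality (the comment above says none exists); the REDIRECTS table and resolve_project_id are hypothetical names, and the porwn to own-pt mapping simply reuses the identifiers mentioned earlier in the thread.

```python
import warnings

# Hypothetical redirect table: old project ID -> the ID it was renamed to.
REDIRECTS = {"porwn": "own-pt"}


def resolve_project_id(requested, redirects=REDIRECTS):
    """Return the current ID for `requested`, warning if it was renamed."""
    new_id = redirects.get(requested)
    if new_id is None:
        # Not a deprecated ID: use it unchanged.
        return requested
    warnings.warn(
        f"project ID {requested!r} is deprecated; using {new_id!r} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_id
```

Calling resolve_project_id("porwn") would emit a DeprecationWarning and return "own-pt", while any ID not in the table passes through untouched.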

vcvpaiva commented 3 years ago

I'd suggest really renaming as "porwn" will give all the wrong vibes. Thanks!

goodmami commented 3 years ago

@vcvpaiva agreed! Unfortunately this isn't really up to me, but to the OWN-PT or OMW maintainers (who are both subscribed to this issue). That said, I do have write-access to the omw-data repository whence Wn gets the OMW lexicons, so if the maintainers agree I could effect that change.

arademaker commented 3 years ago

anyway, we are preparing the LMF-XML of OWN-PT for closing this issue...

arademaker commented 3 years ago

Yes, I prefer own-pt as identifier.

goodmami commented 3 years ago

@arademaker Great, then just ensure your <Lexicon> element has id="own-pt" as an attribute. If you're planning on releasing and distributing it yourself (e.g., as is currently done with EWN and OdeNet) I can simply link to the published URL with the new identifier. Or if you're planning on packaging the new version with the OMW, I'll see about making those changes.

arademaker commented 3 years ago

I will make a release in our repository. I will let you know when ready or you can also watch https://github.com/own-pt/openWordnet-PT or the issue linked above. Thank you @goodmami

arademaker commented 2 years ago

Hi @goodmami can you check the draft of my 1.0.0 release? The wn package was included as an extra artifact of the release. See https://github.com/own-pt/openWordnet-PT/releases. I am including two XML. One based on the PWN and the other the OWN-PT. I am calling them OWN-EN and OWN-PT.

arademaker commented 2 years ago

I just learned that you can't see drafts, so I published it as a pre-release: https://github.com/own-pt/openWordnet-PT/releases/tag/v1.0.0-alpha

vcvpaiva commented 2 years ago

Alexandre, two suggestions:

The Open Wordnet for Portuguese first official release tagged as 1.0.0 but accumulating all work done since our first publication. ==> The Open Wordnet for Portuguese first official release in new format, hence tagged as 1.0.0. Includes all the work done since 2012.
Including and based on Princeton WordNet 3.0 (including the Morphosemantic Links from the Standoff files https://wordnet.princeton.edu/download) ==> Including and based on Princeton WordNet 3.0 (including Morphosemantic Links from the Standoff files https://wordnet.princeton.edu/download)

Remove "the", as this is our interpretation of the morphosemantic links.

Best, Valeria

goodmami commented 2 years ago

Thanks, now I see it. It downloads and imports without issue:

>>> import wn
>>> wn.download('https://github.com/own-pt/openWordnet-PT/releases/download/v1.0.0-alpha/own.tar.gz')
Download [##############################] (14924942/14924942 bytes) Complete
Added own-pt:1.0.0 (OpenWordnet-PT)
Added own-en:1.0.0 (OpenWordnet-EN)

PosixPath('/home/goodmami/.wn_data/downloads/65646cb71b01af18d11e33eb1c5ef6a47d8c6b54')
>>> wn.synsets('São Paulo', lang='pt')  # original lemma
[Synset('own-pt-synset-08857529-n'), Synset('own-pt-synset-11225661-n')]
>>> wn.synsets('Sao Paulo', lang='pt')  # normalized diacritics
[Synset('own-pt-synset-08857529-n'), Synset('own-pt-synset-11225661-n')]
>>> wn.synsets('São Paulo', lang='pt')[0].translate(lang='en')  # test ILI translation to English
[Synset('ewn-08876521-n'), Synset('pwn-08857529-n'), Synset('own-en-synset-08857529-n')]
>>> wn.synsets('São Paulo', lang='pt')[0].translate(lang='en')[0].lemmas()
['Sao Paulo']

Some comments:

Only packaging your OWN-EN and OWN-PT lexicons together gives users no choice in which ones to download. If they just want the OWN-PT lexicon without the other one, they should be able to do so. What we did with the omw-data release (https://github.com/bond-lab/omw-data/releases) was offer both the individual wordnets and the full set as separate downloads.
It helps a bit (but is not necessary) to publish an index.toml file with your release (see omw-data's release, again), especially if you have multiple lexicons. If you package OWN-PT, OWN-EN, and ... OWN-ALL (?), they each need an entry with a unique project name.
Why are you packaging a version of the PWN at all? Is it just to include morphosemantic links? If so, maybe we can just ~~rebuild the version packaged with OMW~~ add them to the Open English Wordnet, ~~but we should ask Christiane if we want to call it the Princeton WordNet~~ (cannot).

The OWN-EN has gained and lost some words compared to ~~PWN 3.0~~ OMW's English wordnet derived from the PWN:

>>> wn.lexicons(lang='en')
(<Lexicon ewn:2020 [en]>, <Lexicon pwn:3.0 [en]>, <Lexicon own-en:1.0.0 [en]>)
>>> pwn30 = wn.Wordnet('pwn:3.0')
>>> own_en = wn.Wordnet('own-en')
>>> missing = {w.lemma() for w in pwn30.words()} - {w.lemma() for w in own_en.words()}
>>> gained = {w.lemma() for w in own_en.words()} - {w.lemma() for w in pwn30.words()}
>>> len(missing)
916
>>> len(gained)
417

Most or all of these seem to be bugs in how the ~~OWN's PWN~~ OMW's English lexicon was created, as it includes the adjposition in adjectives' lemmas (see bond-lab/omw-data#8):

>>> list(missing)[:20]
['guardant(ip)', 'healing(p)', 'lone(a)', 'live(a)', 'astir(p)', 'down(a)', 'prior(a)', 'motivative(a)', 'smelling(p)', 'maternal(p)', 'aghast(p)', 'anaesthetic(a)', 'on the offensive(p)', 'au naturel(p)', 'indisposed(p)', 'privy(p)', 'right-hand(a)', 'trespassing(a)', 'great(p)', 'unbeholden(p)']
>>> [w.lemma() for w in own_en.words() if w.lemma().endswith(')')]  # None in OWN-EN with (p), (a), (ip) in lemma
[]

But that's not the whole story. This will require some more digging. Notice to @fcbond: OMW needs to rebuild its PWN lexicons.
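
One way to see how much of the difference is explained by those markers is to strip them before comparing the lemma sets. This is a minimal sketch under the assumption that the markers always appear as a trailing (a), (p), or (ip); strip_adjposition is an illustrative helper, not part of Wn.

```python
import re

# Matches a trailing adjective-position marker such as "(a)", "(p)", or "(ip)".
ADJPOSITION_SUFFIX = re.compile(r"\((?:a|p|ip)\)$")


def strip_adjposition(lemma):
    """Remove a trailing (a), (p), or (ip) marker from a lemma, if present."""
    return ADJPOSITION_SUFFIX.sub("", lemma)


# A few of the "missing" lemmas listed above.
sample = ["guardant(ip)", "healing(p)", "lone(a)", "on the offensive(p)"]
print([strip_adjposition(lemma) for lemma in sample])
```

Applying strip_adjposition to each lemma from pwn30.words() before taking the set differences would leave only the words that differ for other reasons.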

goodmami commented 2 years ago

One more thing:

It's ok that the version is 1.0.0 (although isn't 1.0 enough? Do you plan to release frequently enough that a third-level version specifier is necessary?), because the ID is changing. Before it was released with OMW with the id as porwn and the version 1.3+omw, so this is effectively a new resource. You might just explain the history somewhere.

arademaker commented 2 years ago

Hi @goodmami, thank you for the feedback. I am addressing all of it in the 1.0.0 release. The only remaining relevant topic is:

Why are you packaging a version of the PWN at all? Is it just to include morphosemantic links? If so, maybe we can just ~rebuild the version packaged with OMW~ add them to the Open English Wordnet, ~but we should ask Christiane if we want to call it the Princeton WordNet~ (cannot).

Indeed, our OWN distribution of English and Portuguese intersects with @fcbond's OMW distribution of wordnets. At some point, I would like to discuss with @fcbond a closer collaboration. My primary motivation is to have English data as support for the Portuguese data, but enhancing EN data independently from the @jmccrae's https://github.com/globalwordnet/english-wordnet or any other fork of Princeton data. On the other hand, we could make our Portuguese data always part of OMW, but we would miss the ability to control releases independently. This discussion is also related to the ILI initiative and my still not clear understanding of its evolution.

goodmami commented 2 years ago

My primary motivation is to have English data as support for the Portuguese data, but enhancing EN data independently from the @jmccrae's https://github.com/globalwordnet/english-wordnet or any other fork of Princeton data.

I think I can speak for Francis as well as myself in saying that we have zero interest in maintaining or developing another fork of the Princeton WordNet. Our work in updating the OMW English Wordnet is only for compatibility with the old versions of the PWN, and we won't be fixing or augmenting them further than that. The only maintenance we plan to do is fixing remaining bugs in the format conversion. If you need to develop another fork for your work, that's fine, and we also really appreciate the issues your group has raised.

This discussion is also related to the ILI initiative and my still not clear understanding of its evolution.

With the advent of the ILI, there is no longer any need to keep using the old English wordnets as the ILI IDs are stable across versions, unlike the WNDB offsets. OMW's non-English wordnets can now link with the Open English Wordnet, or even with each other without any English wordnet (although they are still mostly devoid of their own synset relations, so they aren't terribly useful without some English wordnet).
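
The idea of linking through the ILI rather than through offsets can be illustrated with a toy lookup. In this sketch the synset IDs are taken from the session earlier in this thread, but the ILI identifier "i-example" is a made-up placeholder and translate is a hypothetical helper, not Wn's implementation.

```python
# Toy data: the synset IDs come from the session earlier in this thread, but
# "i-example" is a made-up placeholder, not a real ILI identifier.
ILI_OF = {
    "own-pt-synset-08857529-n": "i-example",
    "pwn-08857529-n": "i-example",
    "ewn-08876521-n": "i-example",
}


def translate(synset_id, target_prefix, ili_of=ILI_OF):
    """Find synsets in another lexicon that share the given synset's ILI."""
    ili = ili_of.get(synset_id)
    if ili is None:
        return []
    return sorted(
        other
        for other, other_ili in ili_of.items()
        if other_ili == ili
        and other != synset_id
        and other.startswith(target_prefix)
    )


print(translate("own-pt-synset-08857529-n", "ewn-"))
```

The numeric parts of the pwn and ewn IDs differ, so matching on them would fail; the shared ILI value is what connects the synsets here.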

Finally, as this pertains to Wn, the OWN lexicons do not need to be in OMW to be included in Wn's index (see OdeNet and OEWN, for example). I'm happy to include them in the next release, although I'd hope to be able to install the OWN-PT and OWN-EN lexicons separately.

fcbond commented 2 years ago

I also think that it would be best to have these as two separate packages, so people can download just the wordnet that they want, ...

I think it is useful to have the English lexicon with your fixes available, but most of the time I expect people will just want the Open English Wordnet.

fcbond commented 2 years ago

G'day,

On Sat, Sep 25, 2021 at 2:15 PM Michael Wayne Goodman < @.***> wrote:

My primary motivation is to have English data as support for the Portuguese data, but enhancing EN data independently from the @jmccrae https://github.com/jmccrae's https://github.com/globalwordnet/english-wordnet or any other fork of Princeton data.

I can see the value of this internally, but at least for me, I am trying to move all the changes we have made in the internal wordnet at NTU into the Open English Wordnet, I would prefer to have one good wordnet we can all use!

I think I can speak for Francis as well as myself in saying that we have zero interest in maintaining or developing another fork of the Princeton WordNet. Our work in updating the OMW English Wordnet is only for compatibility with the old versions of the PWN, and we won't be fixing or augmenting them further than that. The only maintenance we plan to do is fixing remaining bugs in the format conversion. If you need to develop another fork for your work, that's fine, and we also really appreciate the issues your group has raised.

This discussion is also related to the ILI initiative and my still not clear understanding of its evolution.

With the advent of the ILI, there is no longer any need to keep using the old English wordnets as the ILI IDs are stable across versions, unlike the WNDB offsets. OMW's non-English wordnets can now link with the Open English Wordnet, or even with each other without any English wordnet (although they are still mostly devoid of their own synset relations, so they aren't terribly useful without some English wordnet).

I am still working on properly realizing the dream of adding new entries to ILI, I hope to make some real progress early next year. I would be happy to have a chat with you (and Michael if he is free) about this online.

Finally, as this pertains to Wn, the OWN lexicons do not need to be in OMW to be included in Wn's index (see OdeNet and OEWN, for example). I'm happy to include them in the next release, although I'd hope to be able to install the OWN-PT and OWN-EN lexicons separately.

I agree that adding them (separately) there is the best option.

vcvpaiva commented 2 years ago

G'day Francis and Michael,

Thanks for copying me in this email.

I agree with Alexandre that I also would prefer not to be tied-up to the Open English Wordnet, if possible. (Version zero of the English WordNet--only removing typos and clear mistakes from PWN seemed easy enough to achieve, but more than that seems complicated.)

I am trying to move all the changes we have made in the internal wordnet at NTU into the Open English Wordnet, I would prefer to have one good wordnet we can all use!

This would be ideal, but the use cases are moving in very different directions, it seems to me.

But it also seems to me more sensible (as you both seem to be saying) to have separate downloads of OWN-PT and OWN-EN.

So if you do have a conversation about it

I would be happy to have a chat with you (and Michael if he is free) about this online.

I'd try to participate, if you will have me.

best, Valeria

arademaker commented 2 years ago

Hi @goodmami, I have just published the v1.0.0 release. Regarding your comments above:

I kept the last digit for now: v1.0.0 and not v1.0. You are probably right regarding the use of the last digit, but I feel that removing the last digit later, if we need to, will be easier than adding it. Anyway, using the full semantic versioning scheme makes it clear for people who know about semantic versioning, and we don't need to add further explanation. Additionally, having an extra zero at the end of the releases, if we end up not creating patches, will not hurt, right?
I now have three tar.gz files: the collection own-XX (where XX is PT and EN), and the wordnets themselves: own-en and own-pt. I also don't anticipate much interest in own-en itself, since https://en-word.net seems to be actively maintained, but again, it doesn't hurt; we just need to be clear about the origin of each dataset.
I added the index.toml file.

Next, I would be surely interested in a conversation with you and @fcbond when possible. We also need to clean up our workflow and automate the release script. If possible, once you test the release, I suspect you can close this issue.

fcbond commented 2 years ago

Hi,

thanks for splitting them,

I would be happy to have a conversation with you and Michael and Valeria. I am generally OK on Mon, Wed, or Fri mornings (your Sun, Tue or Thur afternoon/evening, I think).

vcvpaiva commented 2 years ago

Thanks for the message, Francis! I hope you don't manage to organize it for this week, as tomorrow I have major dental work scheduled. oh well! best Valeria

fcbond commented 2 years ago

On Wed, Oct 6, 2021 at 11:22 AM Valeria de Paiva @.***> wrote:

Thanks for the message, Francis! I hope you don't manage to organize it for this week, as tomorrow I have major dental work scheduled. oh well!

I don't think there is any rush, so next week would be fine. I hope the surgery goes well.

arademaker commented 2 years ago

I believe we can close this, @goodmami, don't you think?

goodmami commented 2 years ago

@arademaker let's leave it open until OWN-PT is actually added to Wn's index.

Also:

1. If the above-mentioned conversation happened, I wasn't involved, so I'm not sure if anything was brought up or resolved.
2. In OWN-PT's 1.0.0 release, I noticed some issues:
   - The LICENSE files for all 3 (OWN collection, OWN-PT package, and OWN-EN package) had the same text, meaning they were all for "OWN-PT" (and not for "OWN-EN", etc., but see (3) below).
   - The index.toml entries all said "Please consult the LICENSE files.", which is less useful than just saying "http://creativecommons.org/licenses/by/4.0/" or "CC-BY 4.0". I can fix this on my end.
   - There were some macOS temporary files in the release (._LICENSE); this is not a blocking issue as Wn just ignores them, but it could be improved.
3. Since OWN-PT is the main project and OWN-EN is something of a sub-project of it, it seems like too big a claim to use just "OWN" as the id for the collection (maybe we'll see an Ossetian, Odia, or Oromo Wordnet someday?), or do you have plans for more wordnets under the OWN umbrella? Fortunately, if we decide to use a different identifier for the collection (e.g., OWN-ALL as suggested above), it just requires a change in the index and not to the released files.
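
For the macOS temporary files, one possible approach is to build the archive with Python's tarfile module and a filter that drops such entries. This is a sketch under assumptions: the directory and file names in the demo are placeholders, and skip_macos_files and make_archive are illustrative helpers rather than part of the OWN-PT release script.

```python
import os
import tarfile
import tempfile


def skip_macos_files(tarinfo):
    """tarfile filter: drop AppleDouble (._*) and .DS_Store entries."""
    name = os.path.basename(tarinfo.name)
    if name.startswith("._") or name == ".DS_Store":
        return None  # returning None excludes the entry from the archive
    return tarinfo


def make_archive(source_dir, archive_path, arcname):
    """Pack source_dir into a gzipped tarball, skipping macOS metadata files."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=arcname, filter=skip_macos_files)


def archive_names(archive_path):
    """List the member names of a gzipped tarball, sorted."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo with a throwaway directory; the file names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    pkg = os.path.join(tmp, "own-pt")
    os.mkdir(pkg)
    for filename in ("own-pt.xml", "LICENSE", "._LICENSE"):
        with open(os.path.join(pkg, filename), "w") as f:
            f.write("placeholder\n")
    out = os.path.join(tmp, "own-pt.tar.gz")
    make_archive(pkg, out, "own-pt")
    names = archive_names(out)
    print(names)
```

The filter only removes such files when they exist on disk; it says nothing about how they ended up in the original release.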

arademaker commented 2 years ago

Hi @goodmami, thank you for always being so vigilant!

1. It didn't happen, and I am afraid I will only be able to discuss plans later this year (end of the semester). Too many projects and courses are going on.
2. In OWN-EN, I was pointing to the PWN license before, but I am now taking OWN-EN as a branch of PWN with its license (it is acceptable under the PWN license anyway):
   - I forgot to change the name "OWN-PT"; I will fix it.
   - In the index.toml, you are right, it could simply link to the Creative Commons website, but referencing the LICENSE file doesn't duplicate information: if the license changes, I only need to change it in one place.
   - OK, I will try to avoid including the macOS files.
3. No plans for other wordnets so far (but who knows?!). Not sure if I take OWN-EN as a subproject of OWN-PT, although it was started with that goal. We can think about better names; OWN-ALL looks strange to me, but it works. Maybe a longer name like OpenWordnets, OpenWordnet-PT, or OpenWordnet-EN?

goodmami commented 2 years ago

Thanks for the responses.

but referencing the LICENSE file doesn't duplicate information; if the license changes, I only need to change it in one place

Wn's documentation doesn't clearly mention this, but the license field may be versioned as well. That is, you can put the license key in the index.toml file under the project, which serves as a default, or under a specific version. Here is how it is retrieved, where the project-level version is used only if the version-level one isn't specified:

https://github.com/goodmami/wn/blob/3411b035ca4be72a6a86629fbc196bf08b5a24d6/wn/_config.py#L146
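
As a rough illustration of that fallback (a sketch with plain dictionaries, not the actual code in wn/_config.py): the `index` layout, the `resolve_license` helper, and the "2.0.0" entry with its license URL are all hypothetical and only show a version-level value overriding the project-level default.

```python
from typing import Optional

# Hypothetical, simplified stand-in for a parsed index.toml project entry.
index = {
    "own-pt": {
        # Project-level license: the default for every version.
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "versions": {
            "1.0.0": {},  # no license key, so the project default applies
            "2.0.0": {"license": "https://example.org/other-license"},  # made-up override
        },
    }
}


def resolve_license(index: dict, project: str, version: str) -> Optional[str]:
    """Return the version-level license if present, else the project-level one."""
    project_info = index[project]
    version_info = project_info["versions"][version]
    return version_info.get("license", project_info.get("license"))


print(resolve_license(index, "own-pt", "1.0.0"))
print(resolve_license(index, "own-pt", "2.0.0"))
```

With this arrangement a project can state its license once and only add a version-level key when a particular release differs.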

Not sure if I take OWN-EN as a subproject of OWN-PT, although it was started with that goal.

Sorry, what I meant is that OWN-EN was created to support OWN-PT and not to be used as a standalone alternative English wordnet. At least, that's how I understand it. Thus, OWN-PT is the primary product of the OWN project and, in this case, claiming the more general own identifier mainly for the own-pt lexicon seemed like too big a grab. However, if Wn gets support for redirects in the index (#142), then it might not be a big deal, assuming no other existing wordnet has a claim to the own identifier.

Maybe a longer name like OpenWordnets, OpenWordnet-PT, or OpenWordnet-EN?

Those seem better for the label field. The id should be short.

goodmami commented 2 years ago

@arademaker, earlier you said, in reference to OWN-EN, the following:

My primary motivation is to have English data as support for the Portuguese data

Can you be more specific about what support it provides? It looks like OWN-PT has ILIs for all its synsets, it has its own synset relations, and there's no <Requires> element on the Lexicon, so it seems like OWN-EN is not actually necessary for using OWN-PT?

vcvpaiva commented 2 years ago

hi Michael,

My primary motivation is to have English data as support for the Portuguese data

Can you be more specific about what support it provides?

I think Alexandre's motivation (or at least mine) is that given that PWN is more complete than OWN, having the English version side-by-side with the Portuguese one shows human users what we're talking about. So OWN-PT works as a bilingual dictionary/thesaurus for humans and PWN is a kind of warranty that it's working as much as possible at the moment.

It also helps developers keep track of what we haven't done yet, of course.

Does this make sense to you?

Best, Valeria



goodmami commented 2 years ago

@vcvpaiva Thanks for explaining. That's great if OWN-EN assists the developers of OWN-PT with their work, but it's not clear what need it serves for other users of Wn. For instance, the bilingual dictionary functionality works just as well with the other English wordnets:

>>> import wn
>>> pt = wn.Wordnet('own-pt')
>>> bola = pt.synsets('bola')[0]
>>> bola.definition()
'objeto redondo que é atirado ou jogado ou chutado em jogos'
>>> bola.translate(lexicon='own-en')[0].lemmas()     # OWN-EN
['ball']
>>> bola.translate(lexicon='pwn:3.0')[0].lemmas()    # OMW English Wordnet from PWN 3.0
['ball']
>>> bola.translate(lexicon='oewn:2021')[0].lemmas()  # Open English Wordnet 2021
['ball']

So I'm hesitant to put another English wordnet in Wn's index because I wish to avoid overwhelming users with choices. I'm not entirely opposed to OWN-EN, however, as the other English wordnets do not provide morphosemantic links (although ultimately I'd prefer that those links are just added to the Open English Wordnet).

But, regardless of whether it is added to the index, if OWN-EN is not required for the general use of OWN-PT, I don't think there is any need for the collection entry own in the index. Basically, this means that users who want to install both would do:

>>> wn.download('own-pt')
>>> wn.download('own-en')

instead of

>>> wn.download('own')

And there's always the option of downloading directly from a URL:

>>> wn.download('https://github.com/own-pt/openWordnet-PT/releases/download/v1.0.0/own.tar.gz')

arademaker commented 2 years ago

Can you be more specific about what support it provides?

@goodmami I prefer to speak for myself! ;-) Yes, @vcvpaiva has good points but I want to report my own position here.

I am completing, among other projects, the annotation of the glosses (https://github.com/own-pt/glosstag). During that work, I have been collecting many potential changes to be made to PWN. If I change PWN, it is not PWN anymore... it is my own fork of PWN, called OWN-EN. For many applications I work with, OWN-EN itself is the goal; it is not only support for the construction of OWN-PT.

On the other hand, we don't have yet a clear way to contribute to the ILI (actually, as I said above, the ILI workflow is still not very clear to me). Finally, I don't agree with the current crowdsourcing procedure in EWN... Please don't get me wrong, I am still willing to contribute, but like you, I am quite vigilant and resistant when I don't believe in something.

Given all that, the current way to keep the OWN-PT and OWN-EN in sync and be able to map to all other wordnets is by forking the almost universally accepted wordnet and applying conservative and meticulous changes on OWN-EN.

Besides all of the above, I never considered OWN-PT a simple translation of PWN; I believe we do have applications that depend on concepts not covered by PWN. I do believe that some parts of PWN can be improved. But if I add a new concept to OWN-PT, I would like to have it replicated in OWN-EN, because a multilingual approach often helps in the conceptualization of things, as @vcvpaiva said.

So it is complicated, it is not only to support OWN-PT that we forked PWN in OWN-EN. Sorry, I should have been more careful with my words before. I believe we need and can expose to others a conservative extension/adaptation of PWN.

it has its own synset relations

Until recently, in the RDF, we didn't have the relations in the Portuguese part, only owl:sameAs mapping the Portuguese synsets to EN synsets. We chose to replicate the EN relations into PT to 1) make queries easier; 2) allow independent changes in the relations. We are still releasing the RDF, so OWN-PT and OWN-EN may find other uses independent of the wn library.

there's no <Requires> element on the Lexicon

I am reading now the https://globalwordnet.github.io/schemas/ and trying to understand the semantics of the requires tag. It may be the case that we should rethink the projection of the relations from EN to PT and, considering that we want to keep the bilingual mapping, we should always consider the PT part a projection of PT words in the structure of the EN part. On the other hand, we may have situations where a relationship between two concepts in EN may not be obvious in PT, especially when one of these concepts is not lexicalized in PT.

Not sure, all of that requires more thought. Wordnets are not dictionaries. Mapping concepts is different from providing translations of words. See http://wn.mybluemix.net/synset?id=01076514-v: in Portuguese, we don't have translations for many English verbs; we tend to use adverbs or complements to specify how/with what the action was taken. We have also had long discussions like https://github.com/own-pt/openWordnet-PT/issues/182#issuecomment-924176422 about English adjectives for which we don't have translations, only phrases, and those seem to be outside the scope of a lexical resource (see http://wn.mybluemix.net/synset?id=02576489-a).

But I do understand your perspective in the wn library... it seems fine to me if you don't want to have the own as a whole and even if you don't want to have the own-en indexed. As you said, users can always use the complete URL to download OWN-EN if it ends up relevant to them. After all, you are the wn owner and can ultimately decide what to maintain in the index of your library.

I think that the index you have in wn is almost a shortcut for the users. In the end, I believe it is better to make the library independent of the discussion about what are the relevant wordnets out there, what wordnets are worth being exposed to users or not, etc.

(BTW, the situation is similar to the maintainers of package managers for OS or programming languages, right? HomeBrew? QuickLisp, stack, pip, etc)

goodmami commented 2 years ago

Thanks for the additional context. So OWN-EN is a project in its own right, with its own distinguishing development methodology and features, and it is being used in some applications outside the development of OWN-PT. Then it seems fair to include it in the index. For the new user who wants an English wordnet but is unfamiliar with the various options, it might help if we provide a brief description of what sets it apart from the others. This could be in the documentation, the OWN-EN project page, or even the label attribute of the WN-LMF <Lexicon> element.

we don't have yet a clear way to contribute to the ILI

I expect @fcbond will get things running again in the next few months. Also see globalwordnet/cili#9.

I am reading now the https://globalwordnet.github.io/schemas/ and trying to understand the semantics of the requires tag.

There's also some description in McCrae et al. 2021 (which we are both coauthors of):

The purpose is to declare what, exactly, is required so that an application that hosts the wordnets can signal to the user if dependencies are unmet, or to limit the wordnets that may be used when traversing external synset relations. It is left implicit which elements or kinds of elements from the external wordnet become available to the dependent wordnet but, following the OMW’s behaviour, an application may choose to only allow synset relations and not, say, synsets or lexical entries.

That is, the <Requires> element is just a descriptive specification of the dependency and it is up to the application to interpret what that means. Wn uses it pretty much exactly as described above: warning of unmet dependencies and specifying which lexicons may be sources of borrowed synset relations.
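
To make the "warn of unmet dependencies" behaviour concrete, here is a minimal sketch using only the standard library. It is not how Wn implements the check. The XML fragment is hypothetical (as noted above, OWN-PT does not currently declare a <Requires>), the `unmet_dependencies` helper is made up for illustration, and the `id`/`version` attribute names on <Requires> are assumed from the WN-LMF 1.1 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical WN-LMF fragment; ids and versions are for illustration only.
WN_LMF = """
<LexicalResource>
  <Lexicon id="own-pt" version="1.0.0" language="pt">
    <Requires id="own-en" version="1.0.0"/>
  </Lexicon>
</LexicalResource>
"""


def unmet_dependencies(xml_text, installed):
    """List (lexicon, required) specifier pairs whose requirement is not installed."""
    unmet = []
    root = ET.fromstring(xml_text)
    for lexicon in root.iter("Lexicon"):
        lex_spec = f"{lexicon.get('id')}:{lexicon.get('version')}"
        for req in lexicon.findall("Requires"):
            req_spec = f"{req.get('id')}:{req.get('version')}"
            if req_spec not in installed:
                unmet.append((lex_spec, req_spec))
    return unmet


# A set of "id:version" specifiers stands in for the application's database.
print(unmet_dependencies(WN_LMF, installed={"pwn:3.0"}))
print(unmet_dependencies(WN_LMF, installed={"own-en:1.0.0"}))
```

An application could use the same list of declared requirements to restrict which lexicons may supply borrowed synset relations.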

You could use this mechanism for OWN-PT to borrow synset relations from OWN-EN, but if you've already ported them over then it's no longer a dependency and there'd be no benefit. However if you're adding more synsets with ILI correspondences in OWN-EN and aren't porting over the relevant synset relations, then it might make sense.

After all, you are the wn owner and can ultimately decide what to maintain in the index of your library.

True, but I don't mean to be a fickle gatekeeper. The Wn index and database are set up such that it's not possible to have two distinct lexicons with the same id and version. Collections also use up an ID in this namespace even though there is no lexicon getting added to the database with that ID (just the lexicons that are part of the collection). I therefore find myself a bit guarded about adding new collections.
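
The constraint can be pictured with a toy model. This is only a sketch of the idea, not Wn's index or database code; the `IndexNamespace` class and its `register` method are hypothetical names.

```python
class IndexNamespace:
    """Toy model in which collections and lexicons share one id namespace."""

    def __init__(self):
        self._entries = {}  # maps (id, version) to "lexicon" or "collection"

    def register(self, id, version, kind):
        key = (id, version)
        if key in self._entries:
            raise ValueError(
                f"{id}:{version} is already taken by a {self._entries[key]}"
            )
        self._entries[key] = kind


ns = IndexNamespace()
ns.register("own-pt", "1.0.0", "lexicon")
ns.register("own-en", "1.0.0", "lexicon")
ns.register("own", "1.0.0", "collection")  # the collection claims "own"

try:
    # A different project that later wants the same id and version is rejected.
    ns.register("own", "1.0.0", "lexicon")
except ValueError as err:
    print(err)
```

This is why a short, general identifier such as own is a scarcer resource than it first appears.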

I think that the index you have in wn is almost a shortcut for the users. In the end, I believe it is better to make the library independent of the discussion about what are the relevant wordnets out there, what wordnets are worth being exposed to users or not, etc.

Yes, good points. Users who come to Wn may not look beyond the list that is provided for them, so I don't want to be unnecessarily exclusive. And like I said above, I'm not opposed to adding OWN-EN if it's meant for use beyond the development of OWN-PT and if it's different from what's already there. I'm having a harder time finding a use case for the own collection entry as it only has two lexicons. Beyond being able to download both at once, the collection would allow you to have a label, language, and license statement for OWN as a whole. But that's about it for now.

In any case, I'm prepared to add the own collection to the index if you want to push for it. If in the future we get some other "OWN" project that wants the ID, we can decide what to do then.

goodmami commented 2 years ago

@arademaker I don't want to be unnecessarily obstructive so I've added the own collection to the index in the latest commit, as well as the own-en and own-pt packages. If it becomes a problem we can remove it in a future version.