Closed arademaker closed 2 years ago
That would be great. Here are the steps:
- [...] id and version attributes on <Lexicon> are appropriate for the release

Once that's done I can add an index entry to Wn so it can be downloaded more easily. From the OMW, the project and lexicon identifier is porwn (Francis takes the ISO 639-3 code and appends wn). If you want to use something else, like own-pt, then maybe we can change it for the OMW, too? What do you think about this, @fcbond?
Also, it's not required, but the more things that are linked (accurately) to ILI the better. This is especially true if you rely on, e.g., PWN for synset relations.
Also if you want to include files like your project README, LICENSE, or citation.bib, you can distribute it as a package directory (we also do this with OdeNet). There's some documentation here (https://wn.readthedocs.io/en/latest/guides/lexicons.html), and I'm happy to answer questions.
I would encourage you to go the package directory route, as I think it is good to include the README.
I guess I can change the name in the OMW release. This makes it clear that the new lexicon is an upgraded version.
-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University
I guess I can change the name in the OMW release. This makes it clear that the new lexicon is an upgraded version.
Yes, it is the choice of the wordnet maintainer what id and version they choose when self-publishing, but I guess the question is whether the OMW will stick with its current ID/version or adopt the upstream one. The trouble with having different IDs is that they might look like different resources rather than just different versions. The trouble with harmonizing the IDs is what to do with the old one in Wn's index. I could "deprecate" the old one in some way with a warning, then proceed as if the user had used the new ID. But currently there is no functionality like that.
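The "deprecate with a warning, then proceed with the new ID" idea could look roughly like the following. This is only a sketch of the concept: Wn has no such feature according to the comment above, and the DEPRECATED_IDS table and the resolve_lexicon_id function are hypothetical names made up for illustration.

```python
import warnings

# Hypothetical alias table: old lexicon ID -> new lexicon ID.
# Nothing like this exists in Wn; it only illustrates the idea.
DEPRECATED_IDS = {"porwn": "own-pt"}


def resolve_lexicon_id(requested: str) -> str:
    """Return the current ID for `requested`, warning if it is deprecated."""
    new_id = DEPRECATED_IDS.get(requested)
    if new_id is None:
        # Not a deprecated ID, so use it unchanged.
        return requested
    warnings.warn(
        f"lexicon ID {requested!r} is deprecated; using {new_id!r} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_id
```

With a table like this, a request for porwn would carry on as if own-pt had been asked for, and any other ID would pass through untouched.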
I'd suggest really renaming, as "porwn" will give all the wrong vibes. Thanks!
@vcvpaiva agreed! Unfortunately this isn't really up to me, but to the OWN-PT or OMW maintainers (who are both subscribed to this issue). That said, I do have write-access to the omw-data repository whence Wn gets the OMW lexicons, so if the maintainers agree I could effect that change.
anyway, we are preparing the LMF-XML of OWN-PT for closing this issue...
Yes, I prefer own-pt as identifier.
@arademaker Great, then just ensure your <Lexicon> element has id="own-pt" as an attribute. If you're planning on releasing and distributing it yourself (e.g., as is currently done with EWN and OdeNet) I can simply link to the published URL with the new identifier. Or if you're planning on packaging the new version with the OMW, I'll see about making those changes.
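One quick way to double-check the id and version before publishing is to read them straight from the XML. The snippet below is a minimal sketch using only the standard library. The SAMPLE document is a made-up fragment with placeholder attribute values (a real WN-LMF file also has a DOCTYPE and lexical entries), and lexicon_specifiers is a hypothetical helper, not part of Wn.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal fragment; the attribute values are placeholders.
SAMPLE = """\
<LexicalResource>
  <Lexicon id="own-pt" label="OpenWordnet-PT" language="pt"
           email="maintainer@example.com"
           license="https://creativecommons.org/licenses/by/4.0/"
           version="1.0.0">
  </Lexicon>
</LexicalResource>
"""


def lexicon_specifiers(xml_text: str) -> list[str]:
    """Return 'id:version' for every <Lexicon> element in the document."""
    root = ET.fromstring(xml_text)
    return [f"{lex.get('id')}:{lex.get('version')}" for lex in root.iter("Lexicon")]


print(lexicon_specifiers(SAMPLE))  # ['own-pt:1.0.0']
```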
I will make a release in our repository. I will let you know when ready or you can also watch https://github.com/own-pt/openWordnet-PT or the issue linked above. Thank you @goodmami
Hi @goodmami can you check the draft of my 1.0.0 release? The wn package was included as an extra artifact of the release. See https://github.com/own-pt/openWordnet-PT/releases. I am including two XML files. One based on the PWN and the other the OWN-PT. I am calling them OWN-EN and OWN-PT.
I just learned that you can't see drafts so I published it as a pre-release https://github.com/own-pt/openWordnet-PT/releases/tag/v1.0.0-alpha
Alexandre, two suggestions:
remove 'the" as this is our interpretation of the morphosemantic links.
Best, Valeria
Thanks, now I see it. It downloads and imports without issue:
>>> import wn
>>> wn.download('https://github.com/own-pt/openWordnet-PT/releases/download/v1.0.0-alpha/own.tar.gz')
Download [##############################] (14924942/14924942 bytes) Complete
Added own-pt:1.0.0 (OpenWordnet-PT)
Added own-en:1.0.0 (OpenWordnet-EN)
PosixPath('/home/goodmami/.wn_data/downloads/65646cb71b01af18d11e33eb1c5ef6a47d8c6b54')
>>> wn.synsets('São Paulo', lang='pt') # original lemma
[Synset('own-pt-synset-08857529-n'), Synset('own-pt-synset-11225661-n')]
>>> wn.synsets('Sao Paulo', lang='pt') # normalized diacritics
[Synset('own-pt-synset-08857529-n'), Synset('own-pt-synset-11225661-n')]
>>> wn.synsets('São Paulo', lang='pt')[0].translate(lang='en') # test ILI translation to English
[Synset('ewn-08876521-n'), Synset('pwn-08857529-n'), Synset('own-en-synset-08857529-n')]
>>> wn.synsets('São Paulo', lang='pt')[0].translate(lang='en')[0].lemmas()
['Sao Paulo']
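Some comments:
- Only packaging your OWN-EN and OWN-PT lexicons together gives users no choice in which ones to download. If they just want the OWN-PT lexicon without the other one, they should be able to do so. What we did with the omw-data release (https://github.com/bond-lab/omw-data/releases) was offer both the individual wordnets and the full set as separate downloads.
- It helps a bit (but is not necessary) to publish an index.toml file with your release (see omw-data's release, again), especially if you have multiple lexicons. If you package OWN-PT, OWN-EN, and ... OWN-ALL (?), they each need an entry with a unique project name.
- Why are you packaging a version of the PWN at all? Is it just to include morphosemantic links? If so, maybe we can just ~rebuild the version packaged with OMW~ add them to the Open English Wordnet, ~but we should ask Christiane if we want to call it the Princeton WordNet~ (cannot).
- The OWN-EN has gained and lost some words compared to ~PWN 3.0~ OMW's English wordnet derived from the PWN: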
>>> wn.lexicons(lang='en')
(<Lexicon ewn:2020 [en]>, <Lexicon pwn:3.0 [en]>, <Lexicon own-en:1.0.0 [en]>)
>>> pwn30 = wn.Wordnet('pwn:3.0')
>>> own_en = wn.Wordnet('own-en')
>>> missing = {w.lemma() for w in pwn30.words()} - {w.lemma() for w in own_en.words()}
>>> gained = {w.lemma() for w in own_en.words()} - {w.lemma() for w in pwn30.words()}
>>> len(missing)
916
>>> len(gained)
417
Most or all of these seem to be bugs in how the ~OWN's PWN~ OMW's English lexicon was created, as it includes the adjposition in adjectives' lemmas (see bond-lab/omw-data#8):
>>> list(missing)[:20]
['guardant(ip)', 'healing(p)', 'lone(a)', 'live(a)', 'astir(p)', 'down(a)', 'prior(a)', 'motivative(a)', 'smelling(p)', 'maternal(p)', 'aghast(p)', 'anaesthetic(a)', 'on the offensive(p)', 'au naturel(p)', 'indisposed(p)', 'privy(p)', 'right-hand(a)', 'trespassing(a)', 'great(p)', 'unbeholden(p)']
>>> [w.lemma() for w in own_en.words() if w.lemma().endswith(')')] # None in OWN-EN with (p), (a), (ip) in lemma
[]
But that's not the whole story. This will require some more digging. Notice to @fcbond: OMW needs to rebuild its PWN lexicons.
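To see how much of the difference is explained by the adjposition markers alone, the lemmas can be normalized before taking the set difference. The following is a rough sketch with toy data: strip_adjposition is a hypothetical helper, the first three lemmas come from the sample of `missing` above, and the two small sets are illustrative stand-ins rather than real lexicon contents.

```python
import re

# Matches a trailing adjposition marker such as "(a)", "(p)", or "(ip)".
ADJPOSITION = re.compile(r"\((?:a|p|ip)\)$")


def strip_adjposition(lemma: str) -> str:
    """Remove a trailing adjposition marker from a lemma, if present."""
    return ADJPOSITION.sub("", lemma)


# Toy stand-ins for the two lexicons' lemma sets (illustrative only).
omw_lemmas = {"lone(a)", "astir(p)", "guardant(ip)", "ball"}
own_lemmas = {"lone", "astir", "guardant", "ball"}

raw_missing = omw_lemmas - own_lemmas
normalized_missing = {strip_adjposition(lemma) for lemma in omw_lemmas} - own_lemmas

print(len(raw_missing), len(normalized_missing))  # 3 0
```

Applied to the real word lists, whatever remains after normalization would be the part that still needs digging.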
One more thing:
- [...] 1.0.0 (although isn't 1.0 enough? Do you plan to release frequently enough that a third-level version specifier is necessary?), because the ID is changing. Before it was released with OMW with the id as porwn and the version 1.3+omw, so this is effectively a new resource. You might just explain the history somewhere.

Hi @goodmami, thank you for the feedback, I am addressing all of it in the 1.0.0 release. The only remaining relevant topic is:
- Why are you packaging a version of the PWN at all? Is it just to include morphosemantic links? If so, maybe we can just ~rebuild the version packaged with OMW~ add them to the Open English Wordnet, ~but we should ask Christiane if we want to call it the Princeton WordNet~ (cannot).
Indeed, our OWN distribution of English and Portuguese intersects with @fcbond's OMW distribution of wordnets. At some point, I would like to discuss with @fcbond a closer collaboration. My primary motivation is to have English data as support for the Portuguese data, but enhancing EN data independently from the @jmccrae's https://github.com/globalwordnet/english-wordnet or any other fork of Princeton data. On the other hand, we could make our Portuguese data always part of OMW, but we would miss the ability to control releases independently. This discussion is also related to the ILI initiative and my still not clear understanding of its evolution.
My primary motivation is to have English data as support for the Portuguese data, but enhancing EN data independently from the @jmccrae's https://github.com/globalwordnet/english-wordnet or any other fork of Princeton data.
I think I can speak for Francis as well as myself in saying that we have zero interest in maintaining or developing another fork of the Princeton WordNet. Our work in updating the OMW English Wordnet is only for compatibility with the old versions of the PWN, and we won't be fixing or augmenting them further than that. The only maintenance we plan to do is fixing remaining bugs in the format conversion. If you need to develop another fork for your work, that's fine, and we also really appreciate the issues your group has raised.
This discussion is also related to the ILI initiative and my still not clear understanding of its evolution.
With the advent of the ILI, there is no longer any need to keep using the old English wordnets as the ILI IDs are stable across versions, unlike the WNDB offsets. OMW's non-English wordnets can now link with the Open English Wordnet, or even with each other without any English wordnet (although they are still mostly devoid of their own synset relations, so they aren't terribly useful without some English wordnet).
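The point about ILI IDs being stable while offsets differ can be pictured as a pivot table. This is a conceptual sketch only, not how Wn stores anything: the synset IDs are the ones from the São Paulo session earlier in the thread, while the ILI value "i-example-1" and the ILI_LINKS / translate names are made up for illustration.

```python
# Per-lexicon mapping from synset ID to ILI ID.
# "i-example-1" is a placeholder, not a real ILI identifier.
ILI_LINKS = {
    "own-pt": {"own-pt-synset-08857529-n": "i-example-1"},
    "pwn:3.0": {"pwn-08857529-n": "i-example-1"},
    "ewn:2020": {"ewn-08876521-n": "i-example-1"},
}


def translate(synset_id: str, source: str, target: str) -> list[str]:
    """Find synsets in `target` that share an ILI with `synset_id` in `source`."""
    ili = ILI_LINKS[source].get(synset_id)
    if ili is None:
        # No ILI link means there is nothing to pivot through.
        return []
    return [sid for sid, other in ILI_LINKS[target].items() if other == ili]


# The offsets differ (08857529 vs 08876521), but the shared ILI links them.
print(translate("own-pt-synset-08857529-n", "own-pt", "ewn:2020"))
# ['ewn-08876521-n']
```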
Finally, as this pertains to Wn, the OWN lexicons do not need to be in OMW to be included in Wn's index (see OdeNet and OEWN, for example). I'm happy to include them in the next release, although I'd hope to be able to install the OWN-PT and OWN-EN lexicons separately.
I also think that it would be best to have these as two separate packages, so people can download just the wordnet that they want, ...
I think it is useful to have the English lexicon with your fixes available, but most of the time I expect people will just want the Open English Wordnet.
G'day,
On Sat, Sep 25, 2021 at 2:15 PM Michael Wayne Goodman < @.***> wrote:
My primary motivation is to have English data as support for the Portuguese data, but enhancing EN data independently from the @jmccrae's https://github.com/globalwordnet/english-wordnet or any other fork of Princeton data.
I can see the value of this internally, but at least for me, I am trying to move all the changes we have made in the internal wordnet at NTU into the Open English Wordnet, I would prefer to have one good wordnet we can all use!
I think I can speak for Francis as well as myself in saying that we have zero interest in maintaining or developing another fork of the Princeton WordNet. Our work in updating the OMW English Wordnet is only for compatibility with the old versions of the PWN, and we won't be fixing or augmenting them further than that. The only maintenance we plan to do is fixing remaining bugs in the format conversion. If you need to develop another fork for your work, that's fine, and we also really appreciate the issues your group has raised.
This discussion is also related to the ILI initiative and my still not clear understanding of its evolution.
With the advent of the ILI, there is no longer any need to keep using the old English wordnets as the ILI IDs are stable across versions, unlike the WNDB offsets. OMW's non-English wordnets can now link with the Open English Wordnet, or even with each other without any English wordnet (although they are still mostly devoid of their own synset relations, so they aren't terribly useful without some English wordnet).
I am still working on properly realizing the dream of adding new entries to ILI, I hope to make some real progress early next year. I would be happy to have a chat with you (and Michael if he is free) about this online.
Finally, as this pertains to Wn, the OWN lexicons do not need to be in OMW to be included in Wn's index (see OdeNet and OEWN, for example). I'm happy to include them in the next release, although I'd hope to be able to install the OWN-PT and OWN-EN lexicons separately.
I agree that adding them (separately) there is the best option.
G'day Francis and Michael,
Thanks for copying me in this email.
I agree with Alexandre that I also would prefer not to be tied-up to the Open English Wordnet, if possible. (Version zero of the English WordNet--only removing typos and clear mistakes from PWN seemed easy enough to achieve, but more than that seems complicated.)
I am trying to move all the changes we have made in the internal wordnet at NTU into the Open English Wordnet, I would prefer to have one good wordnet we can all use!
This would be ideal, but the use cases are moving in very different directions, it seems to me.
But it also seems to me more sensible (as you both seem to be saying) to have separate downloads of OWN-PT and OWN-EN.
So if you do have a conversation about it
I would be happy to have a chat with you (and Michael if he is free) about this online.
I'd try to participate, if you will have me.
best, Valeria
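Hi @goodmami, I have just published the v1.0.0 release. Regarding your comments above:
- I kept the last digit, for now, v1.0.0 and not v1.0. You are probably right regarding the use of the last digit, but I feel like if we need to remove the last digit later it will be easier than adding it. Anyway, using the full semantic version schema makes it clear for people that know about semantic versioning and we don't need to add further explanation. Additionally, having an extra zero in the end of the releases, if we end up not creating patches, will not hurt, right?
- I have now three tar.gz files: the collection own-XX (where XX is PT and EN), and the wordnets themselves: own-en and own-pt. I also don't anticipate much interest in the own-en itself, since the https://en-word.net seems to be actively maintained, but again, it doesn't hurt, we just need to be clear about the origin of each data.
- I added the index.toml file

Next, I would surely be interested in a conversation with you and @fcbond when possible. We also need to clean up our workflow and automate the release script. If possible, once you test the release, I suspect you can close this issue.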
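For readers less familiar with semantic versioning, the difference between the version strings mentioned in this thread can be checked with a small parser. This is a sketch with a simplified pattern, not the full official grammar, and parse_semver is a hypothetical helper. Nothing in the thread suggests Wn itself requires this format.

```python
import re

# Simplified MAJOR.MINOR.PATCH pattern with optional pre-release and
# build-metadata parts; the official grammar is stricter.
SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)"      # version core
    r"(?:-([0-9A-Za-z.-]+))?"    # optional pre-release, e.g. "alpha"
    r"(?:\+([0-9A-Za-z.-]+))?$"  # optional build metadata
)


def parse_semver(version: str):
    """Return (major, minor, patch, prerelease, build), or None if no match."""
    match = SEMVER.match(version)
    if match is None:
        return None
    major, minor, patch, prerelease, build = match.groups()
    return (int(major), int(minor), int(patch), prerelease, build)


for candidate in ("1.0.0", "1.0.0-alpha", "1.0", "1.3+omw"):
    print(candidate, parse_semver(candidate))
```

Under this pattern 1.0.0 and 1.0.0-alpha parse, while 1.0 and the old 1.3+omw do not, because both lack the third component.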
Hi,
thanks for splitting them,
I would be happy to have a conversation with you and Michael and Valeria, I am generally OK on Mon, Wed, or Fri mornings (your Sun, Tue or Thur afternoon/evening, I think).
Thanks for the message, Francis! I hope you don't manage to organize it for this week, as tomorrow I have major dental work scheduled. oh well! best Valeria
I don't think there is any rush, so next week would be fine. I hope the surgery goes well.
I believe we can close that @goodmami , don't you think?
@arademaker let's leave it open until OWN-PT is actually added to Wn's index.
Also:
- [...] index.toml entries all said "Please consult the LICENSE files.", which is less useful than just saying "http://creativecommons.org/licenses/by/4.0/" or "CC-BY 4.0". I can fix this on my end.
- [...] ._LICENSE); this is not a blocking issue as Wn just ignores them, but it could be improved.

Hi @goodmami, thank you for always being so vigilant!
- [...] index.toml, you are right, it can simply link to the Creative Commons website, but referencing the LICENSE file doesn't duplicate information; if the license changes, I only need to change it in one place.

Thanks for the responses.
but reference the LICENSE file don't duplicate information, if the license change, I need only to change in one place
Wn's documentation doesn't clearly mention this, but the license field may be versioned as well. That is, you can put the license key in the index.toml file under the project, which serves as a default, or under a specific version. Here is how it is retrieved, where the project-level license is used only if the version-level one isn't specified:
https://github.com/goodmami/wn/blob/3411b035ca4be72a6a86629fbc196bf08b5a24d6/wn/_config.py#L146
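The fallback described above can be sketched as follows. This is not the code at the linked line, only an illustration of the same lookup order: the INDEX dictionary stands in for a parsed index.toml, its layout is an assumption, and the "9.9.9" version with its license value is invented to show an override.

```python
# Stand-in for a parsed index.toml; layout and values are illustrative.
INDEX = {
    "own-pt": {
        "label": "OpenWordnet-PT",
        # Project-level license: the default for every version.
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "versions": {
            "1.0.0": {},  # no license here, so the project default applies
            "9.9.9": {"license": "hypothetical-other-license"},  # override
        },
    },
}


def license_for(project: str, version: str):
    """Prefer the version-level license; fall back to the project-level one."""
    info = INDEX[project]
    version_info = info["versions"][version]
    return version_info.get("license", info.get("license"))


print(license_for("own-pt", "1.0.0"))  # http://creativecommons.org/licenses/by/4.0/
print(license_for("own-pt", "9.9.9"))  # hypothetical-other-license
```

So if the license changed in a later release, only that version's entry would need its own license key, and older versions would keep the project default.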
Not sure if I take OWN-EN as a subproject of OWN-PT, although it was started with that goal.
Sorry, what I meant is that OWN-EN was created to support OWN-PT and not to be used as a standalone alternative English wordnet. At least, that's how I understand it. Thus, OWN-PT is the primary product of the OWN project and, in this case, claiming the more general own identifier mainly for the own-pt lexicon seemed like too big a grab. However, if Wn gets support for redirects in the index (#142), then it might not be a big deal, assuming no other existing wordnet has a claim to the own identifier.
Maybe a longer name like OpenWordnets, OpenWordnet-PT, or OpenWordnet-EN?
Those seem better for the label field. The id should be short.
@arademaker, earlier you said, in reference to OWN-EN, the following:
My primary motivation is to have English data as support for the Portuguese data
Can you be more specific about what support it provides? It looks like OWN-PT has ILIs for all its synsets, it has its own synset relations, and there's no <Requires> element on the Lexicon, so it seems like OWN-EN is not actually necessary for using OWN-PT?
hi Michael,
My primary motivation is to have English data as support for the Portuguese data
Can you be more specific about what support it provides?
I think Alexandre's motivation (or at least mine) is that given that PWN is more complete than OWN, having the English version side-by-side with the Portuguese one shows human users what we're talking about. So OWN-PT works as a bilingual dictionary/thesaurus for humans and PWN is a kind of warranty that it's working as much as possible at the moment.
It also helps developers keep track of what we haven't done yet, of course.
Does this make sense to you?
Best, Valeria
@vcvpaiva Thanks for explaining. That's great if OWN-EN assists the developers of OWN-PT with their work, but it's not clear what need it serves for other users of Wn. For instance, the bilingual dictionary functionality works just as well with the other English wordnets:
>>> pt = wn.Wordnet('own-pt')
>>> bola = pt.synsets('bola')[0]
>>> bola.definition()
'objeto redondo que é atirado ou jogado ou chutado em jogos'
>>> bola.translate(lexicon='own-en')[0].lemmas() # OWN-EN
['ball']
>>> bola.translate(lexicon='pwn:3.0')[0].lemmas() # OMW English Wordnet from PWN 3.0
['ball']
>>> bola.translate(lexicon='oewn:2021')[0].lemmas() # Open English Wordnet 2021
['ball']
So I'm hesitant to put another English wordnet in Wn's index because I wish to avoid overwhelming users with choices. I'm not entirely opposed to OWN-EN, however, as the other English wordnets do not provide morphosemantic links (although ultimately I'd prefer that those links are just added to the Open English Wordnet).
But, regardless of whether it is added to the index, if OWN-EN is not required for the general use of OWN-PT, I don't think there is any need for the collection entry own in the index. Basically, this means that users who want to install both would do:
>>> wn.download('own-pt')
>>> wn.download('own-en')
instead of
>>> wn.download('own')
And there's always the option of downloading directly from a URL:
>>> wn.download('https://github.com/own-pt/openWordnet-PT/releases/download/v1.0.0/own.tar.gz')
Can you be more specific about what support it provides?
@goodmami I prefer to speak for myself! ;-) Yes, @vcvpaiva has good points but I want to report my own position here.
I am completing, among other projects, the annotation of the glosses (https://github.com/own-pt/glosstag). During that work, I have been collecting many potential changes to be made to PWN. If I change PWN, it is not PWN anymore... it is my own fork of PWN, called OWN-EN. For many applications I work with, OWN-EN itself is the goal; it is not only a support for OWN-PT construction.
On the other hand, we don't have yet a clear way to contribute to the ILI (actually, as I said above, the ILI workflow is still not very clear to me). Finally, I don't agree with the current crowdsourcing procedure in EWN... Please don't get me wrong, I am still willing to contribute, but like you, I am quite vigilant and resistant when I don't believe in something.
Given all that, the current way to keep the OWN-PT and OWN-EN in sync and be able to map to all other wordnets is by forking the almost universally accepted wordnet and applying conservative and meticulous changes on OWN-EN.
Besides all of the above, I never considered OWN-PT as a simple translation from PWN, I believe we do have applications that depend on concepts not covered by PWN. I do believe that some parts of PWN can be improved. But if I add a new concept to OWN-PT, I would like to have it replicated in the OWN-EN, because having a multilingual approach many times helps in the conceptualization of things, as @vcvpaiva said.
So it is complicated, it is not only to support OWN-PT that we forked PWN in OWN-EN. Sorry, I should have been more careful with my words before. I believe we need and can expose to others a conservative extension/adaptation of PWN.
it has its own synset relations
Until recently, in the RDF, we didn't have the relations in the Portuguese part, only owl:sameAs links mapping the Portuguese synsets to EN synsets. We chose to replicate the EN relations into PT to 1) make queries easier; 2) allow independent changes in the relations. We are still releasing the RDF, so OWN-PT and OWN-EN may find other uses independent of the `wn` library.
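To illustrate the replication described above, here is a minimal sketch (not the actual OWN-PT tooling) of projecting English synset relations onto Portuguese synsets through a synset-to-synset mapping. The synset identifiers, the `en_relations` data, and the `project_relations` helper are all hypothetical.

```python
# Hypothetical data: (source, relation, target) triples over English synsets.
en_relations = [
    ("en-ball", "hypernym", "en-equipment"),
    ("en-ball", "hyponym", "en-football"),
    ("en-equipment", "hypernym", "en-instrumentality"),
]

# Hypothetical mapping from English synsets to their Portuguese counterparts
# (standing in for the owl:sameAs links); "en-instrumentality" has none.
en_to_pt = {
    "en-ball": "pt-bola",
    "en-equipment": "pt-equipamento",
    "en-football": "pt-bola-de-futebol",
}


def project_relations(relations, mapping):
    """Copy each relation whose two endpoints both have a mapped synset."""
    projected = []
    for source, relation, target in relations:
        if source in mapping and target in mapping:
            projected.append((mapping[source], relation, mapping[target]))
    return projected


pt_relations = project_relations(en_relations, en_to_pt)
for triple in pt_relations:
    print(triple)
```

In this sketch a relation is dropped when one endpoint has no Portuguese counterpart. Once copied, the Portuguese relations can be changed independently of the English ones.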
there's no `<Requires>` element on the Lexicon
I am reading now the https://globalwordnet.github.io/schemas/ and trying to understand the semantics of the `requires` tag. It may be the case that we should rethink the projection of the relations from EN to PT and, considering that we want to keep the bilingual mapping, we should always consider the PT part a projection of PT words in the structure of the EN part. On the other hand, we may have situations where a relationship between two concepts in EN may not be obvious in PT, especially when one of these concepts is not lexicalized in PT.
Not sure, all of that requires more thought. Wordnets are not dictionaries. Mapping concepts is different from providing translations of words. See http://wn.mybluemix.net/synset?id=01076514-v: in Portuguese, we don't have translations for many English verbs; we tend to use adverbs or complements to specify how or with what the action was taken. We also have long discussions, like https://github.com/own-pt/openWordnet-PT/issues/182#issuecomment-924176422, about English adjectives for which we have no translations except as phrases, and those seem to be outside the scope of a lexical resource (see http://wn.mybluemix.net/synset?id=02576489-a).
But I do understand your perspective in the `wn` library... it seems fine to me if you don't want to have the `own` as a whole and even if you don't want to have the `own-en` indexed. As you said, users can always use the complete URL to download OWN-EN if it ends up relevant to them. After all, you are the `wn` owner and can ultimately decide what to maintain in the index of your library.
I think that the index you have in `wn` is almost a shortcut for the users. In the end, I believe it is better to make the library independent of the discussion about what are the relevant wordnets out there, what wordnets are worth being exposed to users or not, etc.
(BTW, the situation is similar to that of the maintainers of package managers for operating systems or programming languages, right? Homebrew, Quicklisp, stack, pip, etc.)
Thanks for the additional context. So OWN-EN is a project in its own right, with its own distinguishing development methodology and features, and it is being used in some applications outside the development of OWN-PT. Then it seems fair to include it in the index. For the new user who wants an English wordnet but is unfamiliar with the various options, it might help if we provide a brief description of what sets it apart from the others. This could be in the documentation, the OWN-EN project page, or even the `label` attribute of the WN-LMF `<Lexicon>` element.
we don't have yet a clear way to contribute to the ILI
I expect @fcbond will get things running again in the next few months. Also see globalwordnet/cili#9.
I am reading now the https://globalwordnet.github.io/schemas/ and trying to understand the semantics of the `requires` tag.
There's also some description in McCrae et al. 2021 (which we are both coauthors of):
The purpose is to declare what, exactly, is required so that an application that hosts the wordnets can signal to the user if dependencies are unmet, or to limit the wordnets that may be used when traversing external synset relations. It is left implicit which elements or kinds of elements from the external wordnet become available to the dependent wordnet but, following the OMW’s behaviour, an application may choose to only allow synset relations and not, say, synsets or lexical entries.
That is, the `<Requires>` element is just a descriptive specification of the dependency and it is up to the application to interpret what that means. Wn uses it pretty much exactly as described above: warning of unmet dependencies and specifying which lexicons may be sources of borrowed synset relations.
You could use this mechanism for OWN-PT to borrow synset relations from OWN-EN, but if you've already ported them over then it's no longer a dependency and there'd be no benefit. However if you're adding more synsets with ILI correspondences in OWN-EN and aren't porting over the relevant synset relations, then it might make sense.
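As a rough illustration of the "warn about unmet dependencies" behaviour, the following sketch reads `<Requires>` elements from a WN-LMF fragment and compares them with a set of installed lexicons. It is not Wn's implementation. It assumes that `<Requires>` carries `id` and `version` attributes, and the lexicon ids, the `WN_LMF` fragment, and the `unmet_dependencies` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical WN-LMF fragment; the ids, versions, and other attribute
# values are illustrative and not taken from a real OWN-PT release.
WN_LMF = """
<LexicalResource>
  <Lexicon id="example-pt" version="1.0" language="pt" label="Example">
    <Requires id="example-en" version="1.0"/>
  </Lexicon>
</LexicalResource>
"""


def unmet_dependencies(xml_text, installed):
    """Return (id, version) pairs that are required but not installed."""
    root = ET.fromstring(xml_text)
    unmet = []
    for lexicon in root.iter("Lexicon"):
        for req in lexicon.iter("Requires"):
            dep = (req.get("id"), req.get("version"))
            if dep not in installed:
                unmet.append(dep)
    return unmet


# Only the dependent lexicon is installed, so its requirement is unmet.
installed = {("example-pt", "1.0")}
missing = unmet_dependencies(WN_LMF, installed)
for dep_id, dep_version in missing:
    print(f"warning: unmet dependency {dep_id}:{dep_version}")
```

An application could use the same list of declared dependencies to restrict which lexicons it consults when borrowing synset relations.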
After all, you are the `wn` owner and can ultimately decide what to maintain in the index of your library.
True, but I don't mean to be a fickle gatekeeper. The Wn index and database are set up such that it's not possible to have two distinct lexicons with the same `id` and `version`. Collections also use up an ID in this namespace even though there is no lexicon getting added to the database with that ID (just the lexicons that are part of the collection). I therefore find myself a bit guarded about adding new collections.
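The namespace constraint can be pictured with a toy model. This is only a sketch of the idea, not how Wn stores its index; the `ToyIndex` class, the `DuplicateEntryError` exception, and the version strings are hypothetical.

```python
class DuplicateEntryError(Exception):
    """Raised when an (id, version) pair is already taken."""


class ToyIndex:
    """Hypothetical index: lexicons and collections share one ID namespace."""

    def __init__(self):
        self._entries = {}  # maps (id, version) to the kind of entry

    def add(self, entry_id, version, kind):
        key = (entry_id, version)
        if key in self._entries:
            existing = self._entries[key]
            raise DuplicateEntryError(
                f"{entry_id}:{version} is already registered as a {existing}"
            )
        self._entries[key] = kind


index = ToyIndex()
index.add("own-pt", "1.0.0", "lexicon")
index.add("own-en", "1.0.0", "lexicon")
index.add("own", "1.0.0", "collection")

# A later lexicon that wants the same id and version is rejected,
# even though the existing entry is only a collection.
try:
    index.add("own", "1.0.0", "lexicon")
    conflict = False
except DuplicateEntryError as exc:
    conflict = True
    print(exc)
```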
I think that the index you have in `wn` is almost a shortcut for the users. In the end, I believe it is better to make the library independent of the discussion about what are the relevant wordnets out there, what wordnets are worth being exposed to users or not, etc.
Yes, good points. Users who come to Wn may not look beyond the list that is provided for them, so I don't want to be unnecessarily exclusive. And like I said above, I'm not opposed to adding OWN-EN if it's meant for use beyond the development of OWN-PT and if it's different from what's already there. I'm having a harder time finding a use case for the `own` collection entry as it only has two lexicons. Beyond being able to download both at once, the collection would allow you to have a label, language, and license statement for OWN as a whole. But that's about it for now.
In any case, I'm prepared to add the `own` collection to the index if you want to push for it. If in the future we get some other "OWN" project that wants the ID, we can decide what to do then.
@arademaker I don't want to be unnecessarily obstructive so I've added the `own` collection to the index in the latest commit, as well as the `own-en` and `own-pt` packages. If it becomes a problem we can remove it in a future version.
I would like to make OWN-PT data available to be directly indexed by this module instead of indexed as part of the OMW... Can you give me directions?