globalwordnet / schemas

WordNet-LMF formats
https://globalwordnet.github.io/schemas/

Breaking changes #43

Open goodmami opened 3 years ago

goodmami commented 3 years ago

This issue is meant to collect the changes we would like to make to WN-LMF but have not because doing so would break backward compatibility. When we get to a 2.0 version we have a chance for some simplification and belt-tightening, so it would be a shame if we miss some and have to wait for the next major version.

For better discussion, these issues could be broken up into separate issues (maybe with an appropriate label or milestone to group them?).

Deferred Changes

These are changes we would have made in WN-LMF 1.1 if backwards compatibility were not an issue.

Proposed Changes

These are new changes that we might consider.

fcbond commented 3 years ago

Hi,

I was thinking of using Tag much more broadly, for example to show roots in Malay, irregular (broken) plurals in Arabic, voweled and vowelless variants in Hebrew and so forth. So I don't think it can be replaced by just script.

On Mon, Feb 8, 2021 at 2:19 PM Michael Wayne Goodman <notifications@github.com> wrote:

This issue is meant to collect the changes we would like to make to WN-LMF but have not because doing so would break backward compatibility. When we get to a 2.0 version we have a chance for some simplification and belt-tightening, so it would be a shame if we miss some and have to wait for the next major version.

For better discussion, these issues could be broken up into separate issues (maybe with an appropriate label or milestone to group them?).

Deferred Changes

These are changes we would have made in WN-LMF 1.1 if backwards compatibility were not an issue.

  • Remove from ; it became a child of
  • Remove the senses attribute from ; these associations are handled by the subcat attribute on elements
  • Make the id attribute on required

Proposed Changes

These are new changes that we might consider

  • Remove ? The use case presented in Bond et al. 2020 ("Some Issues with Building a Multilingual Wordnet") seems more elegantly handled by the script attribute on and :

Above, if script were limited to ISO15924 script names, then all 3 pinyin variants would be just "Latn", so I used BCP-47-like tags minus the language and region names. The "pinyin" variant and private-use tags "numeric" and "simple" can be used to distinguish them.

-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

lmorgadodacosta commented 3 years ago

I agree with Francis. I would very much like to keep the Tag to store flexible annotations on Lemmas and Forms. These won't be meaningful for OMW (as it reads the LMF) but they can be displayed as a list of tag-values.

Also, if projects keep using Tag as a flexible layer to store information, OMW can also better understand what special "tags" could be embedded in the DTD as 'officially supported' with an agreed upon format/meaning.

goodmami commented 3 years ago

@fcbond, @lmorgadodacosta thanks for the context. I haven't seen tags used at all outside of the paper, so if there's a good and active use case (other than the "script" one, for which I stand by my previous statement), then it makes sense to leave it in. For instance, I've been wondering how to distinguish the various lemmas and forms in EWN, like stimulus/stimuli. That could be done with <Tag>:

      <Lemma partOfSpeech="n" writtenForm="stimulus" />
      <Form writtenForm="stimuli">
        <Tag category="number">PL</Tag>
      </Form>
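
As a rough illustration of how a consumer might read such annotations, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The wrapping `<LexicalEntry>` element and its `id` value are assumptions added so that the fragment parses on its own, and the function name `form_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment above, wrapped in a LexicalEntry so it parses on its own.
# The id value is made up for illustration.
ENTRY_XML = """
<LexicalEntry id="example-stimulus-n">
  <Lemma partOfSpeech="n" writtenForm="stimulus" />
  <Form writtenForm="stimuli">
    <Tag category="number">PL</Tag>
  </Form>
</LexicalEntry>
"""


def form_tags(entry):
    """Map each Form's writtenForm to a list of (category, value) pairs."""
    result = {}
    for form in entry.findall("Form"):
        tags = [
            (tag.get("category"), (tag.text or "").strip())
            for tag in form.findall("Tag")
        ]
        result[form.get("writtenForm")] = tags
    return result


entry = ET.fromstring(ENTRY_XML)
print(form_tags(entry))
```

Because the tags come back as plain category/value pairs, a tool that does not understand a particular category could still show them as a list of tag-values.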

Relatedly, I've been wondering which elements of WN-LMF are meant for modeling a language's wordnet and which are for peripheral annotation tasks or processes. For instance, <Count> doesn't really model something true about a language, but something that can be computed for some corpora, so why is this part of WN-LMF? And <ILIDefinition> is only used when a wordnet is the vehicle by which new ILI candidates are proposed, otherwise those definitions are included with the ILI resource, so it seems like there could be another channel for proposing candidates (e.g., by creating issues at https://github.com/globalwordnet/cili/).

fcbond commented 3 years ago

Hi,

I think frequency information is a part of knowledge of language. Any corpus count is only an imperfect sample, but I would rather make available what we have when we have it.

For the ILI I think we tried to get a balance between purely modelling and generally useful. We only want candidates that come with a wordnet, and packaging them together makes this easier to manage.

goodmami commented 3 years ago

> I think frequency information is a part of knowledge of language. Any corpus count is only an imperfect sample, but I would rather make available what we have when we have it.

Sorry, I think my "something true" comment wasn't accurate. I was trying to draw a line between "gold", human-added information and the automatically computed information. I think the line is even blurrier because those computed counts are, I think, from human annotations.

So you have this information and you'd like to make it available. That's great, but I still think it would be better as a separate resource, similar to how the information-content (IC) data files are distributed separately. It's also easier that way to track where the counts came from, e.g., in a file called ntumc-pwn-3.0-counts.tsv instead of having a dc:source="NTUMC" attribute on every <Count> element in the XML file.
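
To make the idea of a separate counts resource a bit more concrete, here is a minimal Python sketch that pulls counts out of the XML and writes them as TSV. It assumes `<Count>` appears as a child of `<Sense>`; the lexicon content, the IDs, the count value, the two-column layout, and the function name `counts_to_tsv` are all invented for illustration, and no such file format has been agreed on.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A made-up lexicon fragment; every id and the count value are hypothetical.
LEXICON_XML = """
<Lexicon id="example">
  <LexicalEntry id="example-entry-1">
    <Lemma partOfSpeech="n" writtenForm="example" />
    <Sense id="example-sense-1" synset="example-synset-1">
      <Count>12</Count>
    </Sense>
    <Sense id="example-sense-2" synset="example-synset-2" />
  </LexicalEntry>
</Lexicon>
"""


def counts_to_tsv(lexicon, out):
    """Write one 'sense<TAB>count' row per Count element found under a Sense."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sense", "count"])
    for sense in lexicon.iter("Sense"):
        for count in sense.findall("Count"):
            writer.writerow([sense.get("id"), int(count.text)])


buffer = io.StringIO()
counts_to_tsv(ET.fromstring(LEXICON_XML), buffer)
print(buffer.getvalue())
```

With the source named once in the file name, the rows themselves would not need to repeat it.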

Also, practically, I have not seen any wordnets distributed with this information (I suspect you use it internally for annotation projects), and trying to model it properly in Wn complicates the database schema and code. I guess I'm arguing for a worse-is-better approach.

> We only want candidates that come with a wordnet, and packaging them together makes this easier to manage.

My position here is essentially the same as my last argument regarding schema/code complexity. It seems like the format has been refitted with a feature that's only relevant for CILI's development and not for modeling a wordnet. A proposed ILI with ili="in" must be special-cased: it's not the case that all synsets with ili="in" are interlingually aligned; an <ILIDefinition> on a synset whose ili attribute is not "in" should probably be ignored, as those definitions come from CILI; and so on. I think it would be better to propose new ILIs by declaring the synset they belong to, such as in a TSV file (examples from EWN 2020):

synset  definition
ewn-05698967-n  the barrier preventing Blacks from participating in various activities with whites
ewn-05822120-n  (plural) something that reminds you of someone or something
...
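
A file like this would be simple to consume. As a minimal sketch, assuming tab-separated columns with a header row as shown above (the function name `load_ili_proposals` is hypothetical), it could be read in Python like so:

```python
import csv
import io

# The two example rows above, assuming the columns are tab-separated.
PROPOSALS_TSV = (
    "synset\tdefinition\n"
    "ewn-05698967-n\tthe barrier preventing Blacks from participating "
    "in various activities with whites\n"
    "ewn-05822120-n\t(plural) something that reminds you of someone or something\n"
)


def load_ili_proposals(stream):
    """Return a list of (synset_id, definition) pairs from a TSV stream."""
    # QUOTE_NONE keeps any quotation marks in a definition as literal text.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["synset"], row["definition"]) for row in reader]


proposals = load_ili_proposals(io.StringIO(PROPOSALS_TSV))
for synset_id, definition in proposals:
    print(synset_id, "->", definition)
```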

Furthermore, we cannot express in a DTD that <ILIDefinition> is required when the ili attribute has value "in" and is forbidden (?) otherwise. It's just not a good fit.
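
Since the DTD cannot state this constraint, any tool that wants it enforced would need a separate check. Here is a minimal Python sketch of one, treating `<ILIDefinition>` as forbidden when ili is not "in", which is itself an assumption (hence the "(?)" above). The lexicon content, the IDs, and the function name `check_ili_definitions` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up lexicon fragment; all ids, ili values, and definitions are
# hypothetical.
SYNSETS_XML = """
<Lexicon id="example">
  <Synset id="example-1-n" ili="in">
    <ILIDefinition>a made-up definition for a proposed concept</ILIDefinition>
  </Synset>
  <Synset id="example-2-n" ili="in" />
  <Synset id="example-3-n" ili="i12345">
    <ILIDefinition>a definition that should not be here</ILIDefinition>
  </Synset>
  <Synset id="example-4-n" ili="" />
</Lexicon>
"""


def check_ili_definitions(lexicon):
    """Return (synset_id, message) pairs for synsets that break the rule."""
    problems = []
    for synset in lexicon.iter("Synset"):
        proposed = synset.get("ili") == "in"
        has_definition = synset.find("ILIDefinition") is not None
        if proposed and not has_definition:
            problems.append((synset.get("id"), "ili='in' but no ILIDefinition"))
        elif has_definition and not proposed:
            problems.append((synset.get("id"), "ILIDefinition without ili='in'"))
    return problems


problems = check_ili_definitions(ET.fromstring(SYNSETS_XML))
for synset_id, message in problems:
    print(synset_id, message)
```

This is exactly the kind of special-casing outside the schema that I would like to avoid.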

goodmami commented 3 years ago

Also note that I've updated the original issue text. I added some attributes as candidates for removal. I understand that they had some original purpose but I don't see evidence of their use, so it's worth discussing whether they can be removed. Generally, though, these attributes are relatively simple to model in the database and they can just not appear in the XML when unused, but they can still cause surprises (e.g., see here).