globalwordnet / schemas

WordNet-LMF formats
https://globalwordnet.github.io/schemas/
DTD 1.1 #52

Open arademaker opened 3 years ago

arademaker commented 3 years ago

Why is partOfSpeech an attribute of Lemma and not an attribute of LexicalEntry?

arademaker commented 3 years ago

Why does a lexicalEntry have a canonicalForm and otherForm in the lemon RDF vocab, while in the XML a LexicalEntry has a Lemma and one or more Form?

jmccrae commented 3 years ago

These are all related to the original formats.

The modelling of partOfSpeech on Lemma is due to Kyoto-LMF: http://kyoto-project.eu/xmlgroup.iit.cnr.it/kyoto/index6bfa.html?option=com_content&view=article&id=143&Itemid=129 Personally, I find this weird but there is no technical reason to change this.

I would see canonicalForm in OntoLex as equivalent to Lemma in LMF and otherForm as equivalent to Form from my understanding of these models.

arademaker commented 3 years ago

Thank you. I thought we could change the schemas and DTDs in this repo freely. It would make sense to use the same terminology in both if possible.

jmccrae commented 3 years ago

We can change things of course, but there need to be good reasons to make changes, given the precedents of previous formats. I guess I can close this issue?

arademaker commented 3 years ago

Does no other GWA member want to comment?

I would vote to adapt the XML and RDF schemas to a single terminology.

goodmami commented 3 years ago

I'm not a voting member but I'll add that I agree with John. We may not live in the best possible world, but we shouldn't break backward compatibility only for (effectively) aesthetics. You may be interested in #43, however.

arademaker commented 3 years ago

> but we shouldn't break backward compatibility only for (effectively) aesthetics.

The problem is how to define whether a modification is only aesthetic. But fine, good to have more opinions.

Thank you for the link to the other issue.

goodmami commented 3 years ago

> The problem is how to define if a modification is only aesthetic.

Fair point. For me, I'd ask if the change allows us to do something we couldn't do before, or prevents us from doing something we could do before. If not, it's aesthetic (or "non-functional", etc.).

For example, partOfSpeech on <Lemma> is effectively the same as putting it on <LexicalEntry> because every lexical entry must have exactly one lemma, so there will always be one partOfSpeech within a lexical entry. Moving it to be an attribute of <LexicalEntry> wouldn't change this. The gray area here is that one could argue that when it's on <Lemma> it is not clear that the part of speech also pertains to any other <Form> elements (i.e., siblings of the <Lemma>) within the <LexicalEntry>, but that's a matter of interpretation.
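As an illustration of the point above, here is a minimal Python sketch that moves partOfSpeech from <Lemma> up to <LexicalEntry> and back again, showing that no information is gained or lost. The XML fragment, the ids and the helper names (`lift_pos`, `lower_pos`, `pos_by_entry`) are made up for this example and are not part of the schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the spirit of WN-LMF; ids and values are made up.
SAMPLE = """
<Lexicon id="example" language="en">
  <LexicalEntry id="w1">
    <Lemma writtenForm="run" partOfSpeech="v"/>
    <Form writtenForm="ran"/>
  </LexicalEntry>
  <LexicalEntry id="w2">
    <Lemma writtenForm="dog" partOfSpeech="n"/>
  </LexicalEntry>
</Lexicon>
"""


def lift_pos(lexicon):
    """Move partOfSpeech from the single <Lemma> up to its <LexicalEntry>."""
    for entry in lexicon.iter("LexicalEntry"):
        lemmas = entry.findall("Lemma")
        # The argument relies on exactly one <Lemma> per entry.
        if len(lemmas) != 1:
            raise ValueError(f"{entry.get('id')}: expected exactly one Lemma")
        entry.set("partOfSpeech", lemmas[0].attrib.pop("partOfSpeech"))
    return lexicon


def lower_pos(lexicon):
    """Inverse step: move partOfSpeech from <LexicalEntry> back to <Lemma>."""
    for entry in lexicon.iter("LexicalEntry"):
        entry.find("Lemma").set("partOfSpeech", entry.attrib.pop("partOfSpeech"))
    return lexicon


def pos_by_entry(lexicon, on_entry):
    """Map entry id -> part of speech, read from either location."""
    return {
        e.get("id"): (e if on_entry else e.find("Lemma")).get("partOfSpeech")
        for e in lexicon.iter("LexicalEntry")
    }


lexicon = ET.fromstring(SAMPLE)
before = pos_by_entry(lexicon, on_entry=False)
after = pos_by_entry(lift_pos(lexicon), on_entry=True)
restored = pos_by_entry(lower_pos(lexicon), on_entry=False)
```

The three mappings are identical, which is the sense in which the move is "non-functional". The sketch says nothing about whether the part of speech is meant to apply to sibling <Form> elements; that remains a matter of interpretation.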

1313ou commented 3 years ago

I more or less agree with you.

Instead of the vague "what you can do and can't", I'd suggest reasoning in terms of information.

Some changes are indeed cosmetic such as renames (no info brought in or removed). Others, while not affecting the quantity of information, affect

In this case, the PartOfSpeech attribute trickles down from the file's name to Lemmas and Synsets where it hardly brings new information (except for the tricky adjectives which can split into a or s). Of course you need it after merging but it can be derived and recorded then. Also, we want maintenance scripts to find it suspicious for wn-noun files to contain Lemmas with verb parts-of-speech.
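A maintenance check of the kind mentioned above could look roughly like the following Python sketch. The function name `suspicious_lemmas`, the sample content and the way the expected parts of speech are supplied are assumptions made for illustration, not an existing script.

```python
import xml.etree.ElementTree as ET

# Hypothetical content of a noun-only file that contains one stray verb lemma.
NOUN_FILE = """
<Lexicon id="example-noun" language="en">
  <LexicalEntry id="w1">
    <Lemma writtenForm="dog" partOfSpeech="n"/>
  </LexicalEntry>
  <LexicalEntry id="w2">
    <Lemma writtenForm="run" partOfSpeech="v"/>
  </LexicalEntry>
</Lexicon>
"""


def suspicious_lemmas(xml_text, expected_pos):
    """Return (writtenForm, partOfSpeech) for lemmas outside expected_pos."""
    allowed = set(expected_pos)
    root = ET.fromstring(xml_text)
    return [
        (lemma.get("writtenForm"), lemma.get("partOfSpeech"))
        for lemma in root.iter("Lemma")
        if lemma.get("partOfSpeech") not in allowed
    ]


# An adjective file would pass {"a", "s"} to allow both adjective values.
flagged = suspicious_lemmas(NOUN_FILE, {"n"})
```

Here `flagged` lists only the verb lemma, which is what a per-file consistency check would report.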

It is assumed that PartOfSpeech is propagated up from the (unique) Lemma to the LexicalEntry if need be, because we don't want to repeat it at both levels. But it is a "matter of interpretation", as you say, because inheritance does not usually flow from child to parent.

I've already argued that LexicalEntry and Lemma are in a one-to-one relation and that the tags should be merged. We don't need them separate. The current discussion only illustrates this point, and it is virtually endless: either the PartOfSpeech is propagated down from parent to child or propagated up from unique child to parent.

Who cares? But one may question whether we should have a parent-child pair here.

goodmami commented 3 years ago

@1313ou thanks for the further thoughts. While I only meant my definition as an informal rule of thumb, I agree that framing it in terms of encoded information instead of capabilities is better.

I'm not convinced that merging <Lemma> and <LexicalEntry> is a good solution because, for instance, what do we do if the <Lemma> has <Tag> child elements? Do they become siblings to other <Form> elements? That doesn't seem better to me.

1313ou commented 3 years ago

Are they distinct entities? Is it incorrect to say a lemma 1- has (i.e. is realized as) a number of forms and 2- has a number of senses (i.e. is a member of a number of synsets)? (I leave aside syntactic behaviour, a non-issue here.)

goodmami commented 3 years ago

> Are they distinct entities?

It may help if we think of <LexicalEntry> as representing an abstract lexeme with some set of realized forms, one of which is distinguished as the canonical form, or lemma. With this in mind I think the current situation is good, except for the placement of partOfSpeech.

> Is it incorrect to say a lemma 1- has (i.e. is realized as) a number of forms and

I think that is incorrect as the lemma is a realized form. It's just the canonical/dictionary/citation form. Also, not all wordnets use <Lemma>/<Form> to encode inflectional variants; namely the Japanese Wordnet, which uses it to encode alternative orthographies of the lemma.

> 2- has a number of senses (i.e. is a member of a number of synsets)? (I leave aside syntactic behaviour, a non-issue here)

I wonder if we're talking about different things, as this seems backwards. The senses shouldn't change for alternative forms of the same lexical entry, but we could imagine that the syntactic behaviour could change (e.g., plural nouns in English not requiring a determiner). Currently we do not have a way to encode relationships between <SyntacticBehaviour> and specific forms, though.

1313ou commented 3 years ago

> > Is it incorrect to say a lemma has (i.e. is realized as) a number of forms
>
> I think that is incorrect as the lemma is a realized form. It's just the canonical/dictionary/citation form.

I'll give you that, though the DTD fails to capture this inheritance: it just copies the element definitions. Both have Pronunciations, and Tags.

I should have said 'is inflected as' or dropped the 'i.e. ..' altogether. But as you note, a lemma acts as a name ("citation"), so it stands for what it names.

Having a parent and a unique child is aesthetic in your terms. It doesn't add information. But it is ineffective in that it scatters information and more steps are required to retrieve it.

Keeping them separate would make (more) sense if multiple lemmas were allowed for a lexical entry (for instance color + colour, realize + realise), following the practice of most dictionaries. The LexicalEntry tag could then group these lemmas and give substance to the feeling that they refer to one and the same entity. The current DTD leaves no option but to have separate lexical entries that are grouped only through synset membership.

Mine is a database-design principle, as often here, that seeks effectiveness, but I grant you that a point of view based on fine-grained concepts is also legitimate.

SyntacticBehaviour/SyntacticBehavior

As I advocated elsewhere, SyntacticBehaviour is attached to senses. As such it shouldn't be here in the first place, but further down, under the Sense tag.

Added to that, the current DTD definition can't distinguish between reference and definition. So it merges them into one tag with:

<!ATTLIST SyntacticBehaviour
  id ID #IMPLIED
  subcategorizationFrame CDATA #REQUIRED
  senses IDREFS #IMPLIED>

This makes it mandatory to repeat 'Somebody ----s somebody' 4525 times throughout the English WordNet database for instance. And it's too permissive because it fails to capture that either id OR senses is required.

Otherwise, if you want a bag to put just about anything in, this is the perfect fit.
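Since the DTD cannot express either point, both could be checked outside it. The following is a rough Python sketch, assuming the claim is read as "at least one of id or senses must be present". The sample data and the function name `check_behaviours` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; ids, lemmas and sense references are made up.
SAMPLE = """
<Lexicon id="example" language="en">
  <LexicalEntry id="w1">
    <Lemma writtenForm="see" partOfSpeech="v"/>
    <SyntacticBehaviour subcategorizationFrame="Somebody ----s somebody" senses="s1"/>
  </LexicalEntry>
  <LexicalEntry id="w2">
    <Lemma writtenForm="hear" partOfSpeech="v"/>
    <SyntacticBehaviour subcategorizationFrame="Somebody ----s somebody" senses="s2"/>
    <SyntacticBehaviour subcategorizationFrame="Somebody ----s"/>
  </LexicalEntry>
</Lexicon>
"""


def check_behaviours(xml_text):
    """Return (frames repeated more than once, frames lacking id and senses)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    missing = []
    for sb in root.iter("SyntacticBehaviour"):
        frame = sb.get("subcategorizationFrame")
        counts[frame] += 1
        # Assumed rule: an element must carry an id, a senses list, or both.
        if sb.get("id") is None and sb.get("senses") is None:
            missing.append(frame)
    repeated = {frame: n for frame, n in counts.items() if n > 1}
    return repeated, missing


repeated, missing = check_behaviours(SAMPLE)
```

`repeated` shows how often the same frame string is spelled out in full, and `missing` lists elements that the DTD accepts even though they carry neither an id nor a senses reference.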