Open arademaker opened 3 years ago
Why, in the lemon RDF vocab, does a lexicalEntry have a canonicalForm and otherForm, but in the XML a lexicalEntry has a lemma and one or more Form?
These are all related to the original formats.
The modelling of partOfSpeech on Lemma is due to Kyoto-LMF: http://kyoto-project.eu/xmlgroup.iit.cnr.it/kyoto/index6bfa.html?option=com_content&view=article&id=143&Itemid=129
Personally, I find this weird but there is no technical reason to change this.
I would see canonicalForm in OntoLex as equivalent to Lemma in LMF and otherForm as equivalent to Form from my understanding of these models.
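The correspondence described above can be sketched in a few lines of Python. This is only an illustration: the mapping table restates the commenter's reading of the two models rather than any official conversion, and the sample entry, its id, and the writtenForm attribute are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed correspondence, following the comment above; not an official mapping.
LMF_TO_ONTOLEX = {"Lemma": "canonicalForm", "Form": "otherForm"}

# Hypothetical entry; the id and writtenForm values are made up for illustration.
SAMPLE = """
<LexicalEntry id="ex-go-v">
  <Lemma writtenForm="go" partOfSpeech="v"/>
  <Form writtenForm="went"/>
  <Form writtenForm="gone"/>
</LexicalEntry>
"""


def forms_as_ontolex(entry):
    """Return (entry id, OntoLex property, written form) triples for one entry."""
    triples = []
    for child in entry:
        prop = LMF_TO_ONTOLEX.get(child.tag)
        if prop is not None:  # skip children that are not Lemma or Form
            triples.append((entry.get("id"), prop, child.get("writtenForm")))
    return triples


triples = forms_as_ontolex(ET.fromstring(SAMPLE))
for triple in triples:
    print(triple)
```

Under this reading, the single Lemma becomes the one canonicalForm and every Form becomes an otherForm, so the two vocabularies differ in names but not in structure.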
Thank you, I thought we could change the schemas and DTDs in this repo freely. It would make sense to use the same terminology on both if possible.
We can change things of course, but there need to be good reasons to make changes, given the precedents of previous formats. I guess I can close this issue?
Does no other GWA member want to comment?
I would vote to adapt the XML and RDF schemas to a single terminology.
I'm not a voting member but I'll add that I agree with John. We may not live in the best possible world, but we shouldn't break backward compatibility only for (effectively) aesthetics. You may be interested in #43, however.
but we shouldn't break backward compatibility only for (effectively) aesthetics.
The problem is how to define whether a modification is only aesthetic. But fine, good to have more opinions.
Thank you for the link to the other issue.
The problem is how to define if a modification is only aesthetic.
Fair point. For me, I'd ask if the change allows us to do something we couldn't do before, or prevents us from doing something we could do before. If not, it's aesthetic (or "non-functional", etc.).
For example, partOfSpeech on <Lemma> is effectively the same as putting it on <LexicalEntry> because every lexical entry must have exactly one lemma, so there will always be one partOfSpeech within a lexical entry. Moving it to be an attribute of <LexicalEntry> wouldn't change this. The gray area here is that one could argue that when it's on <Lemma> it is not clear that the part of speech also pertains to any other <Form> elements (i.e., siblings of the <Lemma>) within the <LexicalEntry>, but that's a matter of interpretation.
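To make the equivalence concrete, here is a minimal Python sketch that reads the part of speech of a lexical entry from its single <Lemma>. The sample entry and the writtenForm attribute are assumptions for illustration, and entry_pos is a hypothetical helper, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry, used only to illustrate the lookup.
SAMPLE = """
<LexicalEntry id="ex-go-v">
  <Lemma writtenForm="go" partOfSpeech="v"/>
  <Form writtenForm="went"/>
</LexicalEntry>
"""


def entry_pos(entry):
    """Return the part of speech of an entry, read from its one Lemma."""
    lemmas = entry.findall("Lemma")
    if len(lemmas) != 1:
        # The argument above relies on there being exactly one Lemma per entry.
        raise ValueError("expected exactly one Lemma, found %d" % len(lemmas))
    return lemmas[0].get("partOfSpeech")


pos = entry_pos(ET.fromstring(SAMPLE))
print(pos)
```

Because the lookup always succeeds for a valid entry, storing the attribute on <LexicalEntry> instead would encode the same information; only the retrieval path changes.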
I more or less agree with you.
Instead of the vague "what you can do and can't", I'd suggest reasoning in terms of information.
Some changes are indeed cosmetic such as renames (no info brought in or removed). Others, while not affecting the quantity of information, affect
In this case, the PartOfSpeech attribute trickles down from the file's name to Lemmas and Synsets where it hardly brings new information (except for the tricky adjectives which can split into a or s). Of course you need it after merging but it can be derived and recorded then. Also, we want maintenance scripts to find it suspicious for wn-noun files to contain Lemmas with verb parts-of-speech.
It is assumed that PartOfSpeech is propagated up from (unique) Lemma to LexicalUnit if need be. Because we don't want to repeat it at both levels. But it is a "matter of interpretation" as you say because inheritance does not usually flow from child to parent.
I've already argued that LexicalEntry and Lemma are in a one-to-one relation and that the tags should be merged. We don't need them separate. The current discussion only illustrates the point I am making and is virtually endless: either the PartOfSpeech is propagated down from parent to child or propagated up from unique child to parent.
Who cares? But one may question whether we should have a parent-child pair here.
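The maintenance check mentioned above (a wn-noun file containing Lemmas with verb parts of speech) could look roughly like the following Python sketch. The file-naming scheme beyond "wn-noun", the one-letter tags other than a and s, the <Lexicon> wrapper, and the sample data are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the category in the file name to allowed tags.
# Only "a" and "s" for adjectives come from the discussion; the rest is assumed.
EXPECTED_POS = {"noun": {"n"}, "verb": {"v"}, "adj": {"a", "s"}, "adv": {"r"}}

# Hypothetical file content.
SAMPLE = """
<Lexicon>
  <LexicalEntry id="e1"><Lemma writtenForm="dog" partOfSpeech="n"/></LexicalEntry>
  <LexicalEntry id="e2"><Lemma writtenForm="run" partOfSpeech="v"/></LexicalEntry>
</Lexicon>
"""


def suspicious_lemmas(file_name, xml_text):
    """List (written form, part of speech) pairs that do not fit the file name."""
    stem = file_name.rsplit(".", 1)[0]   # "wn-noun.xml" -> "wn-noun"
    category = stem.split("-")[-1]       # "wn-noun" -> "noun"
    allowed = EXPECTED_POS.get(category)
    if allowed is None:
        return []                        # unknown category: nothing to check
    root = ET.fromstring(xml_text)
    return [
        (lemma.get("writtenForm"), lemma.get("partOfSpeech"))
        for lemma in root.iter("Lemma")
        if lemma.get("partOfSpeech") not in allowed
    ]


flagged = suspicious_lemmas("wn-noun.xml", SAMPLE)
print(flagged)
```

A check like this works only while the attribute is recorded per Lemma, which is one argument for keeping it in the per-category files even where it can be derived from the file name.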
@1313ou thanks for the further thoughts. While I only meant my definition as an informal rule of thumb, I agree that framing it in terms of encoded information instead of capabilities is better.
I'm not convinced that merging <Lemma> and <LexicalEntry> is a good solution because, for instance, what do we do if the <Lemma> has <Tag> child elements? Do they become siblings to other <Form> elements? That doesn't seem better to me.
Are they distinct entities? Is it incorrect to say a lemma 1- has (i.e. is realized as) a number of forms and 2- has a number of senses (i.e. is a member of a number of synsets)? (I leave aside syntactic behaviour, a non-issue here)
Are they distinct entities?
It may help if we think of <LexicalEntry> as representing an abstract lexeme with some set of realized forms, one of which is distinguished as the canonical form, or lemma. With this in mind I think the current situation is good, except for the placement of partOfSpeech.
Is it incorrect to say a lemma 1- has (i.e. is realized as) a number of forms and
I think that is incorrect as the lemma is a realized form. It's just the canonical/dictionary/citation form. Also, not all wordnets use <Lemma>/<Form> to encode inflectional variants; namely the Japanese Wordnet, which uses it to encode alternative orthographies of the lemma.
2- has a number of senses (i.e. is a member of a number of synsets)? (I leave aside syntactic behaviour, a non-issue here)
I wonder if we're talking about different things, as this seems backwards. The senses shouldn't change for alternative forms of the same lexical entry, but we could imagine that the syntactic behaviour could change (e.g., plural nouns in English not requiring a determiner). Currently we do not have a way to encode relationships between <SyntacticBehaviour> and specific forms, though.
Is it incorrect to say a lemma has (i.e. is realized as) a number of forms
I think that is incorrect as the lemma is a realized form. It's just the canonical/dictionary/citation form.
I'll give you that, though the DTD fails to capture this inheritance: it just copies the element definitions. Both have Pronunciations, and Tags.
I should have said 'is inflected as' or dropped the 'i.e. ..' altogether. But as you note, a lemma acts as a name ("citation"), so it stands for what it names.
Having a parent and a unique child is aesthetic in your terms. It doesn't add information. But it is ineffective in that it scatters information and more steps are required to retrieve it.
Not collapsing them would make (more) sense if multiple lemmas were allowed for a lexical entry (for instance color + colour, realize + realise), following what most dictionaries do. The LexicalEntry tag could then group these lemmas and give substance to the feeling that they refer to one and the same entity. The current DTD leaves no option but to have separate multiple lexical entries that are grouped through synset membership.
Mine is a database-design principle, as often here, that seeks effectiveness but I can grant you a point of view based on fine-grained concepts is also legitimate.
SyntacticBehaviour/SyntacticBehavior
As I advocated elsewhere SyntacticBehaviour is attached to senses. As such it shouldn't be here in the first place, but further down, under the Sense tag.
Added to that, the current DTD definition can't make a difference between reference and definition. So it merges them into one tag with
<!ATTLIST SyntacticBehaviour
id ID #IMPLIED
subcategorizationFrame CDATA #REQUIRED
senses IDREFS #IMPLIED>
This makes it mandatory to repeat 'Somebody ----s somebody' 4525 times throughout the English WordNet database for instance. And it's too permissive because it fails to capture that either id OR senses is required.
Otherwise, if you want a bag to put just about anything, here is the perfect fit.
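The "either id OR senses" constraint that the ATTLIST above cannot express can be checked outside the DTD. The following Python sketch shows one way to do so; the sample entry, the id and sense values, and the helper name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: a definition (with id), a use (with senses), and an
# element with neither, which the DTD accepts because both are #IMPLIED.
SAMPLE = """
<LexicalEntry id="e1">
  <SyntacticBehaviour id="frame-1" subcategorizationFrame="Somebody ----s somebody"/>
  <SyntacticBehaviour subcategorizationFrame="Somebody ----s somebody" senses="s1 s2"/>
  <SyntacticBehaviour subcategorizationFrame="Somebody ----s"/>
</LexicalEntry>
"""


def behaviours_missing_id_and_senses(root):
    """Return the frames of SyntacticBehaviour elements with neither id nor senses."""
    return [
        sb.get("subcategorizationFrame")
        for sb in root.iter("SyntacticBehaviour")
        if sb.get("id") is None and sb.get("senses") is None
    ]


missing = behaviours_missing_id_and_senses(ET.fromstring(SAMPLE))
print(missing)
```

The first two sample elements also show the repetition complained about above: the frame string has to be spelled out in full on each element because subcategorizationFrame is #REQUIRED.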
Why is partOfSpeech an attribute of the Lemma and not an attribute of lexicalEntry?