globalwordnet / schemas

WordNet-LMF formats
https://globalwordnet.github.io/schemas/

XSD 1.1 schema #45

Open 1313ou opened 3 years ago

1313ou commented 3 years ago

XSD version of the release 1.1 schema. It needs more testing, but the migrated current english-wordnet data validates against it.

PS: The migrated english-wordnet data is obtained this way:

sed 's|http://purl.org/dc/elements/1.1/|https://globalwordnet.github.io/schemas/dc/|g' "$xml"
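For readers who prefer not to shell out to sed, the same migration can be sketched in Python. This is only an illustration under stated assumptions: the function name `migrate_dc_namespace` and the one-line `SAMPLE` document are hypothetical and not part of the PR. The replacement is a literal string substitution, which gives the same result as the sed expression above for this particular URI.

```python
# Sketch only: SAMPLE and the function name are illustrative, not from the PR.
OLD_DC = "http://purl.org/dc/elements/1.1/"
NEW_DC = "https://globalwordnet.github.io/schemas/dc/"


def migrate_dc_namespace(xml_text: str) -> str:
    """Replace every occurrence of the old DC namespace URI with the new one."""
    return xml_text.replace(OLD_DC, NEW_DC)


# Hypothetical minimal document that declares the old namespace.
SAMPLE = '<LexicalResource xmlns:dc="http://purl.org/dc/elements/1.1/"/>'
MIGRATED = migrate_dc_namespace(SAMPLE)
print(MIGRATED)
```

Like the sed command, this rewrites the text of the file without parsing the XML, so it would also change the URI if it happened to appear inside element content.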

goodmami commented 3 years ago

Thanks! But I'm a bit confused about the goal of this PR. Firstly, while this PR is ostensibly about introducing an XSD schema for WN-LMF, it also replaces the top-level README with something about the Extended English WordNet pipeline, which is unrelated. Similarly, there are some .xsd files which appear to be relevant only for PWN or EWN which, I would think, should be managed by those respective projects (perhaps OMW in the case of the WN-LMF release of PWN).

Also:

1313ou commented 3 years ago

it also replaces the top-level README with something about the Extended English WordNet pipeline, which is unrelated

True, this was imported by mistake (because it shares the same fork as the XEWN schemas). This has been fixed in commit https://github.com/globalwordnet/schemas/pull/45/commits/defccfc62811ec616ba79ac6129e7dcb0a831353

Is this related to #10

Yes (it uses the same modularity and philosophy) and no (it validates different data).

XSD is more powerful than DTD

DTDs are outdated (they survive but should be ditched). I'm not going to repeat the literature here. Suffice it to say that XSD introduces types. For example, pronunciation data could be typed as IPA, so that anything that is not IPA would be rejected (by comparison, CDATA does not validate anything).
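The difference between untyped CDATA and a typed value can be sketched in a few lines of Python. This is a rough analogy, not the schema in this PR: the regular expression plays the role of an XSD pattern restriction, and the character class is a hypothetical, deliberately tiny subset of IPA chosen for illustration. A real type would need the full IPA inventory.

```python
import re

# Hypothetical subset of IPA symbols (assumption for illustration only):
# lowercase Latin letters, a few IPA letters, stress marks, length mark, dot.
IPA_PATTERN = re.compile("[a-zæðŋɑɔəɛɪʃʊʌʒθˈˌː.]+")


def is_valid_cdata(value: str) -> bool:
    """DTD-style CDATA: any string is accepted, so nothing is checked."""
    return True


def is_valid_ipa(value: str) -> bool:
    """Typed check: the whole value must match the pattern."""
    return IPA_PATTERN.fullmatch(value) is not None


for candidate in ["ˈkæt", "k@t"]:
    print(candidate, is_valid_cdata(candidate), is_valid_ipa(candidate))
```

Under these assumptions, a SAMPA-like string such as `k@t` passes the CDATA check but fails the typed one, which is the kind of error a stricter schema can surface.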

I'm quite happy with the validation as you want it to be. Using a stricter one has proved helpful to the projects I am conducting (and has raised errors in the current one that had otherwise gone unnoticed).

Besides, I see validation as a means of ensuring data coherence, not as the description of a form per se. You may achieve it in different ways; you may want different degrees of coherence (WN and EWN differ in the admissible characters in lemmas, EWN so far having ASCII + oddities); you may want to define supersets and subsets; you may want to have various extension mechanisms (yours should be an optional separate DTD, not a mandatory core (1)).

In a word, it doesn't have to be unique (2).

(1) BTW, I am not too keen on importing external data in a way that is not IDREF.
(2) Hence the building blocks.

jmccrae commented 3 years ago

The XSD schema validation is likely very useful.

I don't think we should be supporting specific validation for individual projects like Open English WordNet and Princeton WordNet. In particular, we have not yet fully decided how EWN will adopt this schema update, so we risk this becoming out of date.