globalwordnet / schemas

WordNet-LMF formats
https://globalwordnet.github.io/schemas/

LexicalEntry ids #55

Open arademaker opened 3 years ago

arademaker commented 3 years ago

In the DTDs, a LexicalEntry has an identifier defined as https://github.com/globalwordnet/schemas/blob/master/WN-LMF-1.1.dtd#L35

The type ID, https://www.w3.org/TR/REC-xml/#id, is quite restricted and can potentially be an issue for words in other languages with accents, etc. Nevertheless, I want to preserve legibility and avoid creating extra artificial ids. Ideally, I would like a 1-1 relation with the URI used in the RDF encoding. But we can use % escapes in URIs. Any ideas?

goodmami commented 3 years ago

For the XML files I think we should follow XML conventions and use ID for ids. This may also be necessary for validation tools to ensure, e.g., that IDs are unique in a document and that IDREF targets are present. There is no interpretable meaning within the ID strings, and using forms that look like lemmas is only a convenience for human annotators. The actual forms are in <Lemma> and <Form>.

If you must have a LexicalEntry ID be an accurate representation of the lemma, you might try using Punycode (update: see comments below) as it is ASCII-only and might fit in XML's range for IDs. Since IDs cannot start with hyphens, numbers, etc., you'll need to give it an appropriate prefix, which is the recommendation anyway for WN-LMF. The downside of this method is that it won't necessarily be legible to a human. E.g., for fácil you might have own-pt-fcil-5na.
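For reference, that example ID can be reproduced with Python's built-in punycode codec; this is just a sketch of how the ID above was derived (the prefix and variable names are illustrative):

```python
# Punycode moves the non-ASCII characters of a string into an ASCII-only
# suffix after a hyphen, so "fácil" becomes "fcil-5na".
encoded = "fácil".encode("punycode").decode("ascii")
word_id = f"own-pt-{encoded}"  # prefix with the lexicon ID, per WN-LMF advice
print(word_id)  # own-pt-fcil-5na
```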

arademaker commented 3 years ago

Thank you, Michael. We were considering using a hash of the lemmas, but Punycode seems more robust. It gives a 1-1 correspondence with the lemma, which may be useful for validation.

goodmami commented 3 years ago

Actually, what was the problem with putting the unicode lemmas directly in the ID value, such as own-pt-fácil-a? I don't think that's disallowed, but the suggestions for names mention avoiding easily confusable sequences, like combining characters when a composed character exists (e.g., a + ◌́ instead of á).

jmccrae commented 3 years ago

Accented characters can be used in XML IDs so I don't really see an issue here.

The use of IDs also provides some extra validation to the DTD, namely that IDs are unique and that all references to the IDs actually exist.

1313ou commented 3 years ago

English WordNet 2021 sense IDs conform to the old ID definition but not the more recent xsd:ID. Edit: Sorry I realize now this is more relevant to English WordNet: https://github.com/globalwordnet/english-wordnet/issues/749

1313ou commented 3 years ago

@arademaker, did you consider that hashes are one-way (you cannot recover what you hashed), so in the end they are not legible? The same applies, to a lesser extent, to '_'-substitutions for a number of off-limits characters.

1313ou commented 3 years ago

@goodmami, Punycode is rather English-centric and may be very cumbersome when more than one character cannot be reduced to ASCII.

goodmami commented 3 years ago

@1313ou I'd say it's English-centric only in that it's ASCII-based, but so are some other languages, e.g., Malay or Rotokas. In any case, having looked closer at the XML spec, I suggested in my second comment above that Punycode, or any such encoding, is not necessary as the accented characters can be used in IDs. To be clear, I no longer recommend using it for this purpose, and I've edited my comment above to make this more obvious.

I do suggest that we add some text to the page for ID suggestions, or maybe even a Javascript-based validator. All we have currently that I can see is:

All synsets must have an ID that starts with ID of the lexicon followed by a dash, e.g., example-en + - + local_synset_id.

The lexicon ID prefix is probably good advice for lexical entries as well because we might have lexical entries for digits or something else that shouldn't appear as the initial character in an XML ID. This means we should have recommendations for lexicon IDs (e.g., that it follows xsd:ID). I'm not sure if RDF has any similar encoding constraints, but those should be taken into account for these recommendations as well.

1313ou commented 3 years ago

I don't think the global schema must define IDs beyond the requirement that they be valid xsd:IDs. What's the problem with letting each wordnet define what they look like? The basic reason is that IDs are functionally opaque (and as such should not be parsed), even if it's nice for the lexicographer to recognize something in them. So "recommendations" is the right word.

goodmami commented 3 years ago

Right, I'm only suggesting that we write some "recommendations". Even if the current text says "...synsets must have...", it might be better to change that must to should. These recommendations are just to help ensure the lexicons can be validated correctly. Otherwise wordnet authors should be free to design their own conventions.

fredsonaguiar commented 3 years ago

Regarding the discussion: indeed, as @jmccrae said,

Accented characters can be used in XML IDs so I don't really see an issue here.

The problem occurs for some other characters, such as &;()+º',?–!’\, found in OWN-PT. For instance: vapor d’água from 15055442-n; from 02202047-s; Jack, o Estripador from 11077369-n; Miltiade? from 11180952-n.

fredsonaguiar commented 3 years ago

At first, the option was, after replacing spaces with underscores, to apply some other substitutions, as follows:

        # formatting lexical_entry: build an ID from the written form
        written_form_ = written_form.replace(" ", "_")
        word_id = f"word-{written_form_}-{part_of_speech}"

        # collapse off-limits punctuation into underscores
        for char in "&;()+º',?–!’":
            word_id = word_id.replace(char, "_")
        # slashes become colons
        word_id = word_id.replace("/", ":")

But we'd like to avoid this ad-hoc solution: in the future, a new character could break the code.

fredsonaguiar commented 3 years ago

Again, it makes sense to have a global (not depending on a specific language or environment) and reversible mapping instead of generating a hash or a random ID:

The use of IDs also provides some extra validation to the DTD, namely that IDs are unique and that all references to the IDs actually exist.

An option is to use the UTF-8 hexadecimal encoding of the lemma, with the part of speech appended for uniqueness. In this case, for the earlier example "Jack, o Estripador", from 11077369-n, we generate the ID word-4a61636b2c206f2045737472697061646f72-n
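That encoding is what Python's bytes.hex() produces, and it is trivially reversible; a minimal sketch (function names are illustrative):

```python
def form_to_hex_id(written_form: str, pos: str) -> str:
    """Build a reversible ID from the UTF-8 hex encoding of the written form."""
    return f"word-{written_form.encode('utf-8').hex()}-{pos}"

def hex_id_to_form(word_id: str) -> str:
    """Recover the written form: take the hex part between the dashes and decode."""
    hex_part = word_id.split("-")[1]
    return bytes.fromhex(hex_part).decode("utf-8")

print(form_to_hex_id("Jack, o Estripador", "n"))
# word-4a61636b2c206f2045737472697061646f72-n
```

The downside, as noted above for hashes, is that the result is not legible to a human, even though (unlike a hash) it can be decoded.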

What do you think @jmccrae @goodmami @1313ou @arademaker ?

goodmami commented 3 years ago

@FredsoNerd thanks for the additional context. Yes, many punctuation characters are excluded from the NAME production in XML, used by ID, etc., so you'll need some way to handle these. But, for reasons @jmccrae and I outlined above, I don't think the WN-LMF should change the use of ID/IDREF/IDREFS in the DTD. How you deal with these characters is thus up to you (maintainers of OWN-PT), but it would be useful to discuss possibilities so as to develop a set of general recommendations for our schemas.

First, if you collapse multiple punctuation characters into a single replacement character (as you currently do with _), you risk collisions when two entries differ only in those punctuation characters, e.g., 1º and a hypothetical 1+. To help here, you might uniquely enumerate them (own-pt-1_-1-n, own-pt-1_-2-n, etc.), or you might encode the characters uniquely (own-pt-1-ordm-n, own-pt-1-plus-n). The latter option is easier to implement.

Let's also look at some examples from the Open English Wordnet:

So it looks like the OEWN has some ad-hoc rules for replacing those characters. In addition, spaces are replaced with underscores (_) and dashes (hyphens) are used literally. The OEWN mixes shortened name-based escapes (e.g., ap for apostrophe) and hexadecimal (e.g., 003a for colon), but I'd suggest sticking to one scheme and using established names as in XML/HTML escapes, e.g., apos for apostrophe, so you don't need to maintain your own lookup table. Otherwise, the dash-escape-dash pattern isn't so bad, but, to avoid further collisions, you might also escape literal underscores (-lowbar-) and dashes (-dash-). In addition, replace only regular spaces with underscores; other kinds of whitespace (tabs, non-breaking spaces, double-width spaces, newlines, etc.) should be escaped.

To construct an ID, you can then:

  1. Replace disallowed ID characters with the dash-escape-dash patterns
  2. Prefix own-pt- (or some other lexicon ID followed by a dash)
  3. Append a dash and the part-of-speech

To recover the form from the ID, you do those steps in reverse. That is, after stripping the lexicon ID and part of speech and their dashes, all other dash characters indicate escape patterns to be unescaped.
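Those three steps can be sketched as follows; the escape table and function names here are purely illustrative, not part of any schema, and the table would need to cover every off-limits character your lemmas actually contain:

```python
import re

# Illustrative escape table; names follow XML/HTML entity names where one exists.
# Literal '-' and '_' are escaped too, so the mapping stays collision-free.
ESCAPES = {
    "'": "apos", "’": "rsquo", "&": "amp", ";": "semi", "(": "lpar",
    ")": "rpar", "+": "plus", "º": "ordm", ",": "comma", "?": "quest",
    "–": "ndash", "!": "excl", "\\": "bsol", "/": "sol",
    "-": "dash", "_": "lowbar",
}
UNESCAPES = {name: char for char, name in ESCAPES.items()}

def form_to_id(written_form: str, pos: str, lexicon: str = "own-pt") -> str:
    """Steps 1-3: escape, prefix the lexicon ID, append the part of speech."""
    out = []
    for ch in written_form:
        if ch == " ":
            out.append("_")               # only plain spaces become underscores
        elif ch in ESCAPES:
            out.append(f"-{ESCAPES[ch]}-")  # dash-escape-dash pattern
        else:
            out.append(ch)
    return f"{lexicon}-{''.join(out)}-{pos}"

def id_to_form(word_id: str, pos: str, lexicon: str = "own-pt") -> str:
    """The reverse: strip prefix/suffix, restore spaces, unescape the rest."""
    body = word_id[len(lexicon) + 1 : -(len(pos) + 1)]
    body = body.replace("_", " ")  # underscores only ever came from spaces
    return re.sub(r"-([a-z]+)-", lambda m: UNESCAPES[m.group(1)], body)
```

For example, form_to_id("Jack, o Estripador", "n") gives own-pt-Jack-comma-_o_Estripador-n, and id_to_form recovers the original form exactly.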

arademaker commented 3 years ago

but it would be useful to discuss possibilities so as to develop a set of general recommendations for our schemas

This is the plan, and the reason for opening the issue and asking @FredsoNerd to comment here. Of course, I never considered changing the actual ID type in the schema.

Thank you for your suggestions @goodmami. I will discuss them with @FredsoNerd on how to implement.

jmccrae commented 3 years ago

As @goodmami pointed out we have some ad-hoc rules in OEWN for special characters:

https://github.com/globalwordnet/english-wordnet/blob/master/scripts/wordnet_yaml.py#L13

A less 'ad-hoc' approach is to replace them with XML character entities such as &apos;

1313ou commented 3 years ago

@FredsoNerd, I think the global wordnet does not have to impose constraints on specific wordnets. However, xsd:ID well-formedness is required because uniqueness and proper referencing are involved. I have a problem with legacy sensekeys being promoted to IDs, because the colon makes them non-conformant. They can still prove useful in the database and should be kept as an extra field, possibly as an extension.

arademaker commented 3 years ago

A less 'ad-hoc' approach is to replace them with XML character entities such as &apos;

Yes, I like that idea, in line with @goodmami's suggestion too. But & and ; are not allowed in IDs. So we can use -apos- as you do in OEWN, or we could use another symbol that the xsd:ID rules accept, maybe #apos?

goodmami commented 3 years ago

maybe #apos

The # character is also excluded punctuation. There is a small set of explicitly allowed ASCII punctuation. From the XML spec:

Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

That is, :, -, ., _, and ·. As the first character, however, only : and _ are allowed from that set. If we go with the xsd:ID definition, : is also excluded, in any position. Unfortunately, the middle dot is not easy to type (at least on US keyboards), so we're really down to three usable ASCII punctuation characters: -, ., and _.
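A rough way to check candidate IDs against that shape is a simplified NCName pattern. This is only an approximation: the real productions allow many more Unicode letter ranges, and this regex covers ASCII plus Latin-1 letters only.

```python
import re

# Simplified xsd:ID (NCName) shape: no colon anywhere; the only punctuation
# allowed is '-', '.', '_', and MIDDLE DOT, and none of those except '_' may
# come first. Letter coverage is ASCII + Latin-1 only -- an approximation.
NCNAME_ISH = re.compile(
    r"[A-Za-z_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]"
    r"[A-Za-z0-9_.\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF-]*\Z"
)

print(bool(NCNAME_ISH.match("own-pt-fácil-a")))  # True: accents are fine
print(bool(NCNAME_ISH.match("own-pt-1+")))       # False: '+' is excluded
print(bool(NCNAME_ISH.match("1-plus-n")))        # False: cannot start with a digit
```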

arademaker commented 3 years ago

One extra puzzle for me: why does XML use begin/end marks (& and ;) for entities? Could we eventually run into trouble by using only a single mark, as in -apos-?