goodmami / wn

A modern, interlingual wordnet interface for Python
https://wn.readthedocs.io/
MIT License

Validate for empty or ill-formatted definitions and examples #151

Open fcbond opened 2 years ago

fcbond commented 2 years ago

The MCR wordnet candidate had some interesting issues with definitions, although they probably apply more broadly (definitely to examples). I don't think these are bugs, but possibly something we should add a warning for? I think there are two issues, neither of which is illegal XML.

  1. Definition contains only whitespace or is empty:

        <Synset id="spa-30-80001224-n" ili="ili-30-80001224-n">
            <Definition>
    
            </Definition>
        </Synset>

Here we should warn something like 'Definition for synset {ID} is empty, better to omit'.

  2. Definition has whitespace before and after:

        <Synset id="spa-30-80001223-n" ili="ili-30-80001223-n">
            <Definition>
                Pequeña malformación que causa la dilatación y fragilidad vascular del colon, dando como resultado una pérdida intermitente de sangre desde el tracto intestinal.
            </Definition>
        </Synset>

Maybe here we should warn something like 'Definition for synset {ID} contains unnecessary whitespace'.
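
The two checks above could be sketched roughly as follows. This is only an illustration using the standard library's `xml.etree.ElementTree`, not how Wn's validation is implemented; the function name `check_definitions` and the trimmed-down sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample: only the parts of a WN-LMF file needed for the check.
SAMPLE = """\
<Lexicon>
  <Synset id="spa-30-80001224-n" ili="ili-30-80001224-n">
    <Definition>

    </Definition>
  </Synset>
  <Synset id="spa-30-80001223-n" ili="ili-30-80001223-n">
    <Definition>
      Pequeña malformación que causa la dilatación y fragilidad vascular del colon.
    </Definition>
  </Synset>
</Lexicon>
"""


def check_definitions(root):
    """Yield (synset_id, message) for empty or padded definitions.

    Hypothetical helper for illustration only.
    """
    for synset in root.iter("Synset"):
        sid = synset.get("id")
        for definition in synset.iter("Definition"):
            text = definition.text or ""
            if not text.strip():
                # Issue 1: nothing but whitespace (or nothing at all).
                yield sid, f"Definition for synset {sid} is empty, better to omit"
            elif text != text.strip():
                # Issue 2: real content with leading/trailing padding.
                yield sid, f"Definition for synset {sid} contains unnecessary whitespace"


warnings_found = list(check_definitions(ET.fromstring(SAMPLE)))
for _sid, message in warnings_found:
    print(message)
```

The same loop would work for `<Example>` elements by changing the tag name.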

@jmccrae should we add documentation to the https://github.com/globalwordnet/schemas saying the best practice is not to pad, and to omit empty definitions (and examples), or is this too obvious?

@goodmami should we strip the text of padding before adding it to the database?

goodmami commented 2 years ago

Are you talking about warning during validation or during normal use of Wn? To me, the former seems acceptable but not the latter, as this is just bad formatting and not something that makes the data less correct or usable.

In Wn, I try to store in the database an accurate representation of what was in the WN-LMF file, such that exporting the data would result in an equivalent WN-LMF file, so I don't think stripping the definitions is a good solution. However, it would be fine with me if OMW wanted to fix these things during the compilation of its wordnets.

fcbond commented 2 years ago

Hi,

I was indeed thinking of warning during validation.

I think returning \n\t\t\n\t\t\n for the definition of a synset, rather than None, is less correct and does make it less usable. However, as you say, the ideal time to catch this is when the wordnet is made, not when we load.

goodmami commented 2 years ago

> I was indeed thinking of warning during validation.

Ok, good. It wasn't clear, so I changed the title. We could also check for similar whitespace issues in other elements like <ILIDefinition>, <Count>, <Tag>, and <Pronunciation>, or in attribute values, like for writtenForm or subcategorizationFrame.
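
A rough sketch of how such a check might extend to other elements and to attribute values. The element and attribute names come from the comment above (plus `Definition` and `Example` from the original report); the lists are illustrative rather than a complete inventory of WN-LMF, and `whitespace_issues` is a hypothetical helper, not part of Wn.

```python
import xml.etree.ElementTree as ET

# Illustrative lists only; not a complete inventory of WN-LMF.
TEXT_ELEMENTS = ("Definition", "ILIDefinition", "Example", "Count", "Tag", "Pronunciation")
ATTRIBUTES = ("writtenForm", "subcategorizationFrame")

SAMPLE = """\
<Lexicon>
  <LexicalEntry id="e1">
    <Lemma writtenForm=" perro " partOfSpeech="n">
      <Pronunciation> pero </Pronunciation>
    </Lemma>
    <Sense id="s1" synset="ss1"><Count>3</Count></Sense>
  </LexicalEntry>
  <Synset id="ss1"><ILIDefinition>  </ILIDefinition></Synset>
</Lexicon>
"""


def whitespace_issues(root):
    """Return (location, kind) pairs, where kind is 'empty' or 'padded'."""
    issues = []
    for elem in root.iter():  # document order
        if elem.tag in TEXT_ELEMENTS:
            text = elem.text or ""
            if not text.strip():
                issues.append((elem.tag, "empty"))
            elif text != text.strip():
                issues.append((elem.tag, "padded"))
        for name in ATTRIBUTES:
            value = elem.get(name)
            if value is not None and value != value.strip():
                issues.append((f"{elem.tag}@{name}", "padded"))
    return issues


issues = whitespace_issues(ET.fromstring(SAMPLE))
for location, kind in issues:
    print(f"{location}: {kind}")
```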

> I think returning \n\t\t\n\t\t\n for the definition of a synset, rather than None, is less correct and does make it less usable. However, as you say, the ideal time to catch this is when the wordnet is made, not when we load.

Right. It's less correct for the language, but it's an accurate representation of what's in the data. I don't think Wn should be deciding what it thinks a language should look like. The data should do that.

francis-dion commented 2 years ago

My understanding is that, in XML, white space after the opening tag and before the closing tag should be ignored. I didn't trace the original specs, but found multiple references, including this one from Adobe:

> XML ignores the first sequence of white space immediately after the opening tag and the last sequence of white space immediately before the closing tag. XML translates non-space characters (tab and new-line) into a space character and consolidates all multiple space characters into a single space.

If the author of a wordnet wants/needs white space preserved, they should use the xml:space attribute. Here's a quote from O'Reilly's XML Pocket Reference:

> When xml:space is used on an element with a value of preserve, the whitespace in that element's content must be preserved as is by the application that processes it. The whitespace is always passed on to the processing application, but xml:space provides the application with a hint regarding how to process it.

Otherwise, I believe leading/trailing white space should definitely be stripped. I also think (albeit less strongly :-) that wn should be translating non-space characters (tab and new-line) into a space character and consolidating all multiple space characters into a single space.
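
For concreteness, the normalization proposed here could look something like the sketch below. `normalize_text` is a hypothetical name, and returning `None` for empty results is an assumption borrowed from the earlier comment about preferring `None` over a whitespace-only definition. Only the four XML whitespace characters (space, tab, carriage return, line feed) are treated as whitespace, so something like a no-break space is left alone.

```python
import re

# XML 1.0 defines whitespace as space, tab, carriage return, and line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_text(text):
    """Collapse XML whitespace runs to one space and strip both ends.

    Returns None when nothing is left (hypothetical behavior for illustration).
    """
    cleaned = _XML_WS.sub(" ", text).strip(" ")
    return cleaned or None


raw = "\n\t\tPequeña malformación\n\t\tque causa   la dilatación\n\t"
print(repr(normalize_text(raw)))
print(repr(normalize_text("\n\t\t\n\t\t\n")))
```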

goodmami commented 2 years ago

Thanks, @francis-dion, that's a good point. I'd forgotten about xml:space. The W3 spec says this about the default value:

> The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space.

So when xml:space is not specified, it's not that the spacing should be stripped, but that the application should use its default whitespace processing. So, yes, Wn could strip (and normalize) whitespace if xml:space is not present. One issue is if a wordnet author wishes to preserve whitespace. Obviously the answer is to use xml:space on the element, but the WN-LMF spec needs to declare the attribute for it to be used. From the same W3 spec:

> In valid documents, this attribute, like any other, MUST be declared if it is used.
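
As a rough sketch of what "normalize unless xml:space says otherwise" might look like, here is one possibility using `xml.etree.ElementTree`. It is not Wn's implementation: `iter_texts` is a hypothetical helper, and the sample assumes a document in which xml:space is allowed, which, as noted above, would require the WN-LMF DTD to declare it. ElementTree is non-validating, so it accepts the attribute regardless and exposes it under the predefined `xml` namespace. The attribute's value is inherited by child elements until overridden.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the predefined "xml" namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
_XML_WS = re.compile(r"[ \t\r\n]+")

SAMPLE = """\
<Synset id="ss1">
  <Definition>
     padded   text
  </Definition>
  <Example xml:space="preserve">  keep   this  </Example>
</Synset>
"""


def iter_texts(elem, inherited="default"):
    """Yield (element, text), normalizing unless xml:space="preserve" is in effect."""
    mode = elem.get(XML_SPACE, inherited)
    text = elem.text or ""
    if mode != "preserve":
        # Application default chosen for this sketch: collapse and strip.
        text = _XML_WS.sub(" ", text).strip(" ")
    yield elem, text
    for child in elem:
        # Children inherit the mode unless they set xml:space themselves.
        yield from iter_texts(child, mode)


results = {elem.tag: text for elem, text in iter_texts(ET.fromstring(SAMPLE))}
for tag, text in results.items():
    print(tag, repr(text))
```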