own-pt / openWordnet-PT

OpenWordnet-PT: an open access wordnet for Portuguese
http://openwordnet-pt.org

LMF format invalid #121

Closed arademaker closed 3 years ago

arademaker commented 7 years ago
$ xmllint --noout --dtdvalid ../gwa-schemas/WN-LMF.dtd own-pt.lmf

Many errors are reported. I took the DTD from https://github.com/globalwordnet/schemas. More info at http://globalwordnet.github.io/schemas/.

fcbr commented 7 years ago

For the record, back in Oct/2016 it was validating correctly using John McCrae's tool:

fcbr@FCBR-TP:~/repos/gwn/gwn-scala-api$ ./gwn -v -i /tmp/own-pt.xml -f WNLMF
Validation successful
|       | Lang | Words     | Synsets   |
|:------|:----:|:---------:|:---------:|
| ownpt |  pt  |     55143 |    117659 |

The code that generates LMF is here.

vcvpaiva commented 7 years ago

I don't get it. We have not changed OWN-PT at all since Oct 2016, so this must mean that they changed their validation tools, correct? Do we have examples of failures? We need to know whether this is a bug in their validation tool or a bug on our side that was not caught before.

arademaker commented 7 years ago

They changed the DTD.

arademaker commented 7 years ago

Francis said:

It still does not validate!

$ xmlstarlet val -e -S -d WN-LMF.dtd own-pt.lmf
own-pt.lmf:5.0: Element Lemma content does not follow the DTD, expecting (Tag)*, got (Sense )

Please make sure it validates, and then try to upload it. Then we can add some interesting words, ...

gris commented 7 years ago

The most frequent error comes from a change in the DTD: the Sense element is no longer a child of Lemma but a sibling of it. You can see an example of this at https://github.com/globalwordnet/schemas/blob/master/example.xml. I am working on fixing the code so that it generates these elements correctly.

gris commented 7 years ago

Another recurring error is a syntax one. Some id attributes were generated with spaces, like "own-pt-gasolina sem chumbo-n". I still have to investigate how to deal with it, maybe by changing the way the ids are generated.

arademaker commented 7 years ago

@joaomarcosgris we can replace each space with "_". But we can also check whether another pattern for the ids would work better.
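The space-to-underscore idea could be sketched as follows. This is only an illustration: `make_id` is a hypothetical helper, not the function used by the actual generator, and the `prefix-lemma-pos` pattern is inferred from the id quoted above. It handles whitespace only; other characters that are not allowed in an XML ID would need their own treatment.

```python
import re


def make_id(prefix: str, lemma: str, pos: str) -> str:
    """Build an entry id such as 'own-pt-gasolina_sem_chumbo-n' (hypothetical helper)."""
    # Collapse every run of whitespace in the lemma into a single underscore,
    # since whitespace is not allowed inside an XML ID value.
    safe_lemma = re.sub(r"\s+", "_", lemma.strip())
    return f"{prefix}-{safe_lemma}-{pos}"


print(make_id("own-pt", "gasolina sem chumbo", "n"))
# prints: own-pt-gasolina_sem_chumbo-n
```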

gris commented 7 years ago

"own-pt.lmf:35: element SenseRelation: validity error : Value "event" for attribute relType of SenseRelation is not among the enumerated set". Now, relType has to be one of the following values "antonym|also|participle|pertainym|derivation|domain_topic|has_domain_topic|domain_region|has_domain_region|exemplifies|is_exemplified_by|similar|other"

gris commented 7 years ago

Another interesting type of error starts at line 88 of the LMF file.

    <LexicalEntry id="own-pt-dormir-v">
        <Lemma writtenForm="dormir" partOfSpeech="v">
            <Sense id="own-pt-00017282-v-dormir" synset="own-pt-00017282-v">

            </Sense>
        </Lemma>
        <Lemma writtenForm="dormir" partOfSpeech="v">
            <Sense id="own-pt-00014742-v-dormir" synset="own-pt-00014742-v">

            </Sense>
        </Lemma>
        <Lemma writtenForm="dormir" partOfSpeech="v">
            <Sense id="own-pt-00014405-v-dormir" synset="own-pt-00014405-v">

            </Sense>
        </Lemma>
    </LexicalEntry>

"own-pt.lmf:88: element LexicalEntry: validity error : Element LexicalEntry content does not follow the DTD, expecting (Lemma , Form , Sense , SyntacticBehaviour*), got (Lemma Lemma Lemma )"

This means that each LexicalEntry element can have only one Lemma, but it can have more than one Sense. I think that if we solve the Lemma/Sense sibling problem that I described above, we will solve this one as a bonus.
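A minimal sketch of that restructuring on an already generated file, using only Python's standard library. `flatten_entry` is a hypothetical name, and the sketch assumes that all Lemma elements inside one LexicalEntry are identical (as in the "dormir" entry above) and that the entry has no SyntacticBehaviour children, which would have to stay after the Sense elements.

```python
import xml.etree.ElementTree as ET


def flatten_entry(entry: ET.Element) -> None:
    """Keep the first Lemma and make every Sense a direct child of the entry."""
    lemmas = entry.findall("Lemma")
    senses = []
    for lemma in lemmas:
        # Detach each Sense from its Lemma, remembering the original order.
        for sense in lemma.findall("Sense"):
            lemma.remove(sense)
            senses.append(sense)
    # Drop the duplicated Lemma elements (assumed identical to the first one).
    for extra in lemmas[1:]:
        entry.remove(extra)
    # Re-attach the Sense elements as siblings of the remaining Lemma.
    entry.extend(senses)


SAMPLE = """<LexicalEntry id="own-pt-dormir-v">
  <Lemma writtenForm="dormir" partOfSpeech="v">
    <Sense id="own-pt-00017282-v-dormir" synset="own-pt-00017282-v"/>
  </Lemma>
  <Lemma writtenForm="dormir" partOfSpeech="v">
    <Sense id="own-pt-00014742-v-dormir" synset="own-pt-00014742-v"/>
  </Lemma>
  <Lemma writtenForm="dormir" partOfSpeech="v">
    <Sense id="own-pt-00014405-v-dormir" synset="own-pt-00014405-v"/>
  </Lemma>
</LexicalEntry>"""

entry = ET.fromstring(SAMPLE)
flatten_entry(entry)
print([child.tag for child in entry])
# prints: ['Lemma', 'Sense', 'Sense', 'Sense']
```

Fixing the generator itself is preferable to post-processing the output, so this is mainly useful for checking that the target shape passes validation.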

gris commented 7 years ago

One strange thing is that the LMF file validated at some point, even with syntax errors like the spaces inside the id attribute values.

arademaker commented 7 years ago

@joaomarcosgris as I said, in the past @fcbr used the web service for validation. So maybe the service had a bug.

gris commented 7 years ago

Another strange thing is that the LMF file doesn't have any Example or SynsetRelation elements. At the same time, all the relType attribute values on SenseRelation elements are invalid.
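To see which relType values need to be remapped, one could count the values that fall outside the enumerated set. The sketch below copies the allowed values from the validity error quoted earlier in this thread; `invalid_sense_rel_types` is a hypothetical helper, and the sample document (ids, targets) is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Allowed values, copied from the validity error message quoted in this thread.
ALLOWED_SENSE_REL_TYPES = set(
    "antonym|also|participle|pertainym|derivation|domain_topic|"
    "has_domain_topic|domain_region|has_domain_region|exemplifies|"
    "is_exemplified_by|similar|other".split("|")
)


def invalid_sense_rel_types(xml_text: str) -> Counter:
    """Count relType values on SenseRelation elements that the DTD rejects."""
    root = ET.fromstring(xml_text)
    bad = Counter()
    for rel in root.iter("SenseRelation"):
        rel_type = rel.get("relType")
        if rel_type not in ALLOWED_SENSE_REL_TYPES:
            bad[rel_type] += 1
    return bad


# Made-up sample: one invalid value ("event") and one valid value ("antonym").
SAMPLE = """<LexicalResource>
  <Lexicon id="own-pt">
    <LexicalEntry id="e1">
      <Lemma writtenForm="a" partOfSpeech="n"/>
      <Sense id="s1" synset="ss1">
        <SenseRelation relType="event" target="s2"/>
        <SenseRelation relType="antonym" target="s2"/>
      </Sense>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>"""

print(invalid_sense_rel_types(SAMPLE))
# prints: Counter({'event': 1})
```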

arademaker commented 7 years ago

Interim solution in https://github.com/own-pt/wordnet-editor/commit/c10151e3cb751ce1ee25898f7b697c1b2de3104b

arademaker commented 7 years ago

Documentation at http://globalwordnet.github.io/schemas/

fredsonaguiar commented 3 years ago

In this repository we have ownlmf_format, which formats OWN-PT and OWN-EN as a valid LMF-1.1 instance. In each case below we parse only the files for that language, not "pt" and "en" together.

For OWN-PT, run as:

python ownlmf_format.py own-files/own-pt-* ili-map.ttl -o own-pt-lmf-1.0.xml -li own-pt -lb OpenWordnet-PT -vr 1.0 -lg pt -cs 1.0 --status checked -v

For OWN-EN, run as:

python ownlmf_format.py own-files/own-en-* ili-map.ttl -o own-en-lmf-1.0.xml -li own-en -lb OpenWordnet-EN -vr 1.0 -lg en -cs 1.0 --status checked -v

For details and other options, such as --url, --email and --status, try the help flag -h.

arademaker commented 3 years ago

Nice, are we ready to create the first release and attach the XML files?

arademaker commented 3 years ago

@FredsoNerd, the code is ready to transform the RDF into XML that validates against the GWA DTD 1.0 and 1.1, so please close this issue. The final generation of the XML is part of #168.