Closed arademaker closed 3 years ago
For the record, back in Oct/2016 it was validating correctly using John McRae's tool:
fcbr@FCBR-TP:~/repos/gwn/gwn-scala-api$ ./gwn -v -i /tmp/own-pt.xml -f WNLMF
Validation successful
| | Lang | Words | Synsets |
|:------|:----:|:---------:|:---------:|
| ownpt | pt | 55143 | 117659 |
The code that generates LMF is here.
I don't get it. We have not changed OWN-PT at all recently (since Oct 2016), so it must mean that they changed their validation tools, correct? Do we have examples of failures? We need to know whether it's a bug in their validation tool or a bug we already had that was not caught before...
They changed the DTD
Francis said:
It still does not validate!
$ xmlstarlet val -e -S -d WN-LMF.dtd own-pt.lmf
own-pt.lmf:5.0: Element Lemma content does not follow the DTD, expecting (Tag)*, got (Sense )
Please make sure it validates, and then try to upload it. Then we can add some interesting words, ...
The most common error comes from a change in the DTD: the Sense element is no longer a child of Lemma but a sibling of it. You can see an example of this at https://github.com/globalwordnet/schemas/blob/master/example.xml. I am working on fixing the code so that it generates these elements correctly.
Another recurrent error was a syntax one. Some id attributes were generated with spaces like "own-pt-gasolina sem chumbo-n". I still have to investigate how to deal with it, maybe changing the way the ids are generated.
@joaomarcosgris we can replace spaces with "_". But we can also see if we can use another pattern for ids.
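The underscore substitution suggested above could look roughly like this. It is only a sketch: `make_xml_id` is a hypothetical helper name, not a function from the OWN-PT code base, and it assumes that whitespace is the only problem in the generated ids.

```python
import re


def make_xml_id(raw):
    """Return `raw` with every run of whitespace replaced by one underscore.

    Hypothetical helper for illustration; the real generator may build ids
    differently.
    """
    return re.sub(r"\s+", "_", raw.strip())


print(make_xml_id("own-pt-gasolina sem chumbo-n"))
# own-pt-gasolina_sem_chumbo-n
```

This mapping is not injective: a lemma that already contains an underscore could end up with the same id as one that contains a space, so a real fix would probably also need a check for duplicate ids.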
"own-pt.lmf:35: element SenseRelation: validity error : Value "event" for attribute relType of SenseRelation is not among the enumerated set". Now, relType has to be one of the following values "antonym|also|participle|pertainym|derivation|domain_topic|has_domain_topic|domain_region|has_domain_region|exemplifies|is_exemplified_by|similar|other"
Another interesting type of error starts at line 88 of the LMF file.
<LexicalEntry id="own-pt-dormir-v">
<Lemma writtenForm="dormir" partOfSpeech="v">
<Sense id="own-pt-00017282-v-dormir" synset="own-pt-00017282-v">
</Sense>
</Lemma>
<Lemma writtenForm="dormir" partOfSpeech="v">
<Sense id="own-pt-00014742-v-dormir" synset="own-pt-00014742-v">
</Sense>
</Lemma>
<Lemma writtenForm="dormir" partOfSpeech="v">
<Sense id="own-pt-00014405-v-dormir" synset="own-pt-00014405-v">
</Sense>
</Lemma>
</LexicalEntry>
"own-pt.lmf:88: element LexicalEntry: validity error : Element LexicalEntry content does not follow the DTD, expecting (Lemma , Form , Sense , SyntacticBehaviour*), got (Lemma Lemma Lemma )"
This means that each LexicalEntry element can have only one Lemma, but it can have more than one Sense. I think that if we solve the Lemma/Sense sibling problem that I described above, we will solve this one as a bonus.
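As a rough illustration of that restructuring, the sketch below post-processes one LexicalEntry with the standard library's `xml.etree.ElementTree`: it keeps the first Lemma, drops the repeated ones, and moves every Sense up to be a sibling of the Lemma. `hoist_senses` is a hypothetical name, and the sketch assumes that all Lemma elements inside one entry carry the same writtenForm and partOfSpeech, as in the "dormir" example. The actual fix belongs in the generator itself, not in a post-processing step.

```python
import xml.etree.ElementTree as ET


def hoist_senses(entry):
    """Keep one Lemma per LexicalEntry and make each Sense its sibling."""
    lemmas = entry.findall("Lemma")
    if not lemmas:
        return entry
    first = lemmas[0]
    senses = []
    for lemma in lemmas:
        # findall returns a list, so removing children while looping is safe.
        for sense in lemma.findall("Sense"):
            lemma.remove(sense)
            senses.append(sense)
        if lemma is not first:
            # Assumption: repeated Lemma elements are duplicates of the first.
            entry.remove(lemma)
    # Re-insert the senses directly after the surviving Lemma.
    insert_at = list(entry).index(first) + 1
    for offset, sense in enumerate(senses):
        entry.insert(insert_at + offset, sense)
    return entry


SAMPLE_ENTRY = """<LexicalEntry id="own-pt-dormir-v">
<Lemma writtenForm="dormir" partOfSpeech="v">
<Sense id="own-pt-00017282-v-dormir" synset="own-pt-00017282-v"></Sense>
</Lemma>
<Lemma writtenForm="dormir" partOfSpeech="v">
<Sense id="own-pt-00014742-v-dormir" synset="own-pt-00014742-v"></Sense>
</Lemma>
<Lemma writtenForm="dormir" partOfSpeech="v">
<Sense id="own-pt-00014405-v-dormir" synset="own-pt-00014405-v"></Sense>
</Lemma>
</LexicalEntry>"""

fixed_entry = hoist_senses(ET.fromstring(SAMPLE_ENTRY))
print([child.tag for child in fixed_entry])
# ['Lemma', 'Sense', 'Sense', 'Sense']
```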
One strange thing is that the LMF was validated at some point, even with syntax errors like the spaces inside the id attribute values.
@joaomarcosgris as I said, in the past @fcbr used the web service for validation. So maybe the service had a bug.
Another strange thing is that the LMF file doesn't have any Example or SynsetRelation element. At the same time, all the attribute values for relType on SenseRelation elements are invalid.
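To list which relType values are affected without running the full DTD validation, a small check like the following might help. It is a sketch only: the allowed values are copied from the validator message quoted earlier in this thread, `invalid_sense_rel_types` is a hypothetical helper name, and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Enumerated relType values, copied from the validity error quoted earlier.
ALLOWED_SENSE_REL_TYPES = frozenset(
    "antonym|also|participle|pertainym|derivation|domain_topic|"
    "has_domain_topic|domain_region|has_domain_region|exemplifies|"
    "is_exemplified_by|similar|other".split("|")
)


def invalid_sense_rel_types(xml_text):
    """Return the sorted relType values that are outside the allowed set."""
    root = ET.fromstring(xml_text)
    # A missing relType attribute is reported as the empty string.
    found = {el.get("relType", "") for el in root.iter("SenseRelation")}
    return sorted(found - ALLOWED_SENSE_REL_TYPES)


# Made-up fragment for illustration, not taken from own-pt.lmf.
SAMPLE_SENSE = """<Sense id="s1" synset="ss1">
  <SenseRelation relType="event" target="s2"/>
  <SenseRelation relType="antonym" target="s3"/>
</Sense>"""

print(invalid_sense_rel_types(SAMPLE_SENSE))
# ['event']
```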
Intermediary solution in https://github.com/own-pt/wordnet-editor/commit/c10151e3cb751ce1ee25898f7b697c1b2de3104b
documentation at http://globalwordnet.github.io/schemas/
In this repository we have ownlmf_format, which is responsible for formatting OWN-PT and OWN-EN as valid LMF-1.1 instances. In each case below we parse only the files for that language, not the "pt" and "en" files together.
For OWN-PT, run as:
python ownlmf_format.py own-files/own-pt-* ili-map.ttl -o own-pt-lmf-1.0.xml -li own-pt -lb OpenWordnet-PT -vr 1.0 -lg pt -cs 1.0 --status checked -v
For OWN-EN, run as:
python ownlmf_format.py own-files/own-en-* ili-map.ttl -o own-en-lmf-1.0.xml -li own-en -lb OpenWordnet-EN -vr 1.0 -lg en -cs 1.0 --status checked -v
For details and other options, such as --url, --email and --status, try the help flag -h.
Nice, are we ready to create the first release and attach the XML files?
@FredsoNerd, the code is ready to transform the RDF into XML that is valid against the GWA DTD 1.1 and 1.0, please close this issue. The final generation of the XML is part of #168
Many errors are reported. I took the DTD from https://github.com/globalwordnet/schemas. More info at http://globalwordnet.github.io/schemas/.