oxigraph / rio

RDF parsers library
Apache License 2.0

Wrong behavior in Wordnet-LMF XML Format #33

Closed ryou90 closed 3 years ago

ryou90 commented 3 years ago

Hello, is it possible to extend Rio to support the WN-LMF (https://globalwordnet.github.io/schemas/) format as well?

Currently I use the library to import different wordnets into a triple store. The XML pattern parser works fine.

However, under certain circumstances, the behavior of the parser is faulty:

XML in WN-LMF format:

<LexicalEntry id="ewn-symbolization-n">
  <Lemma partOfSpeech="n" writtenForm="symbolization" />
  <Sense id="ewn-symbolization-n-06614677-01" synset="ewn-06614677-n" dc:identifier="symbolization%1:10:00::">
    <SenseRelation relType="derivation" target="ewn-symbolize-v-00989629-01" />
  </Sense>
  <Sense id="ewn-symbolization-n-05773412-02" synset="ewn-05773412-n" dc:identifier="symbolization%1:09:00::">
    <SenseRelation relType="derivation" target="ewn-symbolize-v-00837915-02" />
  </Sense>
  <Sense id="ewn-symbolization-n-00413284-02" synset="ewn-00413284-n" dc:identifier="symbolization%1:04:00::" />
</LexicalEntry>

Output:

...
('riog000034', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'LexicalEntry')
('riog000034', 'id', '"ewn-symbolization-n"')
('riog000035', 'writtenForm', '"symbolization"')
('riog000035', 'partOfSpeech', '"n"')
('riog000034', 'Lemma', 'riog000035')
('riog000036', 'http://purl.org/dc/elements/1.1/identifier', '"symbolization%1:10:00::"')
('riog000036', 'synset', '"ewn-06614677-n"')
('riog000036', 'id', '"ewn-symbolization-n-06614677-01"')
('riog000037', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'SenseRelation')
('riog000037', 'target', '"ewn-symbolize-v-00989629-01"')
('riog000037', 'relType', '"derivation"')
('riog000034', 'sense', 'riog000037')
('riog000038', 'http://purl.org/dc/elements/1.1/identifier', '"symbolization%1:09:00::"')
('riog000038', 'synset', '"ewn-05773412-n"')
('riog000038', 'id', '"ewn-symbolization-n-05773412-02"')
('riog000039', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'SenseRelation')
('riog000039', 'target', '"ewn-symbolize-v-00837915-02"')
('riog000039', 'relType', '"derivation"')
('riog000034', 'sense', 'riog000039')
('riog000040', 'http://purl.org/dc/elements/1.1/identifier', '"symbolization%1:04:00::"')
('riog000040', 'synset', '"ewn-00413284-n"')
('riog000040', 'id', '"ewn-symbolization-n-00413284-02"')
('riog000034', 'sense', 'riog000040')
...

If the pattern is LexicalEntry - Sense - LexicalEntry, it works fine. But if a SenseRelation occurs within a Sense, the Sense node is never linked: the parser emits ('riog000034', 'Sense', 'riog000037'), so LexicalEntry points directly to SenseRelation and skips Sense. The output should instead contain ('riog000034', 'Sense', 'riog000036'), and ('riog000036', 'SenseRelation', 'riog000037') must exist as well.

Thx Robert

Tpt commented 3 years ago

Hi! Thank you for being interested in Rio.

Rio focuses on RDF formats, and WN-LMF is not an RDF-based format. So I am afraid supporting it is a bit out of scope for the Rio library.

The parsing works this way because in RDF/XML the tag nesting alternates, more or less, as "entity" > "relation" > "entity"... WN-LMF does not follow this pattern at all, so I believe the simplest way to go is to parse this data with a plain XML parser and then generate clean RDF from it. This would let you build proper URIs for the different entities from the WordNet ids instead of relying on the automatically generated rio... blank node ids. Another, probably even simpler, option is to use the GlobalWordNet RDF format directly. The Rio Turtle parser should be able to parse this format properly.

If you want to parse WN-LMF yourself, you could use the minidom DOM implementation if the data is small enough to fit in memory. If you want something very fast but much more complex, you could use the quick-xml event-based parser (it is the XML parsing library that Rio uses internally).
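To illustrate the "plain XML parser, then generate RDF" route, here is a minimal sketch. It is not Rio code: it assumes Python's standard-library `xml.etree.ElementTree` as the XML parser, and the `BASE` and `VOCAB` IRIs, the predicate names, and the helper functions are all hypothetical choices made for the example. It also assumes the `dc` prefix is declared on the element being parsed, which the fragment quoted in the issue leaves out.

```python
import xml.etree.ElementTree as ET

# Hypothetical IRIs chosen for this sketch; they are not defined by WN-LMF or Rio.
BASE = "http://example.org/wn/"
VOCAB = "http://example.org/wn/schema#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def entity(local_id):
    """Build an N-Triples IRI term for a WordNet id."""
    return "<" + BASE + local_id + ">"


def predicate(name):
    """Build an N-Triples IRI term for a (hypothetical) vocabulary name."""
    return "<" + VOCAB + name + ">"


def literal(value):
    """Build a plain N-Triples string literal, escaping the special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def lexical_entry_to_ntriples(xml_text):
    """Turn one <LexicalEntry> element into a list of N-Triples lines."""
    entry = ET.fromstring(xml_text)
    entry_iri = entity(entry.get("id"))
    lines = []
    for sense in entry.findall("Sense"):
        sense_iri = entity(sense.get("id"))
        # LexicalEntry -> Sense: the link missing from the output in the issue.
        lines.append(f"{entry_iri} {predicate('Sense')} {sense_iri} .")
        lines.append(f"{sense_iri} {predicate('synset')} {entity(sense.get('synset'))} .")
        # ElementTree exposes namespaced attributes as "{namespace}localname".
        identifier = sense.get("{" + DC_NS + "}identifier")
        if identifier is not None:
            lines.append(f"{sense_iri} <{DC_NS}identifier> {literal(identifier)} .")
        # Sense -> target sense, using relType as the predicate name.
        for rel in sense.findall("SenseRelation"):
            lines.append(
                f"{sense_iri} {predicate(rel.get('relType'))} {entity(rel.get('target'))} ."
            )
    return lines


SAMPLE = """<LexicalEntry xmlns:dc="http://purl.org/dc/elements/1.1/" id="ewn-symbolization-n">
  <Lemma partOfSpeech="n" writtenForm="symbolization" />
  <Sense id="ewn-symbolization-n-06614677-01" synset="ewn-06614677-n" dc:identifier="symbolization%1:10:00::">
    <SenseRelation relType="derivation" target="ewn-symbolize-v-00989629-01" />
  </Sense>
</LexicalEntry>"""

for line in lexical_entry_to_ntriples(SAMPLE):
    print(line)
```

Because the IRIs are built from the WordNet ids, the Sense keeps a stable identity and the SenseRelation becomes a direct Sense-to-Sense triple, so no blank nodes are needed. Modelling `relType` as the predicate is only one possible design; a real conversion would probably reuse an existing vocabulary instead of the made-up `schema#` names. The resulting lines are intended to be N-Triples that a parser such as Rio's could then read.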

ryou90 commented 3 years ago

Hi! Thank you for your fast answer :) I found another solution for my problem. Using the globalwordnet converter, I have translated my files to the simple N-Triples format. After that, reading the files with Rio is an easy step :)