rvesse / lubm-uba

Parallelized version of the Lehigh University Benchmark (LUBM) Data Generator

Consolidation for OWL format leads to illegal XML #1

Closed LorenzBuehmann closed 6 years ago

LorenzBuehmann commented 7 years ago

Consolidation does not seem to work for the OWL format: to write a single valid file, one would have to load everything into a single Jena model first.

What I did:

./generate.sh -u 10 -t 4 -f OWL --consolidate Full -o /tmp/lubm/owl/10 --onto http://swat.cse.lehigh.edu/onto/univ-bench.owl

When I load the output with Jena, it fails with:

2017-07-31 10:49:51 ERROR ErrorHandlerFactory$ErrorLogger:84 - [line: 137387, col: 6 ] The processing instruction target matching "[xX][mM][lL]" is not allowed.
2017-07-31 10:49:51 WARN  Logging$class:62 - Exception while parsing RDF
org.apache.jena.riot.RiotException: [line: 137387, col: 6 ] The processing instruction target matching "[xX][mM][lL]" is not allowed.
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerNoWarnings.fatal(ErrorHandlerFactory.java:166)
    at org.apache.jena.riot.lang.ReaderRIOTRDFXML$ErrorHandlerBridge.fatalError(ReaderRIOTRDFXML.java:260)
    at org.apache.jena.rdfxml.xmlinput.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:47)
    at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:199)
    at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.fatalError(XMLHandler.java:229)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.scanPIData(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanPIData(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.scanPI(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentScannerImpl$TrailingMiscDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.jena.rdfxml.xmlinput.impl.RDFXMLParser.parse(RDFXMLParser.java:150)
    at org.apache.jena.rdfxml.xmlinput.ARP.load(ARP.java:118)
    at org.apache.jena.riot.lang.ReaderRIOTRDFXML.parse(ReaderRIOTRDFXML.java:135)
    at org.apache.jena.riot.lang.ReaderRIOTRDFXML.read(ReaderRIOTRDFXML.java:79)
    at org.apache.jena.riot.RDFParser.read(RDFParser.java:293)
    at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:283)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:233)

Indeed, the problem is that each consolidated OWL file contains several separate XML documents; it looks like the individual files were simply concatenated. I also looked at the code, and for the OWL format nothing prevents the user from selecting the "Full" consolidation option.
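For anyone who wants to confirm the same symptom on their own output, here is a minimal sketch in Python (standard library only, not part of lubm-uba). It assumes every concatenated document starts with its own XML declaration; the helper names `split_concatenated_xml` and `is_well_formed` are made up for illustration. Python's parser words the error differently from Xerces, but it rejects the file for the same reason.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
XML_DECL = re.compile(r"<\?xml\s[^?]*\?>")


def split_concatenated_xml(text):
    """Split text at each XML declaration; returns one string per document.

    Assumption: every document begins with an XML declaration. Anything
    before the first declaration is ignored.
    """
    starts = [m.start() for m in XML_DECL.finditer(text)]
    if not starts:
        return [text] if text.strip() else []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def is_well_formed(text):
    """Return True if text parses as a single XML document."""
    try:
        ET.fromstring(text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    doc = (
        '<?xml version="1.0"?>\n'
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>\n'
    )
    combined = doc + doc  # what naive concatenation produces
    print("combined file well-formed:", is_well_formed(combined))
    for i, part in enumerate(split_concatenated_xml(combined)):
        print("part", i, "well-formed:", is_well_formed(part))
```

Each part parses on its own, while the combined text does not, which matches what the Jena error points at: a second XML declaration in the middle of the file.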

I guess that for large datasets, whether one can write a single file (or a single file per thread) depends on the amount of memory available, since all the data has to be loaded into a single model.
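One possible way around the memory concern, sketched here in Python rather than in the project's Java and not necessarily what the generator does, is to stream the top-level elements of each RDF/XML document under one shared rdf:RDF root instead of building a full model. The function name `merge_rdfxml` is hypothetical. The sketch assumes the documents use absolute IRIs (any xml:base on the original roots is dropped) and that blank node IDs do not clash across documents.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ET.register_namespace("rdf", RDF_NS)  # keep the usual rdf: prefix in the output


def merge_rdfxml(sources, out):
    """Copy the children of each document's root element under one rdf:RDF root.

    sources: iterable of file names or binary file objects (RDF/XML).
    out: text file object that receives a single XML document.
    """
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write('<rdf:RDF xmlns:rdf="%s">\n' % RDF_NS)
    for src in sources:
        depth = 0
        root = None
        for event, elem in ET.iterparse(src, events=("start", "end")):
            if event == "start":
                if root is None:
                    root = elem  # the document's own rdf:RDF element
                depth += 1
            else:
                depth -= 1
                if depth == 1:
                    # A direct child of the root is complete: write it out,
                    # then detach it so memory use stays bounded.
                    out.write(ET.tostring(elem, encoding="unicode"))
                    out.write("\n")
                    root.clear()
    out.write("</rdf:RDF>\n")


if __name__ == "__main__":
    template = (
        '<?xml version="1.0"?>'
        '<rdf:RDF xmlns:rdf="%s" xmlns:ex="http://example.org/">'
        '<ex:Thing rdf:about="http://example.org/%s"/>'
        "</rdf:RDF>"
    )
    docs = [io.BytesIO((template % (RDF_NS, name)).encode("utf-8")) for name in "ab"]
    merged = io.StringIO()
    merge_rdfxml(docs, merged)
    print(merged.getvalue())
```

The output has exactly one XML declaration and one root, so it parses as a single document. Namespace prefixes other than rdf: are regenerated per element, which is verbose but still valid XML.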

rvesse commented 6 years ago

I think I have found and fixed this issue. I am currently running through some validation tests to verify that it has been fixed appropriately.

rvesse commented 6 years ago

This is now fixed