spechub / Hets

The Heterogeneous Tool Set
http://hets.eu
GNU General Public License v2.0

use SAX parser for XML parsing #1248

Open sternk opened 10 years ago

sternk commented 10 years ago

Reported by till and assigned to maeder Migrated from http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248

xmlHexpat.hs xmlHaXml.hs xmlHexpatTree.hs xmlLight.hs tagSoup.hs


Ask the Haskell mailing list for an efficient SAX parser to fix the current space leaks.

sternk commented 10 years ago

Comment by maeder Migrated from http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248#comment:2


The hexpat SAX parser can process a 114 MB file in less than 30 seconds using 30 KB of memory. (Building a tree instead goes up to 3 GB.)

./xmlHexpat < CVRG_EPOntology.owl.xml.xml > CVRG_EPOntology.owl.xml.xml.xml

-rw-r--r-- 1 maeder wimi 114M 17. Feb 19:24 CVRG_EPOntology.owl.xml
-rw-r--r-- 1 maeder wimi 114M 18. Feb 11:05 CVRG_EPOntology.owl.xml.xml
-rw-r--r-- 1 maeder wimi 114M 18. Feb 11:09 CVRG_EPOntology.owl.xml.xml.xml
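
For illustration, streaming with hexpat's SAX interface might look like the sketch below. It is not the attached xmlHexpat.hs (which apparently writes the document back out); it only counts opening tags, and the name `countStartElements` is made up for this example. It assumes the `hexpat` package's `Text.XML.Expat.SAX.parse`, which turns a lazy ByteString into a lazy list of events, so a strict fold over that list never needs the whole document or a tree in memory.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.List (foldl')
import Text.XML.Expat.SAX (SAXEvent (..), defaultParseOptions, parse)

-- | Count opening tags by folding strictly over the lazy SAX event list.
-- No tree is built, so events and input chunks that have been consumed
-- can be garbage collected. (Hypothetical helper, not part of Hets.)
countStartElements :: L.ByteString -> Int
countStartElements input = foldl' step 0 events
  where
    -- Fix tag and text types to strict ByteString (an arbitrary choice).
    events :: [SAXEvent B.ByteString B.ByteString]
    events = parse defaultParseOptions input

    step :: Int -> SAXEvent B.ByteString B.ByteString -> Int
    step n (StartElement _ _) = n + 1
    step n _ = n -- ignore text, end tags, comments and parse failures

main :: IO ()
main = do
  input <- L.getContents -- read stdin lazily, chunk by chunk
  print (countStartElements input)
```

Compiled to a binary (called `countStarts` here purely as an example), it would be run the same way as above: `./countStarts < CVRG_EPOntology.owl.xml`. Whether memory stays flat depends on nothing else holding on to the event list, which is exactly what building a tree does.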

The simple XML (Light) library crashes with:

xmlLight: internal error: getMBlock: mmap: Operation not permitted
    (GHC version 7.6.2 for i386_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

The HaXml library additionally needs more stack space ("+RTS -K20M") and then crashes, too.

sternk commented 10 years ago

Comment by maeder Migrated from http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248#comment:3


Even with hexpat for parsing, Hets still has to construct a tree, and it still crashes.

sternk commented 10 years ago

Comment by till Migrated from http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248#comment:4


OK, so hexpat seems to be a good choice. Why does Hets crash nevertheless? What exactly happens? Does Hets use hexpat already?

sternk commented 10 years ago

Comment by maeder Migrated from http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248#comment:5


Hets has used hexpat since spechub/Hets@ca12babf0d23c4aa3c5d4d29533e306c4582f250.

sternk commented 10 years ago

Comment by maeder Migrated from http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248#comment:6


I've added some optimizations, but the largest ontology, NIFSTD.owl, still needs 16.5 GB of memory and about 10 minutes. (The Java part needs 5 GB and 3 minutes.)

sternk commented 10 years ago

Comment by maeder Migrated from http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248#comment:7


The above numbers were measured without writing the .xml output file. I aborted a run that does write it after 72 minutes and 32 GB of memory usage.