Hi,
I'm currently changing my fork [1] of the DBpedia extraction framework [2] to use the Sweble parser instead of a running MediaWiki instance to extract the abstract of each wiki page.
I noticed a difference when a page starts with a thumb image. The HTML output of Sweble is nearly fine, but the wiki text that follows the image is no longer wrapped in an HTML paragraph tag (<p>). The extraction framework currently requires this wrapping [3].
I created a minimal Maven example (parsingThumbImages.zip). If "thumb" is removed from the MediaWiki markup "[[File:Example.jpg|thumb]]", the whole text is wrapped in paragraph tags; otherwise it is not.
Do I have to change some WikiConfig settings? (I already tried the auto-correct feature.) Or is this output intended? (I also tried the Parsoid parser [4]; with that parser the text is always wrapped in paragraph tags.)
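To make the expected behaviour concrete, the condition the extraction framework depends on can be sketched as a small standalone check. This is only an illustration in Python using the standard library, not code from Sweble or the DBpedia framework: the names `ParagraphCoverage` and `text_outside_paragraphs` are hypothetical, and the caller is assumed to pass in the HTML string produced by whichever parser is being tested.

```python
from html.parser import HTMLParser


class ParagraphCoverage(HTMLParser):
    """Collects non-blank text that appears outside of any <p> element."""

    def __init__(self):
        super().__init__()
        self.p_depth = 0   # how many <p> elements are currently open
        self.outside = []  # text fragments found outside every <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.p_depth > 0:
            self.p_depth -= 1

    def handle_data(self, data):
        # Whitespace between tags is ignored; only real text counts.
        if self.p_depth == 0 and data.strip():
            self.outside.append(data.strip())


def text_outside_paragraphs(html):
    """Return the list of text fragments that are not wrapped in <p>."""
    parser = ParagraphCoverage()
    parser.feed(html)
    parser.close()  # flush any text buffered at the end of the input
    return parser.outside
```

An empty result means all text is inside paragraph tags. Note that this simple check also reports text in other elements, such as an image caption, so it would need refining before being used for anything beyond a quick comparison of the two cases.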
Thanks
Best regards Sven
[1] https://github.com/sven-h/extraction-framework
[2] https://github.com/dbpedia/extraction-framework
[3] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala#L174
[4] https://www.mediawiki.org/wiki/Parsoid