sweble / sweble-wikitext

The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaWiki.
http://sweble.org/sites/swc-devel/develop-latest/tooling/sweble/sweble-wikitext

Parsing thumb images - no paragraph tags #75

Open sven-h opened 5 years ago

sven-h commented 5 years ago

Hi,

I'm currently changing my fork [1] of the DBpedia extraction framework [2] to use the Sweble parser instead of a running MediaWiki instance for extracting the abstract of each wiki page.

What I noticed is a difference when the page contains a thumb image at the beginning. The HTML output of Sweble is nearly fine, but the wiki text that follows the image is no longer wrapped in an HTML paragraph tag (<p>). The extraction framework currently requires this [3].

I have attached a minimal Maven example (parsingThumbImages.zip). If "thumb" is removed from the MediaWiki markup "[[File:Example.jpg|thumb]]", the whole text is wrapped in paragraph tags; otherwise it is not.
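
The expected behavior can be expressed as a small check on the rendered HTML. This is only a sketch: the class and method names (ParagraphCheck, hasParagraphTag) are hypothetical, and the HTML strings in main are hand-written stand-ins, not actual Sweble output.

```java
import java.util.regex.Pattern;

public class ParagraphCheck {
    // Matches an opening <p> tag with or without attributes, but not <pre> or <param>.
    private static final Pattern P_OPEN =
            Pattern.compile("<p(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Returns true if the given HTML contains at least one opening <p> tag. */
    public static boolean hasParagraphTag(String html) {
        return P_OPEN.matcher(html).find();
    }

    public static void main(String[] args) {
        // Stand-in strings (assumptions); real input would be the renderer's HTML output.
        String wrapped = "<p>Some introductory text.</p>";
        String bare = "Some introductory text.";
        System.out.println(hasParagraphTag(wrapped));
        System.out.println(hasParagraphTag(bare));
    }
}
```

With the rendered HTML for both variants of the markup passed to hasParagraphTag, the check should succeed whether or not "thumb" is present.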

Do I have to change some WikiConfig settings? (I already tried the auto-correct feature.) Or is this output intended? (I also tried the Parsoid parser [4]. With that parser, the text is always wrapped in paragraph tags.)

Thanks

Best regards,
Sven

[1] https://github.com/sven-h/extraction-framework
[2] https://github.com/dbpedia/extraction-framework
[3] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala#L174
[4] https://www.mediawiki.org/wiki/Parsoid