I'm trying to parse a Wikipedia dump. On the first article ("Anarchism"), the parser fails to detect comments because < and > are escaped:
&lt;!-- Attention! The external link portion [...] free of clutter. --&gt;
I'm not sure whether you'd consider this a bug, since it only happens because the wikitext is embedded in the MediaWiki XML format, where these characters have to be escaped or wrapped in CDATA.
An article about some XML format might also include an escaped <!-- as part of its actual content in order to showcase XML comments.
I'm reporting this anyway. Maybe there could be a configuration option to resolve character entities before parsing tags, to deal with this fairly common use case (dump scraping).
A similar issue occurs on the same article with &nbsp;, so I'll assume I'm supposed to process entities before passing the data to parse-wiki-text-2.
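For reference, here is a minimal sketch of the pre-processing step I have in mind, assuming the text is read raw from the dump rather than through an XML parser (which would normally decode these entities already). The helper name `unescapeXml` is hypothetical and not part of parse-wiki-text-2; it handles only the five predefined XML entities and decodes them in a single pass.

```typescript
// Hypothetical helper, not part of parse-wiki-text-2.
// Maps the five predefined XML entities to the characters they stand for.
const XML_ENTITIES: Record<string, string> = {
  "&lt;": "<",
  "&gt;": ">",
  "&amp;": "&",
  "&quot;": '"',
  "&apos;": "'",
};

// Decode one level of XML escaping in a single left-to-right pass.
// Because each match is replaced exactly once, "&amp;lt;" becomes the
// literal text "&lt;" and is not decoded a second time into "<".
function unescapeXml(text: string): string {
  return text.replace(
    /&(?:lt|gt|amp|quot|apos);/g,
    (entity) => XML_ENTITIES[entity] ?? entity,
  );
}

// Raw dump text in, wikitext out.
console.log(unescapeXml("&lt;!-- note --&gt; a&amp;nbsp;b"));
// prints "<!-- note --> a&nbsp;b"
```

The single pass matters for the showcase case mentioned above: an escaped comment marker that is part of the article's content stays escaped after decoding, so it should not be mistaken for a real comment. Numeric character references and HTML entities such as &nbsp; are deliberately left alone here.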