soerenmeier / parse-wiki-text-2

MIT No Attribution
5 stars 5 forks

XML/HTML entities affecting node detection #6

Closed Caellian closed 5 months ago

Caellian commented 5 months ago

I'm trying to parse a Wikipedia dump. On the first article ("Anarchism"), the parser fails to detect comments because < and > are escaped:

&lt;!-- Attention! The external link portion [...] free of clutter. --&gt;

I'm not sure whether you'd consider this a bug, since it only happens because the wikitext is embedded in the MediaWiki XML export format, where these characters have to be escaped or wrapped in CDATA.

I'm reporting it anyway: maybe there could be a configuration option to resolve character entities before parsing tags, to handle this fairly common use case (dump scraping).

Caellian commented 5 months ago

A similar issue occurs on the same article with &amp;nbsp;, so I'll assume I'm supposed to process entities before passing the data to parse-wiki-text-2.
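
For anyone hitting the same thing, here is a minimal sketch of that preprocessing step. It assumes the page text was read from the dump as raw bytes rather than through an XML parser (a conforming XML parser would already decode these entities when returning the text node). `unescape_xml` is a hypothetical helper written for this example, not part of parse-wiki-text-2. It decodes only the five predefined XML entities, in a single pass, so `&amp;nbsp;` becomes `&nbsp;` and is left for the wikitext parser to handle.

```rust
/// Hypothetical helper (not part of parse-wiki-text-2): undo one level of
/// XML escaping so that wikitext taken from a dump can be parsed as wikitext.
/// Only the five predefined XML entities are handled; numeric character
/// references and HTML named entities such as `&nbsp;` are left untouched.
fn unescape_xml(input: &str) -> String {
    const ENTITIES: [(&str, char); 5] = [
        ("&lt;", '<'),
        ("&gt;", '>'),
        ("&amp;", '&'),
        ("&quot;", '"'),
        ("&apos;", '\''),
    ];

    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    // Scan left to right; decoded output is never rescanned, so
    // "&amp;lt;" becomes "&lt;" rather than "<".
    while let Some(pos) = rest.find('&') {
        out.push_str(&rest[..pos]);
        rest = &rest[pos..];
        match ENTITIES.iter().find(|(name, _)| rest.starts_with(*name)) {
            Some((name, ch)) => {
                out.push(*ch);
                rest = &rest[name.len()..];
            }
            None => {
                // Not one of the five entities: keep the ampersand as-is.
                out.push('&');
                rest = &rest[1..];
            }
        }
    }
    out.push_str(rest);
    out
}
```

The result of `unescape_xml` on the page text is what would then be handed to the wikitext parser. Decoding in a single pass matters here: running a naive chain of string replacements could decode `&amp;lt;` twice and turn literal text into markup.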