outofcontrol / mediawiki-to-gfm

Converts Mediawiki format to Github Flavoured Markdown format

Converting huge files to markdown fails with "Huge input lookup" #24

Closed by trancexpress 1 year ago

trancexpress commented 1 year ago

I'm trying to convert the entire Eclipse wiki, exported to a single XML file. One of the errors I got was apparently caused by the file being too large:

Warning: SimpleXMLElement::__construct(): Entity: line 2518746: parser error : internal error: Huge input lookup in /mediawiki-to-gfm/app/src/Convert.php on line 346

This error seems to be gone with this change:

diff --git a/app/src/Convert.php b/app/src/Convert.php
index fc00640..dea6812 100644
--- a/app/src/Convert.php
+++ b/app/src/Convert.php
@@ -343,7 +343,7 @@ class Convert
      */
     public function loadData($xmlData)
     {
-        if (($xml = new \SimpleXMLElement($xmlData)) === false) {
+        if (($xml = new \SimpleXMLElement($xmlData,  LIBXML_PARSEHUGE)) === false) {
             throw new \Exception('Invalid XML File.');
         }
         $this->dataToConvert = $xml->xpath('page');

I invoke the converter with:

php -d memory_limit=4G ./convert.php --filename=/wikiexport/Eclipsepedia-20230228212752.xml --output=/wikiexport/markdown/

The XML file is about 130 MB. Since this is an extreme case (from my POV), I'm not sure fixing the bug is worth the effort.
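For readers who want to try the flag in isolation, here is a minimal standalone sketch of loading XML with `LIBXML_PARSEHUGE`. The function name `loadPagesFromXml` is hypothetical and not part of mediawiki-to-gfm. One assumption worth flagging: `SimpleXMLElement`'s constructor throws an exception on malformed input rather than returning `false`, so the sketch uses `try`/`catch` in place of the `=== false` check shown in the diff.

```php
<?php
// Hypothetical helper, not part of mediawiki-to-gfm.
// Parses an XML string with libxml's built-in size limits relaxed
// and returns the <page> children of the root element.
function loadPagesFromXml(string $xmlData): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    try {
        // LIBXML_PARSEHUGE lifts the parser's hard-coded limits on input size.
        $xml = new \SimpleXMLElement($xmlData, LIBXML_PARSEHUGE);
    } catch (\Exception $e) {
        // Gather the parser messages so the caller sees why parsing failed.
        $messages = array_map(fn($err) => trim($err->message), libxml_get_errors());
        libxml_clear_errors();
        throw new \RuntimeException('Invalid XML File: ' . implode('; ', $messages), 0, $e);
    } finally {
        // Restore the previous libxml error-handling mode.
        libxml_use_internal_errors($previous);
    }

    $pages = $xml->xpath('page');
    return is_array($pages) ? $pages : [];
}
```

A call such as `loadPagesFromXml(file_get_contents($path))` still reads the whole document into memory, so a raised `memory_limit` like the one in the command above is likely still needed for very large exports.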

outofcontrol commented 1 year ago

I've patched Convert.php with this fix. Thank you @trancexpress

trancexpress commented 1 year ago

Thank you very much!