prewk / xml-string-streamer

Stream large XML files with low memory consumption.
MIT License

Using tags inside CDATA breaks the StringWalker #63

Closed. LilyBergonzat closed this issue 5 years ago.

LilyBergonzat commented 5 years ago

I'm trying to load a simple XML document that has HTML inside one of its elements. My code is pretty straightforward.

<?php

use Prewk\XmlStringStreamer;
use Prewk\XmlStringStreamer\Parser\StringWalker;
// This class is like the File Streamer but takes strings
use Absolunet\Bundle\MiddlewareBundle\Stream\Raw as RawStream;

$faultyChunk = '<container>
<attr>
    <attrCode>Texte_marketing_long</attrCode>
    <value language="fr">
        <stringValue>
            <![CDATA[
                <p>test</p>
            ]]>
        </stringValue>
    </value>
    <value language="en">
        <stringValue>
            <![CDATA[
                <p>test</p>
            ]]>
        </stringValue>
    </value>
</attr></container>';

$stream = new RawStream($faultyChunk, 16384);
$parser = new StringWalker();
$streamer = new XmlStringStreamer($parser, $stream);

$node = $streamer->getNode();
$xmlObj = simplexml_load_string($node);

And I end up getting a warning: Premature end of data in tag attr line 2

This is because the walker parses the HTML inside the CDATA section and considers the "node" finished before the </attr> tag has been added. If I set "expectGT" to true, it works correctly, but I feel like I shouldn't have to do that, since the HTML code is inside a CDATA section and shouldn't be parsed.

What do you think? Thank you

prewk commented 5 years ago

Hello! I try to explain why you have to set it in the README:

You can allow the > character within XML comments and CDATA sections if you want. This is pretty uncommon, and therefore turned off by default for performance reasons.
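As a rough sketch of what turning this on looks like, assuming the parser takes its settings as an options array passed to the constructor. Only `expectGT` is named in this thread; every other option is left at its default. The `class_exists` guard is only there so the snippet also runs where the library isn't installed.

```php
<?php

use Prewk\XmlStringStreamer\Parser\StringWalker;

// "expectGT" is the option discussed in this thread; all other options
// keep their default values.
$options = [
    'expectGT' => true,
];

// Assumption: prewk/xml-string-streamer is installed and autoloaded.
// Without it, the guard below simply skips creating the parser.
if (class_exists(StringWalker::class)) {
    $parser = new StringWalker($options);
}
```

The resulting `$parser` would then be handed to `XmlStringStreamer` together with a stream, exactly as in the code from the original report.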

So it's purely to speed things up. The parsers aren't very clever; they do the bare minimum needed to parse most XML documents, which works for most use cases. The CDATA logic is pretty dumb, and there are probably edge cases that aren't supported at all.
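To illustrate the kind of edge case this is about, here is a minimal self-contained sketch. It is not the library's code: `firstNode` is a hypothetical helper that finds the first top-level element by counting opening and closing tags, and the `$skipCdata` flag is made up for this example. It only shows why markup inside CDATA throws off a simple tag counter.

```php
<?php

/**
 * Hypothetical helper, not part of xml-string-streamer: returns the first
 * top-level element of $xml, found by counting opening and closing tags.
 * With $skipCdata = false, markup inside CDATA is counted like real tags.
 */
function firstNode(string $xml, bool $skipCdata): ?string
{
    $depth = 0;
    $start = null;
    $pos = 0;

    while (($lt = strpos($xml, '<', $pos)) !== false) {
        // CDATA-aware mode: jump straight past the closing "]]>".
        if ($skipCdata && substr($xml, $lt, 9) === '<![CDATA[') {
            $end = strpos($xml, ']]>', $lt + 9);
            if ($end === false) {
                return null; // unterminated CDATA section
            }
            $pos = $end + 3;
            continue;
        }

        $gt = strpos($xml, '>', $lt);
        if ($gt === false) {
            return null; // unterminated tag
        }
        $tag = substr($xml, $lt, $gt - $lt + 1);
        $pos = $gt + 1;

        if (strncmp($tag, '</', 2) === 0) {
            $depth--; // closing tag
        } elseif ($tag[1] === '!' || $tag[1] === '?' || substr($tag, -2) === '/>') {
            continue; // declaration, comment, CDATA opener, or self-closing tag
        } else {
            if ($start === null) {
                $start = $lt; // remember where the first element begins
            }
            $depth++; // opening tag
        }

        // The node is complete once the depth is back at zero.
        if ($start !== null && $depth === 0) {
            return substr($xml, $start, $pos - $start);
        }
    }

    return null;
}

$xml = '<attr><v><![CDATA[<p>test</p>]]></v></attr>';

$naive = firstNode($xml, false); // '<attr><v><![CDATA[<p>test</p>]]></v>' (no </attr>)
$aware = firstNode($xml, true);  // the whole string, including </attr>
```

In the naive mode the `</p>` inside the CDATA section is counted as a closing tag, so the node is cut off one closing tag early, which is the same symptom as the "Premature end of data" warning in the report. The CDATA check costs an extra comparison on every `<`, which gives a rough sense of why a parser might leave such handling off by default.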

If you're looking around for incremental XML parsers, you might want to check out this one that I found as well: https://github.com/TBPixel/xml-streamer

I haven't tried it, but it looks a bit more modern than mine. Good luck!