prewk / xml-string-streamer

Stream large XML files with low memory consumption.
MIT License
356 stars 49 forks source link

Infinite loop issue #51

Closed Adraesh closed 7 years ago

Adraesh commented 7 years ago

Hi,

The loop never ends.

$stream = new Stream\File($file, 1024);
$parser = new Parser\UniqueNode(array('uniqueNode' => $node));

$streamer = new XmlStringStreamer($parser, $stream);

while ($node = $streamer->getNode())
{
        //do some stuff...
}

Thank you in advance.

prewk commented 7 years ago

Hi! Can you provide some XML that causes the infinite loop?

It's kind of hard to debug otherwise, considering you've pasted code from the readme :)

Adraesh commented 7 years ago

Hi!

Of course, my bad...

You can try with those 2 XML: https://drive.google.com/open?id=0BxlGN-YBj-q0UmVYS01yNWc4OW8 https://drive.google.com/open?id=0BxlGN-YBj-q0b2RNaG1mOWtwYW8

Careful the second one is more that 1GB.

Thank you in advance.

prewk commented 7 years ago

Great, thank you! I'll look into it as soon as possible.

prewk commented 7 years ago

The small XML seems to work, are you having trouble with that one as well? I'm having problems with the larger one that I'll look into. The XML itself looks a bit odd.

What nodes are you interested in? I assumed <hasMember> when I tried it out.

This works for the small XML:

$stream = new Stream\File("small.xml", 1024);
$parser = new Parser\UniqueNode(array('uniqueNode' => "hasMember"));

$streamer = new XmlStringStreamer($parser, $stream);

$counter = 1;

while ($node = $streamer->getNode())
{
    preg_match('/gml:id="([a-z0-9.-]+)"/', $node, $matches);

    if (isset($matches[1])) {
        echo "$counter: $matches[1]\n";
    } else {
        echo "$counter: ?\n";
    }

    $counter++;
}
Adraesh commented 7 years ago

Hey!

Thank you for the fast review.

Yes I have trouble with the small xml as well.

Try a UniqueNode on the node: "aixm:AirportHeliport", you should have 4232 entries and arrived on the last item (4232) the loop seems to be stuck.

Thank you again.

EDIT: Obviously on "hasMember" node it is also working fine on my side. The problem could come from the namespace when I try to target the node "aixm:AirportHeliport" ?

prewk commented 7 years ago

I think I know why it has problems.. the UniqueNode parser is a bit prone to problems if you've got nodes with the same/similar names in different depths. (aixm:AirportHeliport, aixm:AirportHeliportCollocation)

Try out the StringWalker instead, it seems to work. I chose depth 2 here to represent AIXMBasicMessage > hasMember, you'll have to capture the aixm:AirportHeliport node out of the hasMember in some nifty way. Not all hasMember parent has aixm:AirportHeliport children (they're sometimes aixm:AirportHeliportCollocation).

$stream = new Stream\File("large.xml");
$parser = new Parser\StringWalker(array("captureDepth" => 2));

$streamer = new XmlStringStreamer($parser, $stream);

$counter = 1;

while ($node = $streamer->getNode())
{
    // Help SimpleXML with the namespaces
    $node = str_replace([
        "aixm:",
        "gml:",
        "gmd:",
        "gco:",
        "xlink:",
        "adr:",
        ], [
        "aixm_",
        "gml_",
        "gmd_",
        "gco_",
        "xlink_",
        "adr_",
    ], $node);
    $xml = simplexml_load_string($node);

    if ($xml->aixm_AirportHeliport) {
        echo "$counter " . $xml->aixm_AirportHeliport["gml_id"] . "\n";
    } else {
        echo "$counter\n";
    }

    $counter++;
}
prewk commented 7 years ago

^-After some revisions the code now works. It goes through 530958 hasMember nodes in the 1 GB XML and captures the gml:id attributes on the aixm:AirportHeliport nodes within them.

Note: Only a minority of the nodes in the XML contain a aixm:AirportHeliport node at all. A lot of them seem to contain aixm:AirportHeliportCollocation however.

Adraesh commented 7 years ago

Hey!

So what is the bottom line ? Should I keep my same code?

Thanks.

EDIT: Ok I see your previous code. Why are you replacing the namespace ? I need those namaspace remain the same as the deserialization process it's using it.

prewk commented 7 years ago

The namespace replacements are just a hacky way of getting SimpleXML to be able to read the nodes by themselves. The node is just an XML string, you can do what you want with it, I just wanted to show it was readable :)

Adraesh commented 7 years ago

I just tried with your code and it's working!

So in the mean time the solution for me was:

Using a StringWalker instead of the UniqueNode one.

Thank you very much !

prewk commented 7 years ago

No problem!

I've written an issue regarding the problem.

There is nothing wrong with using the StringWalker instead, it's just different (and a bit slower). It's a bit more resilient.