prewk / xml-string-streamer

Stream large XML files with low memory consumption.
MIT License

Unique node from resource #42

Closed ronnievisser closed 8 years ago

ronnievisser commented 8 years ago

Hi,

I have a file which has to be parsed in pieces. Currently I load it from the file URL every time, so the whole file has to be loaded again for each pass.

Is it possible to load from a resource? Looking at the code this seems to be possible, but when I use Stream\File it tells me "expects parameter 1 to be resource, object given".

How can I load from resource???

prewk commented 8 years ago

Hi!

  1. What requirements do you have? When you say that it has to be parsed in pieces, and you use a file url, it sounds like you're loading a large file over HTTP from a web url, is this correct?
  2. What are you passing to Stream\File exactly? If it says "object given" it means you're not actually passing a resource.
ronnievisser commented 8 years ago

I am parsing CAMT053 with it. A CAMT053 file has an Ntry node which in my case can have over 10000 NtryDtls child nodes. The Ntry node has information about totals and the NtryDtls about transactions. I need info from both the Ntry and the NtryDtls nodes so I use createUniqueNodeParser functionality.

I use the following code to parse NtryDtls:

$streamer = XmlStringStreamer::createUniqueNodeParser( $event->localFilename, [ 'uniqueNode' => 'NtryDtls' ] );

and to parse the balance info I use:

$streamerBal = XmlStringStreamer::createUniqueNodeParser( $event->localFilename, [ 'uniqueNode' => 'Bal' ] );

If I load the complete Ntry node, loading it with simplexml_load_string consumes over 700 MB.

I tried to use Stream\File like this, but it fails when I pass it to the unique node method: Stream\File( $event->localFilename ); I was hoping I could pass this instead of the URL string to the unique node method.

prewk commented 8 years ago

You have the actual XML in a file, yes?

I successfully parsed NtryDtls from a sample CAMT053 file I found, like this:

<?php
require_once("vendor/autoload.php");

$streamer = Prewk\XmlStringStreamer::createUniqueNodeParser("./FI_camt_053_sample.xml", array("uniqueNode" => "NtryDtls"));

while ($node = $streamer->getNode()) {
    // This XML node will just be the NtryDtls and it won't eat up your memory because it is overwritten on every iteration
    $xml = simplexml_load_string($node);

    // Access the node as usual
    foreach ($xml->TxDtls as $TxDtl) {
        if (isset($TxDtl->AmtDtls->TxAmt->Amt)) {
            echo "Amount: " . (string)$TxDtl->AmtDtls->TxAmt->Amt . "\n";
        }
    }
}

The same goes for Bal:

<?php

require_once("vendor/autoload.php");

$streamer = Prewk\XmlStringStreamer::createUniqueNodeParser("./FI_camt_053_sample.xml", array("uniqueNode" => "Bal"));

while ($node = $streamer->getNode()) {
    // This XML node will just be the Bals and it won't eat up your memory because it is overwritten on every iteration
    $xml = simplexml_load_string($node);

    // Access the node as usual
    echo "Amount: " . (string)$xml->Amt . "\n";
}
ronnievisser commented 8 years ago

That is exactly how I do it, but I thought maybe I could re-use the resource from the first parser so it doesn't have to load the 30 MB file twice...

ronnievisser commented 8 years ago

How would you approach the following?

In my case I need the BookgDt and the ValDt from the Ntry node. In my file the Ntry node has about 10000 NtryDtls nodes (with all their children). If I load the Ntry with UniqueNode it consumes 700 MB.

Your file has 1 NtryDtls per Ntry.

prewk commented 8 years ago

> That is exactly how I do it, but I thought maybe I could re-use the resource from the first parser so it doesn't have to load the 30 MB file twice...

Oh okay. I don't think there is any efficiency in re-using it, really. The CPU cycles will go into parsing anyway, and the memory consumption is low.

If the XML file is 30 MB it won't actually use 30 MB of memory; it streams the file bit by bit and forgets the previous iteration on the next one.
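To illustrate what reading "bit by bit" means, here is a minimal sketch that uses only PHP built-ins. It is not this library's internal code: the helper name `countBytesInChunks` and the 1024-byte chunk size are assumptions made for the example. It reads from a handle one chunk at a time, so only a single chunk is held in memory no matter how large the file is.

```php
<?php
// Hypothetical helper (not part of XmlStringStreamer): reads a stream in
// fixed-size chunks and returns the total number of bytes seen.
function countBytesInChunks($handle, int $chunkSize = 1024): int
{
    $total = 0;
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break; // read error, stop early
        }
        // $chunk is overwritten on the next iteration, so memory stays flat.
        $total += strlen($chunk);
    }
    return $total;
}

// Usage: an in-memory temp stream stands in for a large XML file.
$handle = fopen('php://temp', 'w+b');
fwrite($handle, str_repeat('<Bal><Amt>1</Amt></Bal>', 1000));
rewind($handle); // move back to the start before reading
echo countBytesInChunks($handle), "\n";
fclose($handle);
```

On a seekable stream, `rewind()` is also what lets an already opened handle be read a second time.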

> How would you approach the following? In my case I need the BookgDt and the ValDt from the Ntry node. In my file the Ntry node has about 10000 NtryDtls nodes (with all their children). If I load the Ntry with UniqueNode it consumes 700 MB.

Hm, that's weird. It's been a while but I'm pretty sure the UniqueNode parser by itself won't start saving anything at all before finding the first node, ergo it would only keep at most one node in memory at a time.

Are you doing anything in the while loop that saves data to the outside? Like, saves the whole node or something. Can you show me the code in the while loop that you're using?

prewk commented 8 years ago

Try this (Note: Using the StringWalker instead of UniqueNode):

<?php
require_once("vendor/autoload.php");

$streamer = Prewk\XmlStringStreamer::createStringWalkerParser("./FI_camt_053_sample.xml", array("captureDepth" => 4, "expectGT" => true));

while ($node = $streamer->getNode()) {
    $xml = simplexml_load_string($node);
    $nodeName = $xml->getName();
    if ($nodeName === "Ntry") {
        // Do something with Ntry
        echo "Ntry->Amt: " . (string)$xml->Amt . "\n";
    } else if ($nodeName === "Bal") {
        // Do something with Bal
        echo "Bal->Amt: " . (string)$xml->Amt . "\n";
    }
}
prewk commented 8 years ago

If you don't expect any XML comments containing tags in the XML, you can skip the expectGT option and you might save some CPU cycles. The example file I linked earlier is full of examples such as <!-- Recommendation: Use <Foo>blabla</Foo> --> etc., so I needed it.

ronnievisser commented 8 years ago

When I check memory before and after the simplexml_load_string call, I see 8 MB before and 713 MB after.

Since the Ntry node has so many children and all of it is loaded into a simplexml object, I believe that is why it consumes so much memory. Parsing it in separate pieces takes only 35 MB for the same process.

prewk commented 8 years ago

Since this is a format I'm unfamiliar with, I can only go on the linked example XML, and it looks like this:

            <Ntry>
                <!-- Transaction 1 as an sample of SALA batch  with elements filled both for PMJ-salaries as well as SCT SALA-->
                <!-- Here only as collection, since in salaries the payment level details are not reported -->
                <Amt Ccy="EUR">1000.12</Amt>
                <CdtDbtInd>DBIT</CdtDbtInd>
                <Sts>BOOK</Sts>
                <BookgDt>
                    <Dt>2009-10-29</Dt>
                </BookgDt>
                <ValDt>
                    <Dt>2009-10-29</Dt>
                </ValDt>
                <!-- In case of separate Salary debit report (camt.054) is generated the banks' reference has to be in it as one matching term-->
                <AcctSvcrRef>091029ACCTSTMTARCH01</AcctSvcrRef>
                <BkTxCd>
                    <!-- In case of PMJ salaries as in the sample.  In case of SCT SALA PMNT/ICDT/ESCT + PurposeCode SALA) -->
                    <Domn>
                        <Cd>PMNT</Cd>
                        <Fmly>
                            <Cd>ICDT</Cd>
                            <SubFmlyCd>SALA</SubFmlyCd>
                        </Fmly>
                    </Domn>
                    <!-- Prtry used only in case of PMJ-salaries -->
                    <Prtry>
                        <Cd>NTRF+701TransactionCodeText</Cd>
                    </Prtry>
                </BkTxCd>
                <NtryDtls>
                    <Btch>
                        <!-- customer made batch and message-references (not in old TS but yes in SALA SCT in case that pain.001 is used and direct corresponding matching can be found).  Purpose: Reconciiation-->
                        <!-- Basic recommendation: as much as possible of the original payment instruction material that came from the customer into the bank-->
                        <MsgId>MSGSALA0001</MsgId>
                        <!-- in LM-batches this is an info given in the batch record and supported by most of the banks as the initiator batch level identification-->
                        <PmtInfId>CustRefForSalaBatch</PmtInfId>
                        <!-- customer made batch's transaction total by the initiated material.  Purpose: Reconciiation-->
                        <NbOfTxs>4</NbOfTxs>
                    </Btch>
                    <TxDtls>
                        <!-- used to specify what subtype (purpose code) of SCT SALA (category purpose and notice that tx code in this case is PMNT/ICDT/ESCT) debtor has used.  Not so critical on debtor stmts but on creditor it  is -->
                        <Purp>
                            <Cd>SALA</Cd>
                        </Purp>
                    </TxDtls>
                </NtryDtls>
            </Ntry>

Are you saying that your Ntry nodes have a lot more children than that node?

Or are you saying that you are trying to simplexml_load_string the whole document? The point of XmlStringStreamer is to allow you to simplexml_load_string every node individually and then forget about it in the next iteration. That's why the memory consumption is low.

If you, however, save every simplexml object into an array outside of your while loop on every iteration or something, you will lose the benefit. Is this what you're doing?
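A minimal sketch of the difference, using only PHP built-ins. The generator `fakeNodes` is a hypothetical stand-in for calling `$streamer->getNode()` in a loop, and the tiny NtryDtls strings are made up for the example. The idea is to copy out only the scalar values you need and let the parsed object go out of scope on each iteration.

```php
<?php
// Hypothetical stand-in for the streamer: yields one small XML node string
// at a time, the way getNode() would.
function fakeNodes(int $count): Generator
{
    for ($i = 1; $i <= $count; $i++) {
        yield "<NtryDtls><TxDtls><Amt>{$i}.00</Amt></TxDtls></NtryDtls>";
    }
}

$amounts = [];
foreach (fakeNodes(3) as $node) {
    $xml = simplexml_load_string($node);

    // Cast to string so only the scalar is kept. Appending $xml itself to an
    // array would keep every parsed tree alive after the loop moves on.
    $amounts[] = (string) $xml->TxDtls->Amt;
}

print_r($amounts);
```

Here `$xml` is overwritten on every pass, so at most one parsed node exists at a time, while `$amounts` grows only by a short string per node.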

It's hard to make assumptions when I don't have your XML or code to look at (I understand the XMLs are sensitive information), but using the example CAMT053 XML I don't see why it would chew up 700 MB of memory, really.

ronnievisser commented 8 years ago

            <Ntry>
                <!-- Transaction 1 as an sample of SALA batch  with elements filled both for PMJ-salaries as well as SCT SALA-->
                <!-- Here only as collection, since in salaries the payment level details are not reported -->
                <Amt Ccy="EUR">1000.12</Amt>
                <CdtDbtInd>DBIT</CdtDbtInd>
                <Sts>BOOK</Sts>
                <BookgDt>
                    <Dt>2009-10-29</Dt>
                </BookgDt>
                <ValDt>
                    <Dt>2009-10-29</Dt>
                </ValDt>
                <!-- In case of separate Salary debit report (camt.054) is generated the banks' reference has to be in it as one matching term-->
                <AcctSvcrRef>091029ACCTSTMTARCH01</AcctSvcrRef>
                <BkTxCd>
                    <!-- In case of PMJ salaries as in the sample.  In case of SCT SALA PMNT/ICDT/ESCT + PurposeCode SALA) -->
                    <Domn>
                        <Cd>PMNT</Cd>
                        <Fmly>
                            <Cd>ICDT</Cd>
                            <SubFmlyCd>SALA</SubFmlyCd>
                        </Fmly>
                    </Domn>
                    <!-- Prtry used only in case of PMJ-salaries -->
                    <Prtry>
                        <Cd>NTRF+701TransactionCodeText</Cd>
                    </Prtry>
                </BkTxCd>
                <NtryDtls>
                    <Btch>
                        <!-- customer made batch and message-references (not in old TS but yes in SALA SCT in case that pain.001 is used and direct corresponding matching can be found).  Purpose: Reconciiation-->
                        <!-- Basic recommendation: as much as possible of the original payment instruction material that came from the customer into the bank-->
                        <MsgId>MSGSALA0001</MsgId>
                        <!-- in LM-batches this is an info given in the batch record and supported by most of the banks as the initiator batch level identification-->
                        <PmtInfId>CustRefForSalaBatch</PmtInfId>
                        <!-- customer made batch's transaction total by the initiated material.  Purpose: Reconciiation-->
                        <NbOfTxs>4</NbOfTxs>
                    </Btch>
                    <TxDtls>
                        <!-- used to specify what subtype (purpose code) of SCT SALA (category purpose and notice that tx code in this case is PMNT/ICDT/ESCT) debtor has used.  Not so critical on debtor stmts but on creditor it  is -->
                        <Purp>
                            <Cd>SALA</Cd>
                        </Purp>
                    </TxDtls>
                </NtryDtls>
                <NtryDtls>
                    <Btch>
                        <!-- customer made batch and message-references (not in old TS but yes in SALA SCT in case that pain.001 is used and direct corresponding matching can be found).  Purpose: Reconciiation-->
                        <!-- Basic recommendation: as much as possible of the original payment instruction material that came from the customer into the bank-->
                        <MsgId>MSGSALA0001</MsgId>
                        <!-- in LM-batches this is an info given in the batch record and supported by most of the banks as the initiator batch level identification-->
                        <PmtInfId>CustRefForSalaBatch</PmtInfId>
                        <!-- customer made batch's transaction total by the initiated material.  Purpose: Reconciiation-->
                        <NbOfTxs>4</NbOfTxs>
                    </Btch>
                    <TxDtls>
                        <!-- used to specify what subtype (purpose code) of SCT SALA (category purpose and notice that tx code in this case is PMNT/ICDT/ESCT) debtor has used.  Not so critical on debtor stmts but on creditor it  is -->
                        <Purp>
                            <Cd>SALA</Cd>
                        </Purp>
                    </TxDtls>
                </NtryDtls>
                <NtryDtls>
                    <Btch>
                        <!-- customer made batch and message-references (not in old TS but yes in SALA SCT in case that pain.001 is used and direct corresponding matching can be found).  Purpose: Reconciiation-->
                        <!-- Basic recommendation: as much as possible of the original payment instruction material that came from the customer into the bank-->
                        <MsgId>MSGSALA0001</MsgId>
                        <!-- in LM-batches this is an info given in the batch record and supported by most of the banks as the initiator batch level identification-->
                        <PmtInfId>CustRefForSalaBatch</PmtInfId>
                        <!-- customer made batch's transaction total by the initiated material.  Purpose: Reconciiation-->
                        <NbOfTxs>4</NbOfTxs>
                    </Btch>
                    <TxDtls>
                        <!-- used to specify what subtype (purpose code) of SCT SALA (category purpose and notice that tx code in this case is PMNT/ICDT/ESCT) debtor has used.  Not so critical on debtor stmts but on creditor it  is -->
                        <Purp>
                            <Cd>SALA</Cd>
                        </Purp>
                    </TxDtls>
                </NtryDtls>
                <NtryDtls>
                    <Btch>
                        <!-- customer made batch and message-references (not in old TS but yes in SALA SCT in case that pain.001 is used and direct corresponding matching can be found).  Purpose: Reconciiation-->
                        <!-- Basic recommendation: as much as possible of the original payment instruction material that came from the customer into the bank-->
                        <MsgId>MSGSALA0001</MsgId>
                        <!-- in LM-batches this is an info given in the batch record and supported by most of the banks as the initiator batch level identification-->
                        <PmtInfId>CustRefForSalaBatch</PmtInfId>
                        <!-- customer made batch's transaction total by the initiated material.  Purpose: Reconciiation-->
                        <NbOfTxs>4</NbOfTxs>
                    </Btch>
                    <TxDtls>
                        <!-- used to specify what subtype (purpose code) of SCT SALA (category purpose and notice that tx code in this case is PMNT/ICDT/ESCT) debtor has used.  Not so critical on debtor stmts but on creditor it  is -->
                        <Purp>
                            <Cd>SALA</Cd>
                        </Purp>
                    </TxDtls>
                </NtryDtls>
            </Ntry>

This is what my XML looks like. I have over 1000 of those NtryDtls nodes. I only load the simplexml object inside the while loop.

prewk commented 8 years ago

Alright, and you got the memory issue using createStringWalkerParser (with expectGT set to true as in my example above)?

ronnievisser commented 8 years ago

No, using createUniqueNodeParser.

prewk commented 8 years ago

See my example above where I'm using the StringWalker, please. It might solve your problem.

frederikbosch commented 8 years ago

@RonnieVisser What solution did you come up with? Did you use this library? Maybe I can try to add it to our CAMT library. We also see problems when the XML gets bigger.

prewk commented 8 years ago

@frederikbosch If you can provide me with an XML that uses a lot of memory I can probably provide a solution using this library.

frederikbosch commented 8 years ago

@prewk Can you tell me what this library offers over the XMLReader extension?

prewk commented 8 years ago

Probably mostly ease of use. It could be argued that XMLReader needs a fairly specific implementation for every different XML document, whereas you just feed my library an XML file and it usually "just works".

Although, it must be said, I have very little experience with the XMLReader extension, so my lack of insight may cause a bias.
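For comparison, a rough sketch of what a document-specific XMLReader implementation can look like for the case discussed in this thread: reading BookgDt and ValDt from each Ntry without building the NtryDtls children in memory. This is not how XmlStringStreamer works internally. The function name `readEntryDates` and the sample XML are assumptions made for the example, and namespaces are ignored.

```php
<?php
// Hypothetical helper: collects BookgDt/ValDt per Ntry using PHP's built-in
// XMLReader, skipping every NtryDtls subtree.
function readEntryDates(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $entries = [];
    $current = null;

    $more = $reader->read();
    while ($more) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            if ($reader->name === 'Ntry') {
                $current = ['BookgDt' => null, 'ValDt' => null];
            } elseif ($reader->name === 'NtryDtls') {
                // next() jumps over the whole subtree; the cursor is already
                // on the following node, so do not call read() again here.
                $more = $reader->next();
                continue;
            } elseif ($current !== null
                && ($reader->name === 'BookgDt' || $reader->name === 'ValDt')) {
                // readString() returns the text of the element and its children.
                $current[$reader->name] = trim($reader->readString());
            }
        } elseif ($reader->nodeType === XMLReader::END_ELEMENT
            && $reader->name === 'Ntry') {
            $entries[] = $current;
            $current = null;
        }
        $more = $reader->read();
    }

    $reader->close();
    return $entries;
}

// Usage with a made-up, heavily shortened statement.
$sample = '<Stmt><Ntry>'
    . '<BookgDt><Dt>2009-10-29</Dt></BookgDt>'
    . '<ValDt><Dt>2009-10-30</Dt></ValDt>'
    . '<NtryDtls><TxDtls><Purp><Cd>SALA</Cd></Purp></TxDtls></NtryDtls>'
    . '</Ntry></Stmt>';
print_r(readEntryDates($sample));
```

The element names are hard-coded into the loop, which is the kind of per-document work the comment above refers to.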

ronnievisser commented 8 years ago

I have fixed my issues with createStringWalkerParser. No matter the size of the XML, it uses no more than 6 MB :):)

frederikbosch commented 8 years ago

@RonnieVisser Sounds good. Will investigate how to use it within the CAMT package.