remuslazar / node-xmlsplit

Split large XML files into smaller chunks, uses Node.js Stream API
MIT License
18 stars 6 forks

Splitting on tags more than 1 level deep confuses xmlsplit #9

Open BertCatsburg opened 3 years ago

BertCatsburg commented 3 years ago

XML File

I have the following small XML file

<?xml version="1.0" encoding="utf-8"?>
<Outer>
    <Inner attr="xxx">
        <A>1</A>
    </Inner>
    <Inner otherattr="yyy">
        <A>2-0</A>
        <A>2-1</A>
        <A>2-2</A>
        <A>2-3</A>
    </Inner>
    <Inner>
        <A>
            <B attr="AA"/>
            <C>
                <D Dattr="Value"/>
            </C>
        </A>
    </Inner>
</Outer>

Program

And the following program:

import fs from 'fs';

const XmlSplit = require('xmlsplit');

const xmlsplit = new XmlSplit(1, 'A'); // Splitting on Tag <A>

const CHUNK_SIZE = 200; // bytes

const xmlfile = 'Test.xml';

async function start() {

    const stream = fs.createReadStream(xmlfile, { highWaterMark: CHUNK_SIZE});
    stream.pipe(xmlsplit).on('data', function(data: any) {
        const xmlDocument = data.toString();
        console.log(xmlDocument);
        console.log('--------------------------------------')
    });
}

start();

Expected output

You would expect a separate XML document for each A tag, either

<Outer>
    <Inner>
        <A>
            ...
        </A>
    </Inner>
</Outer>

or an XML document without the Inner tag.
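For illustration, here is a minimal sketch of how the first expected form could be produced: each matched element is re-wrapped in the opening tags of its own ancestors (attributes included) plus the matching closing tags. The name `splitWithAncestors` is hypothetical and not part of xmlsplit. The sketch works on a complete string rather than a stream, and it assumes there is no `>` inside attribute values and no comments or CDATA sections.

```javascript
// Sketch only: split on <tagName> and wrap every match in its ancestor chain.
function splitWithAncestors(xml, tagName) {
  // Groups: 1 = "/" of a closing tag, 2 = tag name, 3 = attributes, 4 = "/" of a self-closing tag.
  // Declarations such as <?xml ...?> do not match because the name must start with a letter or "_".
  const tagRe = /<(\/?)([A-Za-z_][\w.:-]*)([^>]*?)(\/?)>/g;
  const stack = [];      // currently open elements, with their verbatim opening tags
  const documents = [];
  let start = -1;        // index where the element being captured begins, or -1

  // Wrap xml[start, end) in the ancestors that are open right now.
  const emit = (end) => {
    const open = stack.map((a) => a.raw).join('');
    const close = stack.map((a) => `</${a.name}>`).reverse().join('');
    documents.push(open + xml.slice(start, end) + close);
    start = -1;
  };

  let m;
  while ((m = tagRe.exec(xml)) !== null) {
    const [raw, slash, name, , selfClosing] = m;
    const end = m.index + raw.length;
    if (slash) {
      const top = stack.pop();
      if (top && top.isTarget) emit(end);   // the captured element just closed
    } else {
      // Only the outermost match is captured; nested <tagName> elements stay inside it.
      const isTarget = name === tagName && start === -1;
      if (isTarget) start = m.index;
      if (selfClosing) {
        if (isTarget) emit(end);
      } else {
        stack.push({ name, raw, isTarget });
      }
    }
  }
  return documents;
}

const sample =
  '<?xml version="1.0"?><Outer><Inner attr="xxx"><A>1</A></Inner>' +
  '<Inner otherattr="yyy"><A>2-0</A><A>2-1</A></Inner></Outer>';
console.log(splitWithAncestors(sample, 'A'));
```

With this approach the second and third documents carry `<Inner otherattr="yyy">`, because the wrapper is taken from the elements that are open at the moment of the match and not from the first chunk.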

Actual output

But XmlSplit returns the following:

<?xml version="1.0" encoding="utf-8"?>
<Outer>
    <Inner attr="xxx">
        <A>1</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
    <Inner attr="xxx">

    </Inner>
    <Inner otherattr="yyy">
        <A>2-0</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
    <Inner attr="xxx">

        <A>2-1</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
    <Inner attr="xxx">

        <A>2-2</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
    <Inner attr="xxx">

        <A>2-3</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
    <Inner attr="xxx">

    </Inner>
    <Inner>
        <A>
            <B attr="AA"/>
            <C>
                <D Dattr="Value"/>
            </C>
        </A></Outer>
--------------------------------------

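If you look at the returned output, you can see that the process gets confused in several places: every document starts with <Inner attr="xxx"> from the first match, even when the A element belongs to a different Inner, and none of the documents closes its last Inner element before </Outer>.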
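To make the problem easy to check, here is a small sketch of a tag-balance test that can be run over each emitted document. `findTagMismatches` is a hypothetical helper, not part of xmlsplit, and it is not a real XML parser: it assumes no `>` inside attribute values and ignores comments and CDATA.

```javascript
// Sketch only: report elements that are opened but never closed, and stray closing tags.
function findTagMismatches(xml) {
  // Groups: 1 = "/" of a closing tag, 2 = tag name, 3 = "/" of a self-closing tag.
  const tagRe = /<(\/?)([A-Za-z_][\w.:-]*)[^>]*?(\/?)>/g;
  const open = [];       // names of the elements that are currently open
  const problems = [];
  let m;
  while ((m = tagRe.exec(xml)) !== null) {
    const [, slash, name, selfClosing] = m;
    if (selfClosing) continue;              // <B/> opens and closes itself
    if (!slash) {
      open.push(name);
      continue;
    }
    const idx = open.lastIndexOf(name);
    if (idx === -1) {
      problems.push(`</${name}> has no matching opening tag`);
      continue;
    }
    // Everything opened after the matching tag was left unclosed.
    const skipped = open.splice(idx).slice(1).reverse();
    for (const unclosed of skipped) problems.push(`<${unclosed}> is never closed`);
  }
  for (const name of open.reverse()) problems.push(`<${name}> is never closed`);
  return problems;
}

// First document from the output above, with whitespace removed.
const firstDocument =
  '<?xml version="1.0" encoding="utf-8"?><Outer><Inner attr="xxx"><A>1</A></Outer>';
console.log(findTagMismatches(firstDocument));
// → [ '<Inner> is never closed' ]
```

An empty array means every opening tag found a matching closing tag, which is the minimum each split document should satisfy.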

QAnders commented 2 months ago

Old question, and this repo is most likely not maintained anymore, but we actually ran into this error the other day, so I started Googling and found this...

I did a "dirty" fix as we don't really care about the nested elements, just to get it split up.

The problem is that the first dataChunk (index = 0) will retain the "parent", e.g.:

<Inner attr="xxx">
    <A>1</A>

While the following dataChunk parts will be "clean":

<A>1</A>

When this is pieced together again it'll include <Inner attr="xxx"> from the first dataChunk but never close it. As we don't care about the <Inner attr="xxx"> element I just added a fix on the next line of this: https://github.com/remuslazar/node-xmlsplit/blob/7a7e081c226ebe0577b35743fd40b22064f621ee/lib/xmlsplit.js#L83

By:

        dataChunks.forEach(function (data, index) {
          const tagChk = new RegExp(`^<${this._tagName}[\\S|>]{1}`);
          if (tagChk.test(data)) {
            // eslint-disable-next-line no-param-reassign
            data = data.slice(data.match(tagChk).at(0)?.length);
          }
          dataChunk += data;

This strips the Inner element from the first dataChunk, making the resulting XML valid.
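The idea described above, dropping whatever precedes the first split tag in a chunk, can also be sketched in isolation. `stripLeadingParent` is a hypothetical helper; it is not the code from the patch and not part of xmlsplit. It only assumes that a chunk is a string and that the tag name is known.

```javascript
// Sketch only: remove everything in front of the first <tagName ...> in a chunk.
function stripLeadingParent(chunk, tagName) {
  // Escape regex metacharacters, since XML names may contain "." and "-".
  const escaped = tagName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // The lookahead requires whitespace, "/" or ">" after the name,
  // so <A attr="x"> and <A/> match but <Abc> does not.
  const tagStart = new RegExp(`<${escaped}(?=[\\s/>])`);
  const match = tagStart.exec(chunk);
  // Chunks without the tag are returned unchanged.
  return match ? chunk.slice(match.index) : chunk;
}

console.log(stripLeadingParent('<Inner attr="xxx">\n    <A>1</A>', 'A'));
// → <A>1</A>
```

Note that this variant throws the parent element away, so the resulting documents correspond to the second expected form (no Inner tag), not the first.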

remuslazar commented 2 months ago

@QAnders looks good to me. Could you please open a pull request with the changes above? Thanks!

QAnders commented 2 months ago

Sure, I can do that, @remuslazar, but then I need write access to the repo... :)

I've reworked it a bit so that it does include the element as well now...

remuslazar commented 2 months ago

@QAnders you can fork this repo and create the PR which I can then merge later on (having write access).

QAnders commented 2 months ago

PR open @remuslazar https://github.com/remuslazar/node-xmlsplit/pull/10