pladaria / xml-reader

Javascript XML Reader and Parser
MIT License
31 stars, 3 forks

stream: true not working? #11

Open coaperator opened 2 years ago

coaperator commented 2 years ago

Hello, I am trying to process a 100 MB XML file (about 900 thousand lines) in streaming mode:

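  const fs = require('fs')
  const XmlReader = require('xml-reader')

  // Accumulators filled in by the tag handlers below
  const objShop = {}
  const objPrice = []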

  const file = fs.createReadStream(
    'public/prices/middle-price_list.yml',
    'utf8'
  )

  const parser = XmlReader.create({ stream: true })

  file.on('data', (chunk) => {
    parser.parse(chunk)
  })

  // Tried another option from the documentation
  // file.on('data', (chunk) => {
  //   chunk.split('').forEach((char) => parser.parse(char))
  // })

  // The root element of the feed is <yml_catalog>
  parser.on('tag:yml_catalog', (data) => {
    objShop.date = data.attributes.date
  })

  parser.on('tag:shop', (data) => {
    objShop.name = data.children.filter(
      (el) => el.name == 'name'
    )[0]?.children[0]?.value

    objShop.company = data.children.filter(
      (el) => el.name == 'company'
    )[0]?.children[0]?.value

    objShop.url = data.children.filter(
      (el) => el.name == 'url'
    )[0]?.children[0]?.value
  })

  parser.on('tag:offer', (data) => {
    objPrice.push({
      title: data.children.filter((el) => el.name == 'name')[0]?.children[0]
        ?.value,
      url: data.children.filter((el) => el.name == 'url')[0]?.children[0]
        ?.value,
      picture: data.children.filter((el) => el.name == 'picture')[0]
        ?.children[0]?.value,
    })
  })

  parser.on('done', (data) => {
    console.log('objShop', objShop)
    console.log('objPrice', objPrice.length)
  })

The file is processed correctly, but memory usage climbs above 1.5 GB during processing. What am I doing wrong?
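
To put numbers on the memory growth described above, one option is to log `process.memoryUsage()` at intervals while the file streams. This is only a minimal sketch: `memorySnapshot` and `streamWithMemoryLog` are hypothetical helper names, not part of xml-reader, and the logging interval is an arbitrary choice.

```javascript
const fs = require('fs')

// Hypothetical helper: current resident set size and used heap, in megabytes.
function memorySnapshot() {
  const { rss, heapUsed } = process.memoryUsage()
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024)
  return { rssMb: toMb(rss), heapUsedMb: toMb(heapUsed) }
}

// Hypothetical helper: stream a file, hand every chunk to onChunk,
// and log a memory snapshot every `every` chunks.
function streamWithMemoryLog(path, onChunk, every = 100) {
  return new Promise((resolve, reject) => {
    let chunks = 0
    const file = fs.createReadStream(path, 'utf8')
    file.on('data', (chunk) => {
      onChunk(chunk)
      chunks += 1
      if (chunks % every === 0) {
        console.log('chunk', chunks, memorySnapshot())
      }
    })
    file.on('end', () => resolve(chunks))
    file.on('error', reject)
  })
}
```

Running it once with `(chunk) => parser.parse(chunk)` and once with a no-op callback such as `() => {}` should help separate memory held by the parser from memory used by reading the file itself.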

pladaria commented 2 years ago

Hi, can you provide me with a link to that xml? (shared in drive or similar)


coaperator commented 2 years ago

> Hi, can you provide me with a link to that xml? (shared in drive or similar)

I can't give you the exact file I used, but you can take this sample and repeat the offer block 80-100 thousand times in a loop to get a large file:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE yml_catalog SYSTEM "shops.dtd">
<yml_catalog date="2022-05-14 17:17">
  <shop>
    <name>Test</name>
    <company>Test</company>
    <url>https://test.com/</url>
    <currencies>
      <currency id="USD" rate="1"></currency>
    </currencies>
    <delivery-options>
      <option cost="390" days="0-1"></option>
    </delivery-options>
    <categories>
      <category id="1">Test category</category>
    </categories>
    <offers>
      <offer id="832599" available="true">
        <url>https://test.com/intel_core_i9-11900_832599.html</url>
        <price>37677</price>
        <currencyId>USD</currencyId>
        <categoryId>1</categoryId>
        <picture>https://img.test.com/82/b4/6065c6482b404261875729_500.jpg</picture>
        <pickup>true</pickup>
        <delivery>true</delivery>
        <delivery-options>
          <option cost="390" days="0" order-before="11"></option>
        </delivery-options>
        <name>Intel Core i9-11900 CM8070804488245 Rocket Lake 8C/16T 2.5-5.3GHz</name>
        <vendor>Intel</vendor>
        <model>Core i9-11900</model>
        <vendorCode>CM8070804488245</vendorCode>
        <description>Rocket Lake 8C/16T 2.5-5.3GHz (LGA1200, L3 16MB, 14nm, UHD Graphics 750 1.3GHz, 65W)</description>
        <sales_notes>test</sales_notes>
        <manufacturer_warranty>true</manufacturer_warranty>
        <cpa>1</cpa>
        <weight>0.1</weight>
      </offer>
    </offers>
  </shop>
</yml_catalog>
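
The "repeat the offer block in a loop" step could be scripted roughly as follows. This is a sketch under assumptions: `buildOffer` and `writeLargeXml` are hypothetical names, the DOCTYPE line is omitted, and the `<shop>` metadata and each `<offer>` are trimmed compared with the sample above, so a given offer count produces a smaller file than the real feed would.

```javascript
const fs = require('fs')

// Header and footer follow the sample's structure (trimmed for brevity).
const HEADER = `<?xml version="1.0" encoding="utf-8" ?>
<yml_catalog date="2022-05-14 17:17">
  <shop>
    <name>Test</name>
    <company>Test</company>
    <url>https://test.com/</url>
    <offers>
`
const FOOTER = `    </offers>
  </shop>
</yml_catalog>
`

// Hypothetical helper: one <offer> block with a varying id.
function buildOffer(id) {
  return `      <offer id="${id}" available="true">
        <url>https://test.com/offer_${id}.html</url>
        <price>37677</price>
        <picture>https://img.test.com/${id}.jpg</picture>
        <name>Test offer ${id}</name>
      </offer>
`
}

// Write the header, `count` offers, then the footer.
// Synchronous writes keep the sketch short and avoid backpressure handling.
function writeLargeXml(path, count) {
  const fd = fs.openSync(path, 'w')
  try {
    fs.writeSync(fd, HEADER)
    for (let i = 0; i < count; i++) {
      fs.writeSync(fd, buildOffer(i + 1))
    }
    fs.writeSync(fd, FOOTER)
  } finally {
    fs.closeSync(fd)
  }
}
```

For example, `writeLargeXml('large-price_list.yml', 100000)` writes one hundred thousand offers; the file name is a placeholder.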

lucasrendo commented 1 year ago

Hi, I'm dealing with the same issue. I have a 740 MB XML file (confidential information) and I'm getting "FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory".

This is my parser code: [image]

I have checked the byte size of every variable, and all of them are correctly garbage collected after each iteration. I've also commented out parts of the code in case something other than the parser was triggering the error, but the code only runs when I comment out the parser entirely, meaning the call to parser.parse.

I've tried looking at the source code of both the reader and the lexer, but I can't figure out what's causing the issue.
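
When chasing a "JavaScript heap out of memory" error like the one quoted above, it can help to see how close the process is to V8's heap limit. A minimal sketch using Node's built-in `v8` module follows; `heapReport` is a hypothetical helper name.

```javascript
const v8 = require('v8')

// Hypothetical helper: V8 heap limit and current heap use, in megabytes.
function heapReport() {
  const stats = v8.getHeapStatistics()
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024)
  return {
    limitMb: toMb(stats.heap_size_limit),
    usedMb: toMb(stats.used_heap_size),
  }
}

console.log(heapReport())
```

Starting Node with `--max-old-space-size=4096` (the value is in megabytes) raises the limit, but that only postpones the crash if the parser keeps holding on to nodes, so it is a diagnostic aid rather than a fix.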