juanjoDiaz / json2csv

Flexible conversion between JSON and CSV
https://juanjodiaz.github.io/json2csv/
MIT License

Out-of-memory-error #38

Open pupixipup opened 1 year ago

pupixipup commented 1 year ago

I am facing a weird issue when parsing JSON to CSV in an AWS Lambda.

It works perfectly with smaller files (up to 20 MB), but fails on 50 MB+ files, even though 800 MB+ of RAM is dedicated to the function.

The code consists of a few things: receiving an HTTP stream -> piping it through the JSON-to-CSV transform -> resolving the promise. It looks very simple, but crashes with an out-of-memory error. Here are the details:

import http from 'http';
import https from 'https';
import { pipeline } from 'stream';
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
import { isHttps } from './utils';
import { Transform as CsvTransform } from '@json2csv/node';
import flatten from './parsers/flatten';

// Function utilizes streams to read data from a url, transform it, and upload it to s3

export const lambdaHandler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
    try {
        const data = typeof event === 'string' ? JSON.parse(event) : event;
        const url = data.message.extsrc;
        // const body = await streamResponse(url);
        const protocol = isHttps(url) ? https : http;
        const body = await new Promise((resolve, reject) => {
            protocol.get(url, (networkStream: http.IncomingMessage) => {
                pipeline(
                    networkStream,
                    new CsvTransform(
                        {
                            transforms: [flatten({ objects: true, arrays: true, separator: '__' })],
                        },
                        { objectMode: true },
                    ),
                    (error: any) => {
                        if (error) {
                            reject(error);
                        }
                        resolve('success');
                    },
                );
            });
        });
        const response = {
            statusCode: 200,
            body: JSON.stringify(body),
        };
        return response;
    } catch (err) {
        console.log(err);
        return {
            statusCode: 500,
            body: JSON.stringify({
                message: err,
            }),
        };
    }
};

4. JSON data: https://data.wa.gov/api/views/f6w7-q2d2/rows.json?accessType=DOWNLOAD *OR* https://raw.githubusercontent.com/seductiveapps/largeJSON/master/100mb.json (both fail).
5. Output/error: 

<--- Last few GCs --->

[14:0x55dad147c790] 13746 ms: Mark-sweep (reduce) 713.6 (727.9) -> 713.6 (727.9) MB, 485.7 / 0.0 ms (average mu = 0.153, current mu = 0.001) last resort; GC in old space requested
[14:0x55dad147c790] 14237 ms: Mark-sweep (reduce) 713.6 (727.9) -> 713.6 (727.9) MB, 490.8 / 0.0 ms (average mu = 0.080, current mu = 0.000) last resort; GC in old space requested

<--- JS stacktrace --->

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 0x55dacc879103 node::Abort() [/var/lang/bin/node]
 2: 0x55dacc74df95 [/var/lang/bin/node]
 3: 0x55daccab34bd v8::Utils::ReportOOMFailure(v8::internal::Isolate, char const, bool) [/var/lang/bin/node]
 4: 0x55daccab37c2 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate, char const, bool) [/var/lang/bin/node]
 5: 0x55dacccd6542 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/var/lang/bin/node]
 6: 0x55dacccb024e v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/var/lang/bin/node]
 7: 0x55daccca6b7b v8::internal::FactoryBase::AllocateRawArray(int, v8::internal::AllocationType) [/var/lang/bin/node]
 8: 0x55daccca6d43 v8::internal::FactoryBase::NewFixedArrayWithFiller(v8::internal::Handle, int, v8::internal::Handle, v8::internal::AllocationType) [/var/lang/bin/node]
 9: 0x55daccfd5760 v8::internal::Handle v8::internal::HashTable<v8::internal::NameDictionary, v8::internal::NameDictionaryShape>::NewInternal(v8::internal::Isolate, int, v8::internal::AllocationType) [/var/lang/bin/node]
10: 0x55daccfd5c24 v8::internal::Handle v8::internal::HashTable<v8::internal::NameDictionary, v8::internal::NameDictionaryShape>::EnsureCapacity(v8::internal::Isolate, v8::internal::Handle, int, v8::internal::AllocationType) [/var/lang/bin/node]
11: 0x55daccfd5c92 v8::internal::Handle v8::internal::Dictionary<v8::internal::NameDictionary, v8::internal::NameDictionaryShape>::Add(v8::internal::Isolate, v8::internal::Handle, v8::internal::Handle, v8::internal::Handle, v8::internal::PropertyDetails, v8::internal::InternalIndex) [/var/lang/bin/node]
12: 0x55daccfd61e9 v8::internal::BaseNameDictionary<v8::internal::NameDictionary, v8::internal::NameDictionaryShape>::Add(v8::internal::Isolate, v8::internal::Handle, v8::internal::Handle, v8::internal::Handle, v8::internal::PropertyDetails, v8::internal::InternalIndex) [/var/lang/bin/node]
13: 0x55daccf8e33c v8::internal::LookupIterator::ApplyTransitionToDataProperty(v8::internal::Handle) [/var/lang/bin/node]
14: 0x55daccfa9c8c v8::internal::Object::TransitionAndWriteDataProperty(v8::internal::LookupIterator, v8::internal::Handle, v8::internal::PropertyAttributes, v8::Maybe, v8::internal::StoreOrigin) [/var/lang/bin/node]
15: 0x55dacd12c365 v8::internal::Runtime::SetObjectProperty(v8::internal::Isolate, v8::internal::Handle, v8::internal::Handle, v8::internal::Handle, v8::internal::StoreOrigin, v8::Maybe) [/var/lang/bin/node]
16: 0x55dacd12dc7a v8::internal::Runtime_SetKeyedProperty(int, unsigned long, v8::internal::Isolate) [/var/lang/bin/node]
17: 0x55dacd5ff639 [/var/lang/bin/node]
END RequestId: 4a31198e-c28f-4f01-aaf0-2be6e059d8af
REPORT RequestId: 4a31198e-c28f-4f01-aaf0-2be6e059d8af Init Duration: 0.13 ms Duration: 14365.91 ms Billed Duration: 14366 ms Memory Size: 812 MB Max Memory Used: 812 MB

juanjoDiaz commented 1 year ago

Hi @pupixipup,

This is odd indeed. I know of people processing gigabytes with this library...

Does this happen only in the lambda, or also locally? Does it happen only when using transforms, or always?

Isn't networkStream a binary stream? Why are you using { objectMode: true }?

pupixipup commented 1 year ago

Hi, thanks for the quick reply!

  1. It works fine locally because the default memory limit is ~1.5 GB. However, once I run the file with node --max-old-space-size=512 main.js, I get the very same out-of-memory error, even though the file I am trying to parse is only 100 MB. This is the source code I tried locally (it is similar to the lambda one I showed before):
const http = require('http');
const https = require('https');
const { pipeline } = require('stream');
const { isHttps } = require('./utils');

const fs = require('fs');
const { Transform } = require('@json2csv/node');
const { flatten } = require('@json2csv/transforms');
const mb100 = 'https://raw.githubusercontent.com/seductiveapps/largeJSON/master/100mb.json';

const pseudoLambdaHandler = async () => {
    try {
        const url = mb100;
        const protocol = isHttps(url) ? https : http;
        const body = await new Promise((resolve, reject) => {
            protocol.get(url, (networkStream) => {
                pipeline(
                    networkStream,
                    new Transform(),
                    fs.createWriteStream('mb100.csv'),
                    (error) => {
                        if (error) {
                            reject(error);
                        }
                        resolve('success');
                    },
                );
            });
        });
        const response = {
            statusCode: 200,
            body: JSON.stringify(body),
        };
        return response;
    } catch (err) {
        console.log(err);
        return {
            statusCode: 500,
            body: JSON.stringify({
                message: err,
            }),
        };
    }
};

pseudoLambdaHandler().then((res) => {
  console.log(res);
}) 
  2. It fails both with AND without transforms, unfortunately.
  3. Removing the objectMode setting does not help.

juanjoDiaz commented 1 year ago

I was thinking that maybe the transform was making the data grow, but that doesn't seem to be the problem. I'll profile this and try to find where the leak is. Using the Node transform should ensure that memory is bound by the highWaterMark set on the stream...

juanjoDiaz commented 1 year ago

Hi @pupixipup ,

Please read #45: backpressure works on the stream. You might need to set a highWaterMark to solve the issue.
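
A minimal sketch of what that could look like (untested; it assumes the second constructor argument is forwarded to the underlying Node.js stream, the same slot the snippets above use for objectMode, and the 64 KB value is just an illustration):

const https = require('https');
const fs = require('fs');
const { pipeline } = require('stream');
const { Transform: Json2CsvTransform } = require('@json2csv/node');

const url = 'https://raw.githubusercontent.com/seductiveapps/largeJSON/master/100mb.json';

https.get(url, (networkStream) => {
    pipeline(
        networkStream,
        new Json2CsvTransform(
            {},                            // json2csv options (fields, transforms, ...)
            { highWaterMark: 64 * 1024 },  // Node.js stream options; a smaller buffer makes backpressure kick in earlier
        ),
        fs.createWriteStream('out.csv'),   // always give the CSV somewhere to drain to
        (error) => {
            if (error) console.error(error);
        },
    );
});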

juanjoDiaz commented 1 year ago

Also, your example object doesn't look like an array of objects (as you would expect for converting to CSV). It looks more like a single object with a gigantic string, which has to be gathered in memory until it is complete and can be emitted.

Maybe that is the problem?
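
To illustrate the difference (made-up data, not the actual contents of the linked files):

// Shape that streams well: a top-level array of row objects.
// Each element can be parsed, converted to a CSV line and discarded.
const rows = [
    { id: 1, name: 'a' },
    { id: 2, name: 'b' },
];

// Shape that streams poorly: a single object with one gigantic value.
// The whole value has to be buffered before anything useful can be emitted.
const blob = { payload: '...one huge string or deeply nested object...' };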

ashishnetworks commented 1 year ago

In my case a similar error is happening, and I'm using both read and write streams. But yes, as you said, it seems the JSON is actually one big single object, and inside it one of the keys holds the array of row objects. Can that be the issue? If yes, is there any way to resolve it?

juanjoDiaz commented 1 year ago

Hi @ashishnetworks ,

The object has to be parsed. If your data is an array of objects, every element is parsed, processed and discarded, so the memory footprint stays pretty low. If your data is a single huge object or string, then the whole thing has to be parsed in memory before it can be processed.
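
If the rows you need are buried inside one of those keys, one possible workaround (a sketch only, not something built into this library) is to pre-extract the nested array with a streaming JSON parser such as stream-json, re-emit each element as a line of NDJSON, and let json2csv consume that. The 'data' key, the URL and the reliance on the transform's ndjson option are all assumptions for illustration:

const https = require('https');
const fs = require('fs');
const { pipeline } = require('stream');
const { chain } = require('stream-chain');
const { parser } = require('stream-json');
const { pick } = require('stream-json/filters/Pick');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { Transform: Json2CsvTransform } = require('@json2csv/node');

https.get('https://example.com/huge.json', (networkStream) => {   // hypothetical URL
    const rows = chain([
        networkStream,                                 // raw JSON text
        parser(),                                      // incremental JSON tokenizer
        pick({ filter: 'data' }),                      // keep only the nested "data" array (hypothetical key)
        streamArray(),                                 // emit one { key, value } per array element
        (row) => JSON.stringify(row.value) + '\n',     // re-serialize each element as one NDJSON line
    ]);

    pipeline(
        rows,
        new Json2CsvTransform({ ndjson: true }),       // parse line by line instead of one huge object
        fs.createWriteStream('rows.csv'),
        (error) => {
            if (error) console.error(error);
        },
    );
});

This keeps only one row object in memory at a time, which matches the "parsed, processed and discarded" behaviour described above.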