Closed: erik-induro closed this issue 10 months ago.
Thanks for opening an issue @erik-induro. A few follow-up questions: when you say 'pop', are you seeing an OOM problem? How large are the files you are reading, and what is the typical number of lines you are reading? What are the resources of the underlying machine you are running this on?
Edit: Have you tried turning off validation when creating the read stream to see if that changes what you are seeing? When using a ranged read like your second example, validation is automatically disabled.
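For reference, a sketch of the two options mentioned above (the createReadStream calls themselves are shown commented out since they need a real bucket; the option names come from the @google-cloud/storage createReadStream documentation, but the 64 KiB range is just an illustrative value):

```javascript
// Options for file.createReadStream() in @google-cloud/storage.

// Disable CRC32c/MD5 checksum validation on a full read:
const noValidation = { validation: false };

// Ranged read of just the first 64 KiB; ranged reads disable
// validation automatically:
const ranged = { start: 0, end: 64 * 1024 - 1 };

// const stream = file.createReadStream(noValidation);
// const stream = file.createReadStream(ranged);
console.log(noValidation.validation, ranged.end);
```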
@ddelgrosso1 When I say 'pop' I just mean that it got into a weird state: it wouldn't report telemetry anymore, was fairly unresponsive, and new instances would spin up that I had to terminate manually. We are running in App Engine. It is hard to describe, but it isn't particularly relevant since the symptom could change depending on how people are using the library.
The machine has 4 CPUs and ~26 GB of RAM, but the machine dies at ~5 GB. We use this flag: --max_old_space_size=24576. The files are provided by clients, so they can range from 1 line to ~1 million lines. Most of the files aren't that large; the issue is mostly caused by the quantity (~3k files/night === ~1 GB increase).
I have not tried turning off validation. I can give that a shot.
@ddelgrosso1 I tested turning off validation with the original approach and still saw the memory grow and not release.
@erik-induro thanks for the update. I will dig into this and try and figure out what is going on.
Dug into this a bit today and was not successful in recreating the behavior. First, I created a file containing 750,000 lines and uploaded it to my bucket. I ran the script below, which reads a random number of lines on each iteration, and collected metrics using clinic doctor. The heap allocation graph is below. It looks like heap usage grows and drops back instead of growing in an unconstrained fashion.
@erik-induro is there anything else you might be able to provide to recreate the behavior you are seeing?
import { Storage } from "@google-cloud/storage";
import readline from 'readline';

const s = new Storage({ projectId: 'my-project' });
const b = s.bucket('my-bucket');
const f = b.file('really-large-txt');

function readLines(file, numLines) {
  const readStream = file.createReadStream();
  const lineReader = readline.createInterface(readStream);
  let count = 0;
  const lines = [];
  return new Promise((res, rej) => {
    lineReader.on('line', line => {
      count++;
      lines.push(line);
      if (count >= numLines) {
        lineReader.close();
      }
    });
    lineReader.on('error', e => {
      readStream.destroy();
      rej(e);
    });
    lineReader.on('close', () => {
      readStream.destroy();
      res(lines);
    });
  });
}

async function testMemory() {
  for (let i = 0; i < 500; i++) {
    console.log(`Iteration: ${i}`);
    const rand = Math.floor(Math.random() * 750001);
    await readLines(f, rand);
  }
  console.log('Done!');
}

await testMemory();
@ddelgrosso1 thanks for taking a look at it. Also, I like that clinic doctor tool. I'm going to have to start using that :)
I copied your test and ran it locally. I am using a ~17k line file. I removed the rand part and just set it to 2, since that is what I am doing in my use case (I have to read enough to know how the file is delimited and whether it has more than just a header).
I think it has less to do with the number of lines read and more to do with opening the stream and not reading it to completion. I wonder if your spikes and drops come from the lines being read into memory and that array then being dumped, which distracts from the slow growth that still appears in the baseline of the test.
I ran the test for 2k iterations (which took a hot minute btw...). Here is the chart from that run:
@erik-induro thanks for the update. I will modify and try your suggestion tomorrow and let you know what turns up.
I believe I know what is going on, but I am still trying to definitively confirm it with further tests. When .destroy() is called on the only partially consumed stream, it causes an ERR_STREAM_PREMATURE_CLOSE to get propagated. Since under the hood we are using a pipeline, this may be causing leaked event handlers.
So my previous comment was partially wrong. In this situation, when the stream isn't fully consumed and is destroyed, it does indeed cause an ERR_STREAM_PREMATURE_CLOSE error to get propagated. The streams in the chain were getting destroyed correctly. However, I noticed that the underlying connections weren't getting closed. I should have noticed this from the clinic doctor graph: when running 2000 iterations, there were suddenly 2000 active handles.
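The premature-close propagation described above can be seen with plain Node streams, independent of the storage library. A minimal sketch (no GCS involved; the endless source stands in for a partially consumed read stream):

```javascript
import { pipeline, PassThrough, Readable } from 'node:stream';

let seenCode = null;

// An endless source piped into a destination that we destroy before
// the source ends, mimicking destroy() on a half-read stream.
const source = new Readable({ read() { this.push('data'); } });
const dest = new PassThrough();

pipeline(source, dest, (err) => {
  // Destroying a stage early surfaces ERR_STREAM_PREMATURE_CLOSE
  // in the pipeline's completion callback.
  seenCode = err?.code ?? null;
  console.log(seenCode);
});

dest.destroy();
```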
I believe this is the same issue raised here: https://github.com/node-fetch/node-fetch/issues/1762. To fix it, I added code to the pipeline handler in createReadStream to destroy the HTTP / HTTPS agent.
After adding the code to destroy the agent I ran another 2000 iteration test.
Now both memory and active handles look much more reasonable.
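For reference, a rough sketch of what destroying the agent involves. The actual fix lives inside createReadStream's pipeline error handler; this only shows the Node Agent API being called, with keepAlive as an assumed configuration:

```javascript
import https from 'node:https';

// A keep-alive agent holds sockets open between requests; these are
// the kind of "active handles" that were accumulating per iteration.
const agent = new https.Agent({ keepAlive: true });

// agent.destroy() closes any sockets currently in use by the agent,
// releasing their handles so the process can reclaim them.
agent.destroy();
console.log(Object.keys(agent.sockets).length);
```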
Environment details

@google-cloud/storage version: 6.9.5

Steps to reproduce

The following script reads the first N lines of a file and returns them. I rewrote this same function with multiple 3rd-party libraries and other native approaches; the issue is not the lineReader. Even though I call destroy(), it does not sufficiently clean up the resources. The heap size continues to increase on each execution until the memory on the server pops. I profiled this using node --inspect and saw Typed Arrays increase on each run and never decrease.

My workaround was to specify a smaller portion of the file to pull in, read it to completion, and extract the lines I care about. When I did this and profiled, I saw that Typed Arrays stayed a constant size.

Expected

I expect destroy() to clean up the resources for the stream when it is not read to the end, so that it does not cause a memory leak.
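A sketch of that workaround, written against a generic Readable so it is self-contained. With GCS you would pass something like file.createReadStream({ start: 0, end: 65535 }) so only a bounded byte range is fetched; readFirstLines and the demo input are illustrative names, not the library's API:

```javascript
import readline from 'node:readline';
import { Readable } from 'node:stream';

// Read a bounded stream to completion and keep only the first
// `numLines` lines; nothing is destroy()ed mid-stream, so no
// premature-close error is ever raised.
async function readFirstLines(stream, numLines) {
  const lines = [];
  const rl = readline.createInterface({ input: stream });
  for await (const line of rl) {
    if (lines.length < numLines) lines.push(line);
    // keep iterating so the (small) range drains completely
  }
  return lines;
}

// Works with any Readable:
const demo = await readFirstLines(Readable.from('header\nrow1\nrow2\n'), 2);
console.log(demo); // [ 'header', 'row1' ]
```

The trade-off is downloading the whole (bounded) range even when only a couple of lines are needed, which is cheap when the range is small.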