tnc-ca-geo / animl-api

Backend for https://animl.camera

Address memory leak when exporting annotations to CSV #175

Closed nathanielrindlaub closed 2 months ago

nathanielrindlaub commented 2 months ago

When exporting a large number of annotations to CSV (e.g. >286k, as is currently the case with SCI Biosecurity), the task Lambda exhausts its 1024 MB of memory. This shouldn't happen: we're streaming records from the DB, transforming them, and then streaming the CSV rows to S3, but evidently something is not working as intended.

The memory leak does not occur when exporting the exact same annotations to COCO. I logged process.memoryUsage() for both processes while the annotations were being streamed in and written to S3 (at intervals of 1000 images). These were the results when exporting to CSV (note: the final rss was 985 MB, memory was exhausted, and the export never completed):

flattened img count: 1000. remaining memory:
{ "rss": 137,121,792, "heapTotal": 73,748,480, "heapUsed": 50,899,840, "external": 58,805,150, "arrayBuffers": 55,131,016 }
...
flattened img count: 40000. remaining memory:
{ "rss": 392,052,736, "heapTotal": 330,014,720, "heapUsed": 231,993,000, "external": 70,528,828, "arrayBuffers": 66,854,654 }
...
flattened img count: 190000. remaining memory:
{ "rss": 1,033,187,328, "heapTotal": 983,359,488, "heapUsed": 846,987,352, "external": 87,847,040, "arrayBuffers": 84,172,866 }

And here were the results when performing a multithreaded COCO export (note: the final rss was 341 MB and the export completed):

processed img count: 1000. remaining memory:
{ "rss": 114,442,240, "heapTotal": 48,320,512, "heapUsed": 39,098,304, "external": 59,472,021, "arrayBuffers": 55,797,887 }
...
processed img count: 40000. remaining memory:
{ "rss": 299,696,128, "heapTotal": 181,751,808, "heapUsed": 64,955,856, "external": 90,414,972, "arrayBuffers": 86,740,798 }
...
processed img count: 190000. remaining memory:
{ "rss": 352,382,976, "heapTotal": 184,111,104, "heapUsed": 82,479,608, "external": 132,587,820, "arrayBuffers": 128,913,646 }
...
processed img count: 286000. remaining memory:
{ "rss": 357,933,056, "heapTotal": 190,664,704, "heapUsed": 100,292,920, "external": 131,276,898, "arrayBuffers": 127,602,724 }
...
REPORT RequestId: 2354a34d-e251-5acb-a1be-fdc7fa919462 Duration: 52808.03 ms Billed Duration: 52809 ms Memory Size: 1024 MB Max Memory Used: 385 MB
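
For context, the numbers above came from instrumentation along these lines (a sketch with illustrative names, not the exact export code): a counter incremented per record, printing process.memoryUsage() every 1000 images.

let count = 0;

// called once per image record flowing through the stream
function logMemory(label) {
  count += 1;
  if (count % 1000 !== 0) return; // only log every 1000 images
  const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
  console.log(`${label} img count: ${count}. remaining memory:`);
  console.log({ rss, heapTotal, heapUsed, external, arrayBuffers });
}
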
nathanielrindlaub commented 2 months ago

This looks like it could be a backpressure issue. The CSV export requires piping together two additional transform streams (our custom flattenImgTransform and csv-stringify's stringify transform) before writing to the streamToS3 stream.

The COCO export, on the other hand, streams in the image records and writes them directly to the streamToS3 streams without any transforms in between.
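
Roughly, the two topologies look like this (a sketch based on the descriptions above, using the stream names from this thread rather than the actual code in src/task/annotations.js):

import { stringify } from 'csv-stringify';

// CSV path: two extra transforms sit between the image records and the S3
// write stream, so backpressure has to propagate back through both of them:
//   DB records -> flattenImg -> createRow (csv-stringify) -> streamToS3
const createRow = stringify({ header: true });
flattenImg.pipe(createRow).pipe(streamToS3);

// COCO path: records are serialized and written straight to the S3 write
// stream(s), with no intermediate transforms:
//   DB records -> streamToS3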

alukach commented 2 months ago

It's been a while since I've put on my "thinking in streams" hat, but this portion of the codebase stands out to me:

https://github.com/tnc-ca-geo/animl-api/blob/34b27f4bb5da5f022a6d101031c670ae0a22801f/src/task/annotations.js#L91-L95

I think we may be filling up the stream with images before we start consuming from it.

Would something like the following work? i.e., connect all the streams before we start pushing images into the flattenImg transform?

const { streamToS3, promise } = this.streamToS3(this.filename);

// connect all the streams up front so backpressure can propagate end to end;
// note: this needs the promise-based pipeline from 'node:stream/promises'
// (the callback version of stream.pipeline throws without a callback argument)
const streams = stream.pipeline(
  flattenImg,
  createRow,
  streamToS3
);

// stream in images from MongoDB, write them into the transformation stream
// (for strict backpressure, write()'s return value / 'drain' could also be respected)
for await (const img of Image.aggregate(this.pipeline)) {
  flattenImg.write(img);
}
flattenImg.end();

// wait for both the pipeline and the S3 upload to finish
await Promise.all([streams, promise]);

ps this is a great ticket, I appreciate all of the context

nathanielrindlaub commented 2 months ago

@alukach yeah you and me both. Streams are decidedly not my strong suit.

That's an interesting theory that I can test pretty easily by adding a logging transform right after the flattenImg transform and seeing whether it fires while flattenImg is working or waits until it has flattened all of the returned images.
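
Something like this object-mode pass-through would work for that check (a sketch; the names are illustrative):

import { Transform } from 'node:stream';

// logs as each flattened row arrives, so we can see whether rows trickle
// through while flattenImg is still working or only show up after everything
// upstream has been buffered
const logRows = new Transform({
  objectMode: true,
  transform(row, _enc, callback) {
    console.log('row reached logging transform at', new Date().toISOString());
    callback(null, row); // pass the row through unchanged
  },
});

// spliced in right after flattenImg:
//   flattenImg -> logRows -> createRow -> streamToS3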

I'll let you know how it goes.

nathanielrindlaub commented 2 months ago

@alukach FYI, I think you were right. I still don't totally understand what was going on with that async iterator in relation to the stream.pipeline, but evidently that setup was not correct. I did some logging and it wasn't actually streaming the image records at all; all of the image records were pooling up and getting processed in their entirety by each stream before getting fed into the next one in the pipeline.

Anyhow, figured out a fix for it so we should be good now. Thanks for your help!
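
For anyone who hits this later: one pattern that gives end-to-end backpressure is to hand the Mongoose aggregation cursor directly to the promise-based stream.pipeline, so records are only pulled from MongoDB as fast as the downstream transforms and the S3 upload can absorb them. This is a sketch of that pattern (not necessarily the exact fix that landed here):

import { pipeline } from 'node:stream/promises';

const { streamToS3, promise } = this.streamToS3(this.filename);

// pipeline() connects source -> transforms -> sink and propagates backpressure,
// so the cursor is only read as quickly as the S3 stream accepts bytes
await pipeline(
  Image.aggregate(this.pipeline).cursor(), // readable cursor over the aggregation
  flattenImg,
  createRow,
  streamToS3
);

// wait for the S3 upload itself to complete
await promise;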