Closed · nathanielrindlaub closed this 2 months ago
This looks like it could be a backpressure issue. The CSV export requires piping together two additional transform streams (our custom `flattenImgTransform` and csv-stringify's `stringify` transform) before writing to the `streamToS3` stream. The COCO export, on the other hand, streams in the image records and writes them directly to the `streamToS3` stream without any transforms in between.
It's been a while since I've put on my "thinking in streams" hat, but this portion of the codebase stands out to me:
I think we may be filling up the stream with images before we start consuming from it.
Would something like the following work? i.e., connect all the streams before we start pushing images into the `flattenImg` transform:
```js
const { streamToS3, promise } = this.streamToS3(this.filename);

// pipe together the transform and write streams up front so that
// backpressure propagates from S3 all the way back to the source
const pipelineDone = stream.promises.pipeline(
  flattenImg,
  createRow,
  streamToS3
);

// stream in images from MongoDB, write to the transformation stream
for await (const img of Image.aggregate(this.pipeline)) {
  flattenImg.write(img);
}
flattenImg.end();

await Promise.all([pipelineDone, promise]);
```
P.S. this is a great ticket; I appreciate all of the context.
@alukach yeah you and me both. Streams are decidedly not my strong suit.
That's an interesting theory. I can test it pretty easily by adding a logging transform right after the `flattenImg` transform and seeing whether it fires while `flattenImg` is working, or whether it waits until all of the returned images have been flattened.
I'll let you know how it goes.
@alukach I think you were right, FYI. I still don't totally understand what was going on with that async iterator in relation to `stream.pipeline`, but evidently that setup was not correct. I did some logging, and it wasn't actually streaming the image records at all; all of the image records were pooling up and getting processed in their entirety by each stream before getting fed into the next one in the pipeline.
Anyhow, I figured out a fix for it, so we should be good now. Thanks for your help!
When exporting a large number of annotations to CSV (e.g. >286k, as is currently the case with SCI Biosecurity), the `task` Lambda exhausts its 1024 MB of memory. This shouldn't happen, as we're streaming records from the DB, transforming them, and then streaming the CSV rows to S3, but evidently something is not working as intended. The memory leak does not occur when exporting the exact same annotations to COCO.

I logged `process.memoryUsage()` for both processes while the annotations were being streamed in and written to S3 (at intervals of 1,000 images). These were the results when exporting to CSV (note: the final `rss` was 985 MB and memory was exhausted, so it never completed):

And here were the results when performing a multithreaded COCO export (note: the final `rss` was 341 MB and the export completed):