googleapis / nodejs-storage

Node.js client for Google Cloud Storage: unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.
https://cloud.google.com/storage/
Apache License 2.0
891 stars 367 forks source link

Possible TransferManager#downloadFileInChunks issues #2372

Closed rhodgkins closed 7 months ago

rhodgkins commented 7 months ago

I was going to use TransferManager#downloadFileInChunks to download a 20GB+ file and thought I'd investigate how it works (as the docs aren't that great around it - it's not the most clear where the file ends up).

But I think there's multiple issues around the way the response is returned, the CRC32C validation and also memory usage.

https://github.com/googleapis/nodejs-storage/blob/7a96ce6f764076a14f0961623e2dec2ce8893dd7/src/transfer-manager.ts#L680

First the type of the result isn't correct, it'll return Buffer[] and not [Buffer] (which is what DownloadResponse is).

https://github.com/googleapis/nodejs-storage/blob/7a96ce6f764076a14f0961623e2dec2ce8893dd7/src/transfer-manager.ts#L682

Second the result of the CRC32C validation isn't used - shouldn't it be compared against the fileInfo metadata's CRC32 value?

Finally, related to the first issue, the function returns the complete contents of the file into memory (as an array of Buffers`)? Won't this just cause OOM issues with any large GB sized file?

Happy to do a PR to address the issues if the above assumptions are correct. In relation to the complete contents of the file being returned, could have a flag on DownloadFileInChunksOptions to signal that function should just resolve to void? Or a lot simpler (but breaking), just always resolve to void and let the user load the contents of the downloaded file into memory if that's what they want...

Note, I've checked the tests out, and they're pretty basic only checking that CRC32C.fromFile is called, and not that the actual result is used in validation.

EDIT:

https://github.com/googleapis/nodejs-storage/blob/7a96ce6f764076a14f0961623e2dec2ce8893dd7/src/transfer-manager.ts#L686

I've also noticed that the FileHandle#close() isn't awaited and should probably be done after all the chunks are downloaded (ie. after await Promise.all(promises)) and before the CRC32C is computed. Is this assumption correct and also needs to be changed?

ddelgrosso1 commented 7 months ago

@rhodgkins thanks for bringing up these issues. 1. Yes, the close should be awaited 2. You are correct about the return value it is currently mismatched. 3. I think your suggestion regarding a flag makes a lot of sense to prevent this from overrunning memory on extremely large files. Also, having a flag would prevent a breaking change which we try to avoid as much as possible as it requires a major version bump. 4. With regards to CRC32C let me take another look and circle back on this one.

We always appreciate contributions so if you want to raise a fix that would be great. If not, I can tackle these before we go into freeze for the holidays.

rhodgkins commented 7 months ago

Thanks @ddelgrosso1 - first pass is #2373, I've done an educated guess at the CRC32C check...