Please answer these questions before submitting your issue.
In order to accurately debug the issue, this information is required. Thanks!
What version of NodeJS driver are you using?
1.9.3
What operating system and processor architecture are you using?
macOS arm64
What version of NodeJS are you using?
(node --version and npm --version)
node : 18.12.1 , npm: 8.19.2
What are the component versions in the environment (npm list)?
NA
Server version:
8.9.1
What did you do?
Issue Summary
While executing a PUT query to stage a large, compressed CSV file from the local file system to a Snowflake stage (S3), the memory usage of the snowflake-sdk grows significantly, especially with large files. During the execution, the Snowflake SDK performs several operations:
Compression (if the file is not already compressed),
SHA-256 Digest Calculation,
AES Encryption,
Upload to S3 (or other remote storage).
While these steps are necessary, the SDK's memory footprint grows significantly with the file size, which appears to be due to the following reasons:
Digest Calculation:
The SDK calculates the SHA-256 digest of the file by reading the entire file into memory (Ref code).
For large files, this leads to high memory consumption, which can cause memory-related issues or crashes.
Suggestion: Instead of loading the entire file into memory, the hash can be updated incrementally with each chunk of data as it is read, i.e., the digest is built up while the file is streamed, which keeps the memory footprint small. (Crypto module Ref - the hash can be updated many times with new data as it is streamed.) A sketch of this approach is shown below.
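A minimal sketch of the idea, using only Node's built-in crypto and fs modules (the function name and file path are placeholders, not the driver's actual code):

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Compute the SHA-256 digest of a file without buffering the whole file:
// each chunk emitted by the read stream is fed into the incremental hash.
function sha256OfFile(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(filePath)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('base64')))
      .on('error', reject);
  });
}
```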
File Upload:
When the SDK attempts to upload the encrypted file to the remote storage provider (S3, GCS, Azure), it reads the file synchronously into memory (using readFileSync), which again leads to excessive memory consumption for large files. [Ref Code - S3, GCS, Azure]
Suggestion: The SDK should leverage streams (createReadStream) during the upload process instead of reading the entire file into memory. Streaming the file to the storage provider would significantly reduce the memory overhead, especially for large files (see the S3 sketch below).
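For S3, a hedged sketch of what a streaming upload could look like using the AWS SDK v3 Upload helper from @aws-sdk/lib-storage (assuming an S3Client is already available; the function name, bucket, key, and file path are placeholders):

```javascript
const fs = require('fs');
const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');

// Stream the (already encrypted) file to S3 as a multipart upload,
// so only one part is buffered in memory at a time instead of the whole file.
async function uploadFileStreaming(bucket, key, filePath) {
  const s3Client = new S3Client({});
  const upload = new Upload({
    client: s3Client,
    params: {
      Bucket: bucket,
      Key: key,
      Body: fs.createReadStream(filePath), // a stream, not a Buffer from readFileSync
    },
  });
  await upload.done();
}
```

The GCS and Azure client libraries expose comparable stream-based upload paths (e.g., createWriteStream / uploadStream), so the same pattern should be applicable there as well.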
Steps to Reproduce:
Prepare a large, compressed CSV file (e.g., several GB in size) [Example script to generate the necessary data file - file-gzip.txt].
Use the following script to execute a PUT query to upload the file to a Snowflake stage (S3 in my case) [Script - Execute_PUT.txt].
While executing the query, monitor memory usage with tools such as Node.js process memory logging (process.memoryUsage()), Clinic.js Doctor, or any external memory profiler. A minimal reproduction sketch is included below.
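For completeness, a minimal sketch of what such a reproduction could look like with the public snowflake-sdk API; the connection options, stage name, and file path are placeholders, and memory is simply sampled with process.memoryUsage() on an interval:

```javascript
const snowflake = require('snowflake-sdk');

const connection = snowflake.createConnection({
  account: '<account>',
  username: '<user>',
  password: '<password>',
  warehouse: '<warehouse>',
  database: '<database>',
  schema: '<schema>',
});

// Sample RSS and heap usage once per second while the PUT is running.
const monitor = setInterval(() => {
  const m = process.memoryUsage();
  console.log(`rss=${Math.round(m.rss / 1e6)}MB heapUsed=${Math.round(m.heapUsed / 1e6)}MB`);
}, 1000);

connection.connect((connectErr) => {
  if (connectErr) throw connectErr;
  connection.execute({
    sqlText: 'PUT file:///path/to/large_file.csv.gz @my_stage AUTO_COMPRESS=FALSE',
    complete: (err, stmt, rows) => {
      clearInterval(monitor);
      if (err) throw err;
      console.log('PUT finished', rows);
      connection.destroy(() => {});
    },
  });
});
```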
What did you expect to see?
Ideally, the SDK should minimise memory consumption by using a streaming approach for both the digest calculation and file upload steps. This would help in handling large files more efficiently.
Can you set logging to DEBUG and collect the logs?
No
What is your Snowflake account identifier, if any? (Optional)
Thank you for raising this enhancement request with us; we'll consider it for the future roadmap (with no timeline commitment).
Really appreciate the details you provided and the suggestions!