snowflakedb / snowflake-ingest-java

Java SDK for the Snowflake Ingest Service -
http://www.snowflake.net
Apache License 2.0

Wire up the ETag from S3's upload response back to the BlobDTO's MD5 field, to handle multipart upload correctly #915

Closed sfc-gh-hmadan closed 21 minutes ago

sfc-gh-hmadan commented 1 day ago

With the late-breaking design change to use subscoped tokens instead of direct S3 PUTs, we ended up using the same file handling logic as for BDECs. This meant going through JDBC and the S3 SDK.

During recent testing I discovered that files larger than 16 MB are uploaded to S3 as multipart uploads. The ETag of such an object is NOT the MD5 hash of its contents, which is consistent with what the S3 documentation says.
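To illustrate the difference, here is a minimal sketch of how the two ETag styles are commonly derived. It assumes the scheme described in the S3 object-integrity documentation for objects that are not encrypted with SSE-C or SSE-KMS: a single-part ETag is the hex MD5 of the content, while a multipart ETag is the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. The class and method names (`EtagSketch`, `singlePartEtag`, `multipartEtag`) are hypothetical and are not part of the SDK or of JDBC.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of the ingest SDK.
final class EtagSketch {
  private static final int MD5_LENGTH = 16; // an MD5 digest is always 16 bytes

  static byte[] md5(byte[] data) {
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  // ETag of an object uploaded in a single PUT: the hex MD5 of the content.
  static String singlePartEtag(byte[] data) {
    return hex(md5(data));
  }

  // ETag of an object uploaded in parts of partSize bytes (assumed scheme):
  // MD5 over the concatenated per-part MD5 digests, then "-<number of parts>".
  static String multipartEtag(byte[] data, int partSize) {
    if (data.length == 0 || partSize <= 0) {
      throw new IllegalArgumentException("need non-empty data and a positive part size");
    }
    int parts = (data.length + partSize - 1) / partSize;
    byte[] concatenated = new byte[parts * MD5_LENGTH];
    for (int i = 0; i < parts; i++) {
      int from = i * partSize;
      int to = Math.min(from + partSize, data.length);
      byte[] digest = md5(Arrays.copyOfRange(data, from, to));
      System.arraycopy(digest, 0, concatenated, i * MD5_LENGTH, MD5_LENGTH);
    }
    return hex(md5(concatenated)) + "-" + parts;
  }
}
```

For the same bytes, `singlePartEtag` and `multipartEtag` return different strings, which is why a locally computed MD5 cannot stand in for the ETag once the upload goes multipart.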

For BDECs, we calculate the MD5 hash ourselves and send it to Snowflake, where it's stored in the fileContentKey field. For Parquet files in Iceberg tables specifically, there is a check in XP to ensure that the ETag of the blob being read is identical to the fileContentKey stored in Snowflake metadata.

Connecting these dots: before this fix, for Iceberg ingestion of files larger than 16 MB, the SDK sends the MD5 hash in the fileContentKey property, whereas XP expects it to be the ETag value (which is NOT the MD5 of the contents if it's a multipart upload).

The proper fix is to make JDBC return the ETag value after uploading the file, through all the layers of JDBC classes, up to the API that the ingest SDK uses (uploadWithoutConnection).
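A rough sketch of the intended wiring, under stated assumptions: the upload call hands back whatever ETag the storage service reported, and the caller picks the content key based on the table type. Every name here (`ContentKeySketch`, `UploadResult`, `contentKey`) is hypothetical and does not reflect the actual JDBC or ingest SDK classes.

```java
import java.util.Optional;

// Hypothetical illustration only; the real JDBC and SDK types differ.
final class ContentKeySketch {

  // Hypothetical upload result: carries the ETag if the storage client reported one.
  static final class UploadResult {
    final Optional<String> etag;

    UploadResult(Optional<String> etag) {
      this.etag = etag;
    }
  }

  // Chooses the value to send as fileContentKey.
  // Iceberg: the ETag reported by the upload, since XP compares against the blob's ETag.
  // BDEC: the MD5 computed locally by the SDK, as before.
  static String contentKey(boolean isIceberg, String localMd5, UploadResult result) {
    if (isIceberg) {
      return result.etag.orElseThrow(
          () -> new IllegalStateException("upload did not report an ETag"));
    }
    return localMd5;
  }
}
```

Failing loudly when no ETag comes back is a design choice in this sketch: silently falling back to the local MD5 would reintroduce the mismatch for multipart uploads.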

Since we need to fix this right away, this PR copies over those parts of JDBC that are used for Iceberg ingestion. As soon as the JDBC driver has the new fix, we'll remove all these classes.

Note that this PR accidentally changes the timeout to 20 seconds. Another PR tomorrow is going to make that change, and I'll back it out of this branch before merging.

sfc-gh-hmadan commented 1 day ago

@sfc-gh-alhuang since the Iceberg merge gate does not run against GCS, can you please:

  1. test against GCS
  2. add encrypted profile.json files for the preprods too (right now it's just the prod profile.json files)