googleapis / java-storage

Apache License 2.0

[Upload] Handle both Files and InputStreams #2727

Open Amraneze opened 21 hours ago

Amraneze commented 21 hours ago

Upload

Description

We use Google Cloud Storage to download compressed files, decompress them, and upload the decompressed files back to Google Cloud Storage. The problem is that we use InputStreams so that we do not overload the application's heap memory. We would therefore like uploads to handle both cases: files and input streams.

Solution

I drafted PR #2728 as an example of what we need.

Alternatives

Sticking with the normal upload in the Google Cloud Storage client.

BenWhitehead commented 20 hours ago

Hi,

A large reason TransferManager only accepts Paths is that Paths allow minimal memory overhead: the bytes are on disk and can therefore be read and uploaded in small increments (8KiB at a time). Additionally, if an upload is interrupted by a retryable error, we can retry from any arbitrary offset.
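To illustrate why a Path makes retries cheap, here is a minimal sketch that uses only the JDK. It is not TransferManager's actual implementation; the class name `ResumableFileCopy`, the method `copyFrom`, and the generic `WritableByteChannel` sink standing in for the upload are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: copies a file to a sink in small reads, starting at any offset.
final class ResumableFileCopy {
  // 8KiB per read, matching the increment size mentioned above.
  private static final int READ_SIZE = 8 * 1024;

  /**
   * Sends the bytes of {@code file} from {@code offset} to the end into {@code sink}.
   * Returns the offset of the next unsent byte (the file size on success).
   */
  static long copyFrom(Path file, long offset, WritableByteChannel sink) throws IOException {
    try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
      // A file can be repositioned freely, so a retry can resume at any offset.
      source.position(offset);
      ByteBuffer buf = ByteBuffer.allocate(READ_SIZE);
      long next = offset;
      while (source.read(buf) != -1) {
        buf.flip();
        while (buf.hasRemaining()) {
          next += sink.write(buf);
        }
        buf.clear();
      }
      return next;
    }
  }
}
```

If a write fails part-way, a caller could ask the server how many bytes were persisted and call `copyFrom` again with that offset; only one 8KiB buffer is ever held in memory.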

When an InputStream is provided to us, we have to switch to a chunked approach where we buffer up to a certain number of bytes (default 16MiB) before flushing that buffer to GCS. The reason is that InputStreams are not universally rewindable, so if an interruption happened while uploading, the whole upload would fail. This is especially true of an InputStream from Channels.newInputStream(storage.reader(BlobId.of("bucket-name", "object-name"))).
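The chunked approach can be sketched roughly as follows, again with only the JDK. This is an illustration of the buffering idea, not the library's internal code; `ChunkedStreamUpload`, `upload`, and the `chunkSize` parameter are hypothetical names, and the sink is a plain `WritableByteChannel`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper: buffers a non-rewindable stream into fixed-size chunks.
final class ChunkedStreamUpload {

  /**
   * Reads {@code in} until EOF, flushing to {@code sink} each time the buffer fills.
   * Returns the number of flushes performed. {@code chunkSize} must be positive.
   */
  static int upload(InputStream in, WritableByteChannel sink, int chunkSize) throws IOException {
    byte[] chunk = new byte[chunkSize];
    int filled = 0;
    int flushes = 0;
    int n;
    while ((n = in.read(chunk, filled, chunkSize - filled)) != -1) {
      filled += n;
      if (filled == chunkSize) {
        // The whole chunk is still in memory here, so this flush could be
        // repeated after a retryable error without rewinding the stream.
        flush(chunk, filled, sink);
        flushes++;
        filled = 0;
      }
    }
    if (filled > 0) {
      // Final, possibly short, chunk.
      flush(chunk, filled, sink);
      flushes++;
    }
    return flushes;
  }

  private static void flush(byte[] chunk, int length, WritableByteChannel sink) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(chunk, 0, length);
    while (buf.hasRemaining()) {
      sink.write(buf);
    }
  }
}
```

The memory cost is one full chunk per concurrent upload, which is the trade-off compared with the Path case. With the existing client, the `WriteChannel` returned by `storage.writer(BlobInfo)` is a `WritableByteChannel` and already does this kind of buffering internally, so in practice the stream's bytes can be written to it directly.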

Transferring objects between buckets is something Storage Transfer Service has been purpose-built to perform in a managed, performant manner. A GCS bucket can be both a source and a sink. The example of how you might transition all objects to the Nearline storage class should give you an idea of how to get started: see https://cloud.google.com/storage-transfer/docs/create-transfers#client-libraries and select the Java tab.