googleapis / google-cloud-cpp

C++ Client Libraries for Google Cloud Services
https://cloud.google.com/
Apache License 2.0

Implement ParallelInsertObject() #11191

Open coryan opened 1 year ago

coryan commented 1 year ago

This is motivated by a question (#11145). Some applications have largish in-memory buffers (think "10 GiB") and would like to parallelize the upload. Maybe use parallel single-shot uploads and then compose the results.


that I want to upload to GCS. PrepareParallelUpload seems like a fit to my needs,

Maybe. That would use resumable uploads, which are slower than single-shot uploads. The new non-copying overloads of InsertObject() let you upload those buffers without making extra copies. They will appear in the next release:

https://googleapis.dev/cpp/google-cloud-storage/HEAD/classgoogle_1_1cloud_1_1storage_1_1Client.html#ace17975b9eeae9df0da67712a430a349
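
For example, a single non-copying upload would look roughly like this (a sketch, not compiled; it assumes the absl::string_view overload linked above, and the function and parameter names are placeholders):

#include "google/cloud/storage/client.h"
#include "absl/strings/string_view.h"

namespace gcs = google::cloud::storage;

// Sketch: upload an in-memory buffer in a single request. The payload is
// passed as an absl::string_view, so the client does not copy the buffer.
google::cloud::StatusOr<gcs::ObjectMetadata> SingleShotUpload(
    gcs::Client client, std::string const& bucket_name,
    std::string const& object_name, absl::string_view data) {
  return client.InsertObject(bucket_name, object_name, data);
}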

For in-memory buffers that may be the most efficient way to upload the buffer. If the buffers are not too big (say up to a few GiB), something like this could work:

#include "google/cloud/storage/client.h"
#include "google/cloud/storage/parallel_upload.h"
#include "absl/strings/string_view.h"
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

namespace g = google::cloud;
namespace gcs = google::cloud::storage;

// TODO: actually compile and test this.
// TODO: the error handling is a bit naive.
gcs::ObjectMetadata ParallelUpload(gcs::Client client, std::string bucket_name,
                                   std::string object_name,
                                   absl::string_view data) {
  auto const kShardSize = 32 * 1024 * 1024L;
  auto const prefix = gcs::CreateRandomPrefixName();
  struct Shard {
    std::string name;
    absl::string_view data;
  };
  std::vector<Shard> shards;
  shards.reserve(data.size() / kShardSize + 1);
  int counter = 0;
  for (std::size_t offset = 0; offset < data.size(); offset += kShardSize) {
    shards.push_back(Shard{prefix + "/shard-" + std::to_string(counter++),
                           data.substr(offset, kShardSize)});
  }
  std::vector<std::future<g::StatusOr<gcs::ObjectMetadata>>> tasks(shards.size());
  std::transform(shards.begin(), shards.end(), tasks.begin(), [&](auto const& shard) {
    return std::async(
        std::launch::async,
        [](auto client, auto bucket, auto object, auto data) {
          return client.InsertObject(bucket, object, data);
        },
        client, bucket_name, shard.name, shard.data);
  });
  std::vector<gcs::ObjectMetadata> results(tasks.size());
  std::transform(tasks.begin(), tasks.end(), results.begin(), [](auto& f) {
    auto metadata = f.get();
    if (!metadata) throw std::move(metadata).status();
    return *std::move(metadata);
  });
  std::vector<gcs::ComposeSourceObject> sources(results.size());
  std::transform(results.begin(), results.end(), sources.begin(), [](auto const& m) {
    return gcs::ComposeSourceObject{m.name(), m.generation(), absl::nullopt};
  });
  auto metadata = gcs::ComposeMany(
      client, bucket_name, sources, prefix, object_name, /*ignore_cleanup_failures=*/true);
  if (!metadata) throw std::move(metadata).status();
  auto status = gcs::DeleteByPrefix(client, bucket_name, prefix);
  if (!status.ok()) throw status;
  return *std::move(metadata);
}

For much larger buffers (say starting at around 4 GiB) you may want to create fewer shards, or use resumable uploads (to avoid resending data on partial failures).
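
For reference, a resumable upload of a single shard can be written with WriteObject(); the session id from resumable_session_id() can be passed back via gcs::UseResumableUploadSession() to resume after a failure. A rough sketch (not compiled; names are placeholders, error handling omitted):

#include "google/cloud/storage/client.h"
#include "absl/strings/string_view.h"

namespace gcs = google::cloud::storage;

// Sketch: upload one shard using a resumable upload. WriteObject() creates a
// resumable upload session; on a partial failure the upload can be restarted
// with gcs::UseResumableUploadSession(os.resumable_session_id()) instead of
// resending all the data.
google::cloud::StatusOr<gcs::ObjectMetadata> ResumableShardUpload(
    gcs::Client client, std::string const& bucket_name,
    std::string const& object_name, absl::string_view data) {
  auto os = client.WriteObject(bucket_name, object_name);
  os.write(data.data(), static_cast<std::streamsize>(data.size()));
  os.Close();
  return os.metadata();
}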

but unfortunately it is declared in the google::cloud::storage::internal namespace, discouraging external code from using it.

Correct.

Describe the solution you'd like I would like PrepareParallelUpload (and CreateUploadShards, which might be useful for others) to become available in the public google::cloud::storage namespace. Currently only ParallelUploadFile is public; I don't know if that is intentional or an oversight.

It is intentional: we did not know if the API was useful, or if we wanted to tie the implementation down by exposing the implementation details. For example, we could change the implementation to use single-shot uploads (when the shards are small enough), to use XML multipart uploads, or to use the asynchronous gRPC client (which is progressing too slowly), and then this function would disappear or its interface would completely change.
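
For completeness, the public entry point today is ParallelUploadFile(), which shards a file on disk rather than an in-memory buffer; a rough sketch of its use (not compiled, names are placeholders):

#include "google/cloud/storage/client.h"
#include "google/cloud/storage/parallel_upload.h"

namespace gcs = google::cloud::storage;

// Sketch: the public parallel-upload API works on files, not in-memory
// buffers, which is why this issue asks for an in-memory equivalent.
google::cloud::StatusOr<gcs::ObjectMetadata> UploadFileInParallel(
    gcs::Client client, std::string const& file_name,
    std::string const& bucket_name, std::string const& object_name) {
  return gcs::ParallelUploadFile(
      std::move(client), file_name, bucket_name, object_name,
      gcs::CreateRandomPrefixName(), /*ignore_cleanup_failures=*/false);
}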

Using them from the internal namespace doesn't sound like a good idea.

Certainly not recommended.

Originally posted by @coryan in https://github.com/googleapis/google-cloud-cpp/issues/11145#issuecomment-1490320273

coryan commented 1 year ago

Will consider in 2023/Q4.

coryan commented 9 months ago

Will consider in 2024/Q2.

scotthart commented 1 month ago

Determine if we still need to do this.