googleapis / google-cloud-php

Google Cloud Client Library for PHP
https://cloud.google.com/php/docs/reference
Apache License 2.0

Cloud Storage - Implement method to check existence of multiple objects in a single operation #2337

Open bduclaux opened 4 years ago

bduclaux commented 4 years ago

Hello

We are using the PHP Cloud Storage library, and we are facing a performance issue when checking the existence of multiple objects in a storage bucket. Currently, the only way to implement such a check is with a loop such as:

$names = ["file1", "file2", ..., "fileN"];
foreach ($names as $name) {
    $object = $bucket->object($name);
    if (!$object->exists()) {
        // handle missing object
    }
}

This triggers a REST API call for each object, which is slow. We usually have around 10 object names per loop, so each loop takes around 0.4 s. As we do this many times, we have a performance issue.

It would be great to have a method on the Bucket class to check multiple object names at once, with a single request to the Cloud Storage back-end APIs (not sure such a method exists in the back-end API).

Thanks !

andrewinc commented 4 years ago

@dwsupplee, I am working on this. I would like to coordinate the design of the solution. $object->exists() is based on whether $connection->getObject(...) throws an exception; I think this request is the slow part.

  1. I propose a solution based on $bucket->objects(): fetch the listing once, then look the names up in a loop. This solution requires only one request.
  2. A more complicated solution: add a new method to Google\Cloud\Storage\Connection\ConnectionInterface and implement it in Google\Cloud\Storage\Connection\Rest.
  3. A bad solution: cram the code above into a separate method on the Bucket class. This would give no gain in speed.

I myself would choose No. 1
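To make option 1 concrete, here is a minimal pure-PHP sketch. objectsExist is a hypothetical helper (not a real library method), and the listing is passed in as plain names; with the real client the first loop would iterate $bucket->objects() and call ->name() on each StorageObject.

```php
// Hypothetical sketch of option 1: build a lookup table from a single
// listing, then answer existence for each requested name in memory.
function objectsExist(iterable $listedNames, array $names): array
{
    // One pass over the listing (one class A request with the real client).
    $existing = [];
    foreach ($listedNames as $listed) {
        $existing[$listed] = true;
    }

    // One in-memory lookup per requested name; no further API calls.
    $result = [];
    foreach ($names as $name) {
        $result[$name] = isset($existing[$name]);
    }

    return $result;
}
```

This also happens to produce the associative-array result shape discussed below (name => bool), which keeps missing names visible to the caller.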

As for the method of the Bucket class that @bduclaux requested, there is a question about how to return the result.

  1. Return a list containing only the names that exist:

    $bucket->objectsExists(["file1", "file2", ..., "fileN"]);
    // ["file1", ..., "fileN"]

  2. Return an associative array with keys taken from the original list:

    $bucket->objectsExists(["file1", "file2", ..., "fileN"]);
    // ["file1" => true, "file2" => false, ..., "fileN" => true]

  3. Pass an associative array by reference; the method then returns true only if all names from the list are found:

    $names = ["file1" => null, "file2" => null, ..., "fileN" => null];
    $result = $bucket->objectsExists($names);
    // $result = false; (true if all exist)
    // $names = ["file1" => true, "file2" => false, ..., "fileN" => true];

  4. Accept a callback to process each name; the method can again return true if all names from the list are found. But this style feels more like Node.js than idiomatic PHP:

    $result = $bucket->objectsExists(["file1", "file2", ..., "fileN"], function ($name, $exists) { ... });
    // $result = false; (true if all exist)

Here too, I myself would choose No. 1.
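For comparison, variant 3's contract can be sketched in pure PHP. objectsExistByRef is a hypothetical stand-in (not a library method), and $existingNames stands in for the bucket listing:

```php
// Hypothetical sketch of variant 3: the caller passes names as array keys,
// the values are filled in place through the by-reference parameter, and
// the return value is true only if every name exists.
function objectsExistByRef(array &$names, array $existingNames): bool
{
    $existing = array_fill_keys($existingNames, true);
    $all = true;
    foreach (array_keys($names) as $name) {
        $names[$name] = isset($existing[$name]);
        $all = $all && $names[$name];
    }
    return $all;
}
```

The by-reference signature makes the mutation explicit at the definition site, but callers cannot pass a literal array, which is one argument for preferring variant 1 or 2.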

bduclaux commented 4 years ago

Hi @andrewinc, thanks! Please also take into account the cost of the API queries: class A queries are more expensive than class B queries (see https://cloud.google.com/storage/pricing). Getting the full list of objects might also take a while for large buckets, unless you specify a prefix.

andrewinc commented 4 years ago

Yes @bduclaux, you're right. The $bucket->objects() call is billed as a class A operation, i.e. roughly 10 times more expensive than $object->exists(), so it only pays off when requesting around 10 or more objects at a time, as in your case:

We usually have around 10 object names per loop
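The break-even point can be sketched with back-of-the-envelope arithmetic. The 10x ratio below is taken from the discussion above, not from the current price sheet; check the pricing page for real numbers.

```php
// Assumption: one class A listing costs about as much as ten class B
// existence checks (approximate ratio from the pricing discussion above).
$classAToClassBRatio = 10;

$objectCount = 10;                  // names checked per loop, as above
$loopCost = $objectCount;           // N class B requests
$listCost = $classAToClassBRatio;   // 1 class A request

$listingPaysOff = $loopCost >= $listCost; // true once N reaches ~10
```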

As for the large list, you are also right that it can be shortened by specifying a prefix if desired. Of course, this feature needs to remain available, as already provided by $bucket->objects(['prefix' => ...]);.
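One way to derive that prefix automatically, sketched as a hypothetical helper (commonPrefix is not part of the library): compute the longest common string prefix of the requested names and pass it via the 'prefix' option to shrink the listing.

```php
// Hypothetical helper: longest common string prefix of the requested
// names. The result could be passed as $bucket->objects(['prefix' => $p])
// so the listing only covers objects that could possibly match.
function commonPrefix(array $names): string
{
    $prefix = array_shift($names) ?? '';
    foreach ($names as $name) {
        // Trim the candidate prefix until $name starts with it.
        while ($prefix !== '' && strncmp($name, $prefix, strlen($prefix)) !== 0) {
            $prefix = substr($prefix, 0, -1);
        }
    }
    return $prefix;
}
```

An empty result simply means the names share no prefix, in which case the unfiltered listing is the fallback.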

I would like to know what @dwsupplee thinks about this. Maybe they will offer a different solution.

dwsupplee commented 4 years ago

@andrewinc thanks so much for taking the time to put some thoughts together on this, and thank you @bduclaux for the feature request. We'd definitely love to add support for something like this.

We've been laying the groundwork for exposing asynchronous network requests for some time now. This should allow us to expose something which looks like the following:

use GuzzleHttp\Promise;
use Google\Cloud\Storage\StorageClient;

$bucket = (new StorageClient())->bucket('my-bucket');
$promises = [];
$objectNames = ['a.txt', 'b.txt', 'c.txt'];

foreach ($objectNames as $objectName) {
    // Collect each promise so they can be settled together below.
    $promises[] = $bucket->object($objectName)
        ->existsAsync()
        ->then(function ($exists) use ($objectName) {
            echo $objectName . ': ' . ($exists ? 'true' : 'false') . PHP_EOL;
        });
}

Promise\unwrap($promises);

We've done this as a "one-off" over on StorageObject::downloadAsStreamAsync, with the plan being to expose async counterparts for the rest of the Storage library's methods as part of our 2.0 version bump (we don't have a clear ETA for this at the moment).

Another option would be to expose the batch API through our storage client; this would allow weaving up to 100 requests into a single API request. There is some work in progress to define a plan for how we can expose this across languages. I'll check in and see where that progress stands, but will note it could require breaking changes to the library as well.

I prefer these approaches over the list objects implementation because I'm apprehensive of edge case scenarios like the following:

I have 100,000 objects in my bucket. I want to check that objects "a.txt" and "z.txt" exist. "a.txt" happens to be object 1 of 100,000 returned, while "z.txt" is object 100,000. The max results returned from a single RPC to list objects is 1,000, meaning I'd have to page through 100 times to reach "z.txt". The end cost is ~100 RPCs to check for two objects.
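The paging arithmetic behind that edge case, using the numbers from the scenario above:

```php
// Worst case from the example: the listing must be paged through until
// "z.txt" (the last object) finally appears.
$totalObjects = 100000;
$maxResultsPerPage = 1000;

$pagesNeeded = (int) ceil($totalObjects / $maxResultsPerPage);
// ~100 list RPCs to confirm the existence of just two objects.
```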