googleapis / google-cloud-php

Google Cloud Client Library for PHP
https://cloud.google.com/php/docs/reference
Apache License 2.0
1.09k stars 436 forks source link

[spanner] : STATUS_UNAVAILABLE isn't retried in some cases in Result->rows() #5880

Closed oprudkyi closed 1 year ago

oprudkyi commented 1 year ago

Environment details

Steps to reproduce

it happens very rarely , once per day or week, on relatively loaded clusters and with simple random queries and I can't reproduce it

 PHP Notice: Google\Cloud\Core\Exception\ServiceException: {
    "message": "Broken pipe",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in /var/www/vendor/google/cloud-core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 /var/www/vendor/google/cloud-core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException(Object(Google\ApiCore\ApiException))
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream(Object(Google\ApiCore\ServerStream))
#2 /var/www/vendor/google/cloud-spanner/src/Result.php(228): Generator->next()
#3 [internal function]: Google\Cloud\Spanner\Result->rows()

according to https://cloud.google.com/spanner/docs/error-codes the call should be retried

UNAVAILABLE | The server is currently unavailable | Retry using exponential backoff. Note that it is not always safe to retry non-idempotent operations.

but according to source https://github.com/googleapis/google-cloud-php/blob/fda8c6c0061184e889d039eb4bbb2e88e35be9a7/Spanner/src/Result.php#L231 there are cases when call won't be retried

                $hasResumeToken = $this->isSetAndTrue($result, 'resumeToken');
                if ($hasResumeToken || count($bufferedResults) >= self::BUFFER_RESULT_LIMIT) {
                ...
                    $shouldRetry = $hasResumeToken;
                ...    
                $generator->next();
                $valid = $generator->valid();
            } catch (ServiceException $ex) {
                if ($shouldRetry && $ex->getCode() === Grpc\STATUS_UNAVAILABLE) {
                    // Attempt to resume using our last stored resume token. If we
                    // successfully resume, flush the buffer.
                    $generator = $backoff->execute($call, [$this->resumeToken]);
                    $bufferedResults = [];

                    continue;
                }

i.e. it retried only when there is set $resumeToken , but it never set for queries with small results

array (
  'metadata' => 
  array (
    'rowType' => 
    array (
      'fields' => 
      array (
        0 => 
        array (
          'name' => '',
          'type' => 
          array (
            'code' => 2,
            'typeAnnotation' => 0,
          ),
        ),
      ),
    ),
    'transaction' => 
    array (
      'id' => '',
    ),
  ),
  'values' => 
  array (
    0 => '1',
  ),
  'chunkedValue' => false,
  'resumeToken' => '',
)   

sample code I used

        $project = '';
        $instance = '';
        $database = '';
        $spanner = new SpannerClient([ 'projectId' => $project ]);
        $connection = $spanner->connect($instance, $database);
        $generator = $connection
            ->execute('SELECT 1')
            ->rows();

        iterator_to_array($generator);
vishwarajanand commented 1 year ago

@oprudkyi thanks for reporting this issue. There are reasons due to which all UNAVAILABLE errors are not recommended for retry such as non-idempotent calls.

Can you help me understand if this is a rarely occurring issue, can a retry on user code be helpful?

vishwarajanand commented 1 year ago

Meanwhile, I am trying to understand what should be the behavior when there's no resume token.

oprudkyi commented 1 year ago

Hi @vishwarajanand

There are reasons due to which all UNAVAILABLE errors are not recommended for retry such as non-idempotent calls.

as for me it looks like idempotent - it calls SELECT via https://cloud.google.com/spanner/docs/reference/rpc/google.spanner.v1#google.spanner.v1.Spanner.ExecuteSql

Meanwhile, I am trying to understand what should be the behavior when there's no resume token.

return original call ? i.e.

 $generator = $backoff->execute($call);
vishwarajanand commented 1 year ago

@oprudkyi I couldn't get the reason(s) why a resume token isn't there for certain rare cases like you pointed. But assuming that just happens androws(..) returns a generator, a retry without resume token will give unwanted duplicate results. I recommend a user level retry strategy to handle such cases.

oprudkyi commented 1 year ago

@vishwarajanand it's not rare, we are getting 50+ errors daily on prod. and I don't understand why it can't be fixed there

oprudkyi commented 1 year ago

just for info, the next ugly workaround helped. The first retry passes without error

    /**
     * @inheritDoc
     * exponential back-off on unavailable errors
     * https://github.com/googleapis/google-cloud-php/issues/5880
     */
    public function select($query, $bindings = [], $useReadPdo = true): array
    {
        $maxAttempts = 8;
        $attempts = 0;
        do {
            $attempts++;
            try {
                return parent::select($query, $bindings, $useReadPdo);
            } catch (QueryException $e) {
                $errorMessage = $e->getMessage();
                if (
                    $attempts <= $maxAttempts
                    && strpos($errorMessage, "UNAVAILABLE") !== false
                    && strpos($errorMessage, "Broken pipe") !== false
                ) {
                    // Broken pipe error, use exponential back-off, from 0.2sec to 25 sec
                    usleep(100000 * pow(2, $attempts));
                } else {
                    throw $e;
                }
            }
        } while (true);
    }

where parent::select is from https://github.com/colopl/laravel-spanner/blob/fbd7a2cab4b0b5e4d11857d356b4518177aed4e3/src/Connection.php#L258

    /**
     * @inheritDoc
     */
    public function select($query, $bindings = [], $useReadPdo = true): array
    {
        return $this->run($query, $bindings, function ($query, $bindings) {
            if ($this->pretending()) {
                return [];
            }

            $generator = $this->getDatabaseContext()
                ->execute($query, ['parameters' => $this->prepareBindings($bindings)])
                ->rows();

            return iterator_to_array($generator);
        });
    }

@taka-oyama but I am still unsure if this this workaround needed in laravel-spanner

taka-oyama commented 1 year ago

I just checked the logs for our service and saw similar errors at the rate of 1~2 an hour.

    "message": "Connection reset by peer",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []

Can we do something like

Within transaction -> Retry the whole transaction through below.

https://github.com/googleapis/google-cloud-php/blob/104537e8664db09fc07d927633f70706b0a0c7a6/Spanner/src/Database.php#L845

Outside the transaction -> Retry individual queries through below.

https://github.com/googleapis/google-cloud-php/blob/104537e8664db09fc07d927633f70706b0a0c7a6/Spanner/src/Result.php#L175

taka-oyama commented 1 year ago

Nodejs version seems to have fixed this back in 2020. https://github.com/googleapis/nodejs-spanner/pull/795

Dotnet has fixed a similar issue. https://github.com/googleapis/google-cloud-dotnet/issues/5977

oprudkyi commented 1 year ago

Hi @taka-oyama Thank you

Within transaction -> Retry the whole transaction through below.

For us there weren't any exceptions inside explicit transactions. just within transaction-less selects. I suspect transactions somehow are retried already

taka-oyama commented 1 year ago

Looking at node, dotnet, and java, @olavloite seems to know a lot about how it works and has made PRs for many similar issues.

@olavloite, would it be possible to get some insight on how this is suppose to be handled?

olavloite commented 1 year ago

(Caution, not a PHP expert here, so I might be reading something wrong in this code)

UNAVAILABLE errors that happen for a SELECT statement can generally be retried safely. The logic should be:

  1. If the initial call to ExecuteStreamingSql fails with an UNAVAILABLE error, the initial call to ExecuteStreamingSql can be retried without a resume token.
  2. The method that reads the stream that is returned by ExecuteStreamingSql should buffer all rows it receives until it sees a resume token, or until its buffer is full. Normally the first will happen. It then emits the rows it has seen so far. This is also the logic that is implemented in the PHP client library.
  3. If the stream fails halfway, the ExecuteStreamingSql call should be retried with the last seen resume token. If there is no resume token and the stream has emitted rows, then retrying is not safe. This is a very rare case. If the stream fails halfway and there is no resume token, but the method has also not emitted any rows, then it's also safe to just retry the initial ExecuteStreamingSql call without a resume token.

From what I can see the only thing that needs to be modified here is this line: https://github.com/googleapis/google-cloud-php/blob/fda8c6c0061184e889d039eb4bbb2e88e35be9a7/Spanner/src/Result.php#L180

It is safe to retry the initial call as long as the method has not returned any rows to the caller.

taka-oyama commented 1 year ago

Wow! Thank you so much for the quick response! 👏🥹

taka-oyama commented 1 year ago

@vishwarajanand Now that we understand how dotnet and js libraries handle this, would it be possible to get a fix for this?

vishwarajanand commented 1 year ago

I raised a fix to retry when no results have yielded, with or without the resume token.

taka-oyama commented 1 year ago

Thank you!