Closed oprudkyi closed 1 year ago
@oprudkyi thanks for reporting this issue. There are reasons due to which all UNAVAILABLE errors are not recommended for retry such as non-idempotent calls.
Can you help me understand if this is a rarely occurring issue, can a retry on user code be helpful?
Meanwhile, I am trying to understand what should be the behavior when there's no resume token.
Hi @vishwarajanand
There are reasons due to which all UNAVAILABLE errors are not recommended for retry such as non-idempotent calls.
as for me it looks like idempotent - it calls SELECT via https://cloud.google.com/spanner/docs/reference/rpc/google.spanner.v1#google.spanner.v1.Spanner.ExecuteSql
Meanwhile, I am trying to understand what should be the behavior when there's no resume token.
return original call ? i.e.
$generator = $backoff->execute($call);
@oprudkyi I couldn't get the reason(s) why a resume token isn't there for certain rare cases like you pointed.
But assuming that just happens androws(..)
returns a generator, a retry without resume token will give unwanted duplicate results.
I recommend a user level retry strategy to handle such cases.
@vishwarajanand it's not rare, we are getting 50+ errors daily on prod. and I don't understand why it can't be fixed there
just for info, the next ugly workaround helped. The first retry passes without error
/**
* @inheritDoc
* exponential back-off on unavailable errors
* https://github.com/googleapis/google-cloud-php/issues/5880
*/
public function select($query, $bindings = [], $useReadPdo = true): array
{
$maxAttempts = 8;
$attempts = 0;
do {
$attempts++;
try {
return parent::select($query, $bindings, $useReadPdo);
} catch (QueryException $e) {
$errorMessage = $e->getMessage();
if (
$attempts <= $maxAttempts
&& strpos($errorMessage, "UNAVAILABLE") !== false
&& strpos($errorMessage, "Broken pipe") !== false
) {
// Broken pipe error, use exponential back-off, from 0.2sec to 25 sec
usleep(100000 * pow(2, $attempts));
} else {
throw $e;
}
}
} while (true);
}
where parent::select is from https://github.com/colopl/laravel-spanner/blob/fbd7a2cab4b0b5e4d11857d356b4518177aed4e3/src/Connection.php#L258
/**
* @inheritDoc
*/
public function select($query, $bindings = [], $useReadPdo = true): array
{
return $this->run($query, $bindings, function ($query, $bindings) {
if ($this->pretending()) {
return [];
}
$generator = $this->getDatabaseContext()
->execute($query, ['parameters' => $this->prepareBindings($bindings)])
->rows();
return iterator_to_array($generator);
});
}
@taka-oyama but I am still unsure if this this workaround needed in laravel-spanner
I just checked the logs for our service and saw similar errors at the rate of 1~2 an hour.
"message": "Connection reset by peer",
"code": 14,
"status": "UNAVAILABLE",
"details": []
Can we do something like
Within transaction -> Retry the whole transaction through below.
Outside the transaction -> Retry individual queries through below.
Nodejs version seems to have fixed this back in 2020. https://github.com/googleapis/nodejs-spanner/pull/795
Dotnet has fixed a similar issue. https://github.com/googleapis/google-cloud-dotnet/issues/5977
Hi @taka-oyama Thank you
Within transaction -> Retry the whole transaction through below.
For us there weren't any exceptions inside explicit transactions. just within transaction-less selects. I suspect transactions somehow are retried already
Looking at node, dotnet, and java, @olavloite seems to know a lot about how it works and has made PRs for many similar issues.
@olavloite, would it be possible to get some insight on how this is suppose to be handled?
(Caution, not a PHP expert here, so I might be reading something wrong in this code)
UNAVAILABLE
errors that happen for a SELECT
statement can generally be retried safely. The logic should be:
ExecuteStreamingSql
fails with an UNAVAILABLE
error, the initial call to ExecuteStreamingSql
can be retried without a resume token.ExecuteStreamingSql
should buffer all rows it receives until it sees a resume token, or until its buffer is full. Normally the first will happen. It then emits the rows it has seen so far. This is also the logic that is implemented in the PHP client library.ExecuteStreamingSql
call should be retried with the last seen resume token. If there is no resume token and the stream has emitted rows, then retrying is not safe. This is a very rare case. If the stream fails halfway and there is no resume token, but the method has also not emitted any rows, then it's also safe to just retry the initial ExecuteStreamingSql
call without a resume token.From what I can see the only thing that needs to be modified here is this line: https://github.com/googleapis/google-cloud-php/blob/fda8c6c0061184e889d039eb4bbb2e88e35be9a7/Spanner/src/Result.php#L180
It is safe to retry the initial call as long as the method has not returned any rows to the caller.
Wow! Thank you so much for the quick response! 👏🥹
@vishwarajanand Now that we understand how dotnet and js libraries handle this, would it be possible to get a fix for this?
I raised a fix to retry when no results have yielded, with or without the resume token.
Thank you!
Environment details
Steps to reproduce
it happens very rarely , once per day or week, on relatively loaded clusters and with simple random queries and I can't reproduce it
according to https://cloud.google.com/spanner/docs/error-codes the call should be retried
but according to source https://github.com/googleapis/google-cloud-php/blob/fda8c6c0061184e889d039eb4bbb2e88e35be9a7/Spanner/src/Result.php#L231 there are cases when call won't be retried
i.e. it retried only when there is set $resumeToken , but it never set for queries with small results
sample code I used