googleapis / google-cloud-php

Google Cloud Client Library for PHP
https://cloud.google.com/php/docs/reference
Apache License 2.0

[Spanner] Server randomly returns "ServerException: INTERNAL: Received RST_STREAM with error code 2". #5473

Closed (taka-oyama closed this issue 1 year ago)

taka-oyama commented 2 years ago

Here is a more detailed error.

Google\Cloud\Core\Exception\ServerException: {
    "message": "Received RST_STREAM with error code 2",
    "code": 13,
    "status": "INTERNAL",
    "details": []
} in /project/vendor/google/cloud-core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 /project/vendor/google/cloud-core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException(Object(Google\ApiCore\ApiException))
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream(Object(Google\ApiCore\ServerStream))
#2 /project/vendor/google/cloud-spanner/src/Result.php(191): Generator->valid()
#3 [internal function]: Google\Cloud\Spanner\Result->Google\Cloud\Spanner\{closure}()
#4 /project/vendor/google/cloud-core/src/ExponentialBackoff.php(80): call_user_func_array(Object(Closure), Array)
#5 /project/vendor/google/cloud-spanner/src/Result.php(192): Google\Cloud\Core\ExponentialBackoff->execute(Object(Closure))
#6 [internal function]: Google\Cloud\Spanner\Result->rows()
....

We have been seeing this error for a few weeks now across various projects running various versions of google/cloud-spanner (including one that is running the latest v1.51.2).

I have not been able to reproduce this error since it happens randomly.

When the error occurs, it shows up in bulk within a span of a few seconds across different pods on K8s. It always seems to occur on the first query within a transaction.

Would it be possible to add a retry for this specific error here?

I'm suggesting this because google-cloud-go seems to be doing something similar.
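Until something like this exists in the library, an application-level retry might look roughly like the sketch below. It is only an illustration, not the library's behavior: `retryOnRstStream` and `isRstStreamError` are hypothetical helper names, and the check assumes the thrown exception carries the gRPC status code 13 (INTERNAL) via `getCode()` and the text `RST_STREAM` in its message, as in the error shown above.

```php
<?php
declare(strict_types=1);

// gRPC status code for INTERNAL, matching "code": 13 in the error above.
const GRPC_INTERNAL = 13;

/**
 * Hypothetical helper: decides whether an exception looks like the
 * "Received RST_STREAM" failure. Assumes getCode() holds the gRPC status code.
 */
function isRstStreamError(\Throwable $e): bool
{
    return $e->getCode() === GRPC_INTERNAL
        && strpos($e->getMessage(), 'RST_STREAM') !== false;
}

/**
 * Hypothetical helper: runs $operation, retrying with exponential backoff
 * only when the failure looks like an RST_STREAM error. Any other exception,
 * or the last failed attempt, is rethrown unchanged.
 */
function retryOnRstStream(callable $operation, int $maxAttempts = 3, int $baseDelayMs = 100)
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (\Throwable $e) {
            if ($attempt >= $maxAttempts || !isRstStreamError($e)) {
                throw $e;
            }
            // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}
```

The callable passed in should be a whole unit of work that is safe to run again from the start, for example the entire transaction rather than a single statement inside it, since a retry re-executes everything in the closure.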

I usually don't post issues until I have reproducible code but this has been affecting production for weeks, so I am eager to get some kind of solution to mitigate the error.

Also, does anyone here know what "error code 2" is? Understanding it might help to better understand the error.

Thanks.

Environment details

taka-oyama commented 2 years ago

I've contacted support and was informed that the Spanner team is aware of the issue and is working towards a fix.

Closing it for now.

taka-oyama commented 1 year ago

Unfortunately, this was not completely addressed, and we received a WONTFIX response from the Google support team.

So I believe this error is here to stay.

Would it be possible to add this error to the auto-retry mechanism below?

https://github.com/googleapis/google-cloud-php/blob/e9ccbbb0ae5060e875b9b35e9711aee85f21a485/Spanner/src/Database.php#L863-L865

The Go library recently added a retry for all internal server errors (probably for the same reason): https://github.com/googleapis/google-cloud-go/pull/6699

bshaffer commented 1 year ago

This seems to be a feature request to retry certain error responses. This behavior is in the works!

See https://github.com/googleapis/google-auth-library-php/pull/359

taka-oyama commented 1 year ago

Thank you! Hope to see that get merged soon!

taka-oyama commented 1 year ago

The Java client added a fix for this as well.

https://github.com/googleapis/java-spanner/pull/2111

zeriyoshi commented 1 year ago

I too hope this response is incorporated ASAP.

vishwarajanand commented 1 year ago

We are clarifying internally whether simply adding a retry to the existing call is sufficient (an easy fix) or whether we need to re-create the connection (which will take longer).