spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server"

googleapis / google-cloud-cpp

C++ Client Libraries for Google Cloud Services

https://cloud.google.com/

Apache License 2.0

545 stars 369 forks source link

spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #4714

Closed thiagotnunes closed 3 years ago

thiagotnunes commented 4 years ago

This bug is related to the Spanner client library.

For long lived transactions (>= 30 minutes), in the case of large PDML changes, it is possible that the gRPC connection is terminated with an error "Received unexpected EOS on DATA frame from server".

In this case, we need to retry the transaction either with the received resume token obtained on reading the stream or from scratch. This will ensure that the PDML transaction continues to execute until it is successful or a hard timeout is reached.

We have already implemented such change in the Java client library, for more information see this PR: googleapis/java-spanner#360.

In order to test the fix, we can use a large spanner database. Please speak to @thiagotnunes for more details.

mr-salty commented 4 years ago

@thiagotnunes I see the java change looks for any of these strings, I assume we should do the same?

"HTTP/2 error code: INTERNAL_ERROR" "Connection closed with unknown cause" "Received unexpected EOS on DATA frame from server"

thiagotnunes commented 4 years ago

@mr-salty There is another error that we have not seen in the Java client library, but we have seen in other libraries (which is curious): "RST_STREAM" (https://github.com/googleapis/python-spanner/pull/122/files#diff-81c1269f69a551cb02a056013d0db2e3R37).

If you'd like to be cover all grounds, I would retry on the 3 you mentioned and the RST_STREAM one.

mr-salty commented 4 years ago

Hm, I think we'll need some larger changes to address this in C++ - currently we use a non-streaming ExecuteSql call with a timeout of 10 minutes.

IIUC Java uses a streaming call (does that imply we do periodically receive resume_token responses from the backend?), and has an overall timeout of 2 hours. I was also looking at #4528 which doesn't have a specific PDML timeout, so I assume we would use the ExecuteStreamingSql timeout of 1 hour? or, is the 2 hours important?

@thiagotnunes I normally work pretty late if you're available for a chat sometime later

thiagotnunes commented 4 years ago

@mr-salty sorry I think I missed you. I scheduled a meeting for us to go over it next week.

devjgm commented 4 years ago

I see PRs for this. Is this issue fixed?

mr-salty commented 4 years ago

I still have a PR pending and need to test it with a real long-running query (Thiago added me to the relevant project and sent me instructions)

coryan commented 3 years ago

Is this done? It is now out of SLO.

mr-salty commented 3 years ago

greg and I discussed this last week. I think we can close this bug because what is (possibly) left to do is change the retry timeouts per #4528 , which we weren't able to reach consensus on.

If a user had long running-queries and manually set the timeouts long enough, they should not run into the issue (not properly resuming) that was the initial motivation for this issue. with the default timeouts, their query would time out before they ever saw this issue.