thiagotnunes closed 3 years ago
@thiagotnunes I see the Java change looks for any of these strings; I assume we should do the same?
"HTTP/2 error code: INTERNAL_ERROR", "Connection closed with unknown cause", "Received unexpected EOS on DATA frame from server"
@mr-salty There is another error that we have not seen in the Java client library, but we have seen in other libraries (which is curious): "RST_STREAM" (https://github.com/googleapis/python-spanner/pull/122/files#diff-81c1269f69a551cb02a056013d0db2e3R37).
If you'd like to cover all bases, I would retry on the 3 you mentioned and the RST_STREAM one.
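A minimal sketch of that check in C++, assuming the client already has the error message as a string. The helper name `IsRetryableInternalError` is hypothetical and not part of any client library; the four substrings are the ones quoted in this thread.

```cpp
#include <array>
#include <string_view>

// Hypothetical helper (not a library API): returns true when an error
// message contains one of the substrings named in this thread.
bool IsRetryableInternalError(std::string_view message) {
  static constexpr std::array<std::string_view, 4> kRetryableSubstrings = {
      "HTTP/2 error code: INTERNAL_ERROR",
      "Connection closed with unknown cause",
      "Received unexpected EOS on DATA frame from server",
      "RST_STREAM",
  };
  for (auto const substring : kRetryableSubstrings) {
    // A plain substring search; the message may carry extra context around it.
    if (message.find(substring) != std::string_view::npos) return true;
  }
  return false;
}
```

Matching on substrings rather than whole messages keeps the check tolerant of extra context that the transport may add before or after the text.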
Hm, I think we'll need some larger changes to address this in C++. Currently we use a non-streaming ExecuteSql call with a timeout of 10 minutes.
IIUC Java uses a streaming call (does that imply we periodically receive resume_token responses from the backend?) and has an overall timeout of 2 hours. I was also looking at #4528, which doesn't have a specific PDML timeout, so I assume we would use the ExecuteStreamingSql timeout of 1 hour? Or is the 2 hours important?
@thiagotnunes I normally work pretty late if you're available for a chat sometime later
@mr-salty sorry I think I missed you. I scheduled a meeting for us to go over it next week.
I see PRs for this. Is this issue fixed?
I still have a PR pending and need to test it with a real long-running query (Thiago added me to the relevant project and sent me instructions)
Is this done? It is now out of SLO.
Greg and I discussed this last week. I think we can close this bug, because what is (possibly) left to do is change the retry timeouts per #4528, which we weren't able to reach consensus on.
If a user had long-running queries and manually set the timeouts long enough, they should not run into the issue (not properly resuming) that was the initial motivation for this issue. With the default timeouts, their query would time out before they ever saw this issue.
This bug is related to the Spanner client library.
For long-lived transactions (>= 30 minutes), in the case of large PDML changes, it is possible that the gRPC connection is terminated with the error "Received unexpected EOS on DATA frame from server".
In this case, we need to retry the transaction, either with the resume token received while reading the stream or from scratch. This will ensure that the PDML transaction continues to execute until it is successful or a hard timeout is reached.
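The retry behavior described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the client library's implementation: `StreamResult`, `StreamAttempt`, and `RunWithResume` are hypothetical names, and a real loop would also back off between attempts.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <utility>

// All names below are hypothetical; this is not the Spanner client API.
struct StreamResult {
  bool ok;                   // true if the stream finished successfully
  bool retryable;            // true if the failure was a transient stream break
  std::string resume_token;  // last resume token seen; empty if none
};

// Runs one streaming attempt. An empty token means "start from scratch".
using StreamAttempt =
    std::function<StreamResult(std::string const& resume_token)>;

// Retries `attempt` until it succeeds, fails with a non-retryable error, or
// the hard timeout expires. Returns true only on success.
bool RunWithResume(StreamAttempt const& attempt,
                   std::chrono::steady_clock::duration hard_timeout) {
  auto const deadline = std::chrono::steady_clock::now() + hard_timeout;
  std::string resume_token;  // empty until the backend sends one
  while (std::chrono::steady_clock::now() < deadline) {
    StreamResult result = attempt(resume_token);
    if (result.ok) return true;
    if (!result.retryable) return false;
    // Prefer the newest token; if this attempt produced none, keep the
    // previous token (or stay empty, which restarts from scratch).
    if (!result.resume_token.empty()) {
      resume_token = std::move(result.resume_token);
    }
  }
  return false;  // hard timeout reached
}
```

A caller would wrap the actual streaming RPC in the `attempt` callback and classify transient stream breaks (for example, the EOS error above) as retryable.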
We have already implemented such a change in the Java client library. For more information, see this PR: googleapis/java-spanner#360.
In order to test the fix, we can use a large Spanner database. Please speak to @thiagotnunes for more details.