pendulum-chain / spacewalk

Fix occasional 'HorizonResponse: DecodeError' #501

Closed ebma closed 5 months ago

ebma commented 5 months ago

Sometimes, when the vault client tries to submit a Stellar transaction as part of a redeem request, Horizon responds with a series of timeout errors. At some point the response changes: instead of the timeout error, the Horizon server returns a different response that the vault client fails to decode. The vault client then stops trying to execute the redeem request until the next periodic restart.

Example 1: Logs of a Spacewalk client

Mar 27 10:37:41.221 ERROR transfer_stellar_asset{request_type=Redeem request_id=0xb93fbb834856873b5afae7841249305eb58557bc10d3f7b2cf2f941591502548}: wallet::horizon::responses: Response returned an error: HorizonSubmissionError { title: "Timeout", status: 504, reason: "Your request timed out before completing.  Please try your request again. If you are submitting a transaction make sure you are sending exactly the same transaction (with the same sequence number).", result_code_op: [], envelope_xdr: None }
Mar 27 10:37:41.221  WARN transfer_stellar_asset{request_type=Redeem request_id=0xb93fbb834856873b5afae7841249305eb58557bc10d3f7b2cf2f941591502548}: wallet::horizon::horizon: submitting transaction to https://horizon.stellar.org with seq no: Some(197273654701064434) failed with HorizonSubmissionError { title: "Timeout", status: 504, reason: "Your request timed out before completing.  Please try your request again. If you are submitting a transaction make sure you are sending exactly the same transaction (with the same sequence number).", result_code_op: [],envelope_xdr: None }
Mar 27 10:38:32.111 ERROR transfer_stellar_asset{request_type=Redeem request_id=0xb93fbb834856873b5afae7841249305eb58557bc10d3f7b2cf2f941591502548}: wallet::horizon::responses: Response returned an error: HorizonSubmissionError { title: "Timeout", status: 504, reason: "Your request timed out before completing.  Please try your request again. If you are submitting a transaction make sure you are sending exactly the same transaction (with the same sequence number).", result_code_op: [], envelope_xdr: None }
Mar 27 10:38:32.111  WARN transfer_stellar_asset{request_type=Redeem request_id=0xb93fbb834856873b5afae7841249305eb58557bc10d3f7b2cf2f941591502548}: wallet::horizon::horizon: submitting transaction to https://horizon.stellar.org with seq no: Some(197273654701064434) failed with HorizonSubmissionError { title: "Timeout", status: 504, reason: "Your request timed out before completing.  Please try your request again. If you are submitting a transaction make sure you are sending exactly the same transaction (with the same sequence number).", result_code_op: [],envelope_xdr: None }
Mar 27 10:39:32.149 ERROR vault::redeem: Failed to process Redeem request #0xb93f…2548 due to error: StellarWalletError: Error fetching horizon data: error decoding response body: expected value at line 1 column 1

Example 2: Logs of this failing CI job:

thread 'test_redeem_succeeds_on_mainnet' panicked at 'should return ok: HorizonResponseError(reqwest::Error { kind: Decode, source: Error("expected value", line: 1, column: 1) })', clients/vault/tests/helper/helper.rs:196:6

TODO

Make sure that the vault client does not stop retrying the submission of the Stellar transaction for redeem requests.
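
A minimal sketch of what "does not stop retrying" could look like, assuming a hypothetical `SubmitError` type and a hypothetical `submit_with_retry` helper (neither is taken from the Spacewalk code base). The idea is that a decode failure is treated like a timeout: the client backs off and tries again, and only gives up on errors that a retry cannot fix.

```rust
use std::time::Duration;

/// Hypothetical error type for illustration; the real wallet crate has its own.
#[derive(Debug, Clone, PartialEq)]
pub enum SubmitError {
    /// Horizon answered with a 504 timeout.
    Timeout,
    /// The response body could not be decoded as JSON.
    Decode(String),
    /// The transaction itself was rejected; retrying will not help.
    Rejected(String),
}

impl SubmitError {
    /// Timeouts and undecodable bodies are assumed to be transient.
    pub fn is_recoverable(&self) -> bool {
        matches!(self, SubmitError::Timeout | SubmitError::Decode(_))
    }
}

/// Calls `submit` until it succeeds, returns an unrecoverable error,
/// or `max_attempts` calls have been made. The delay doubles after each failure.
pub fn submit_with_retry<T, F>(
    mut submit: F,
    max_attempts: u32,
    base_delay: Duration,
) -> Result<T, SubmitError>
where
    F: FnMut() -> Result<T, SubmitError>,
{
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match submit() {
            Ok(value) => return Ok(value),
            // Recoverable error and attempts left: wait, then try again.
            Err(err) if err.is_recoverable() && attempt < max_attempts => {
                std::thread::sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            // Unrecoverable error, or attempts exhausted: give up.
            Err(err) => return Err(err),
        }
    }
}
```

As the Horizon timeout message in the logs above points out, each retry should resubmit exactly the same transaction with the same sequence number, so the closure passed as `submit` would reuse the already built envelope rather than build a new one.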

ebma commented 5 months ago

@pendulum-chain/product this should further reduce the sometimes very long time required for redeem requests.

b-yap commented 5 months ago

Finally found more info on the nth run:

Apr 01 06:09:16.840  INFO wallet::horizon::responses: interpret_response(): status: 403
Apr 01 06:09:16.840 ERROR wallet::horizon::responses: interpret_response(): got a failure response but failure for returning: wallet::horizon::responses::TransactionResponse
Apr 01 06:09:16.840 ERROR wallet::horizon::responses: interpret_response(): got a failure response but failed to convert to json
Apr 01 06:09:16.840 ERROR wallet::horizon::responses: interpret_response(): got a failure response in string: Ok("error code: 1020")

ebma commented 5 months ago

Nice find 👍 It's a bit weird, though. I found this section in the docs, which says that we'd receive a 429 error when exceeding the rate limits. However, 'error code: 1020' is apparently a Cloudflare error raised by a firewall rule. So maybe they have two rate-limiting mechanisms in place, and we are getting blocked by Cloudflare when we issue too many requests. In any case, we should probably treat this as a recoverable error so that the vault continues to retry.
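
One way to express "consider this a recoverable error" is to classify a failure response by its status code and raw body before attempting to decode it. The sketch below is only an illustration: `Failure` and `classify_failure` are hypothetical names, the "starts with `{`" check is a crude stand-in for real JSON parsing, and the choice of which status codes count as recoverable is an assumption, not something taken from the Horizon docs.

```rust
/// Hypothetical classification of a failed Horizon response.
#[derive(Debug, PartialEq)]
pub enum Failure {
    /// Worth retrying later (rate limiting, timeouts, proxy or firewall pages).
    Recoverable(String),
    /// Retrying the same request is not expected to help.
    Fatal(String),
}

/// Classifies a non-success response from its HTTP status and raw body text.
pub fn classify_failure(status: u16, body: &str) -> Failure {
    let body = body.trim();

    // Simplification: a Horizon error is a JSON object, so anything that does
    // not start with '{' (such as "error code: 1020") is assumed to come from
    // something in front of Horizon and is treated as transient.
    if !body.starts_with('{') {
        return Failure::Recoverable(format!(
            "non-JSON response (status {}): {}",
            status, body
        ));
    }

    match status {
        // Assumption: rate limiting and server-side errors are transient.
        429 | 500..=599 => Failure::Recoverable(format!("status {}", status)),
        _ => Failure::Fatal(format!("status {}: {}", status, body)),
    }
}
```

With this shape, the 403 response carrying "error code: 1020" from the logs above would land in `Failure::Recoverable` instead of surfacing as a decode error.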