pendulum-chain / spacewalk

Apache License 2.0

Add changes to fix issues running the Spacewalk client in Kubernetes #519

Closed · ebma closed this 3 months ago

ebma commented 4 months ago

Removes the backoff crate. Since its retry() function seems to introduce some side effects, we now implement the retry logic manually. The existing RETRY_TIMEOUT and RETRY_INTERVAL constants keep their values, but they are now only used to derive the number of retries.
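
For illustration, a minimal sketch of what such a manual retry helper could look like, assuming tokio and the log crate; the constant values, names, and signatures here are placeholders and may differ from the actual Spacewalk code:

```rust
use std::time::Duration;

const RETRY_TIMEOUT: Duration = Duration::from_secs(300); // assumed value
const RETRY_INTERVAL: Duration = Duration::from_secs(10); // assumed value

pub async fn retry_with_log<T, E, F, Fut>(mut operation: F, log_msg: &str) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    // The number of retries is derived from the old timeout and interval.
    let max_retries = (RETRY_TIMEOUT.as_secs() / RETRY_INTERVAL.as_secs()).max(1);

    let mut attempt = 0;
    loop {
        match operation().await {
            Ok(value) => return Ok(value),
            Err(error) if attempt < max_retries => {
                attempt += 1;
                log::info!("{log_msg}: {error}. Retrying...");
                tokio::time::sleep(RETRY_INTERVAL).await;
            }
            // Retry budget exhausted: hand the error back to the caller.
            Err(error) => return Err(error),
        }
    }
}
```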

Retrying on Subxt RPC errors due to a closed connection

Sometimes, the runner encounters the following error:

[2024-05-07T13:56:14Z INFO  runner::runner] Error fetching executable: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...
[2024-05-07T13:56:15Z INFO  runner::runner] Error fetching executable: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...

I found this PR, which adds a new experimental implementation of an RPC client that automatically reconnects. That implementation is only available in subxt v0.35 or later. I tried bumping the subxt dependencies we use in Spacewalk to that version, but I ran into conflicts because our Polkadot dependencies are too outdated. I created https://github.com/pendulum-chain/spacewalk/issues/521 as a follow-up, and we'll ignore this issue for now.

Related to https://github.com/pendulum-chain/tasks/issues/207.

ebma commented 4 months ago

@pendulum-chain/devs this is now ready for review

ebma commented 4 months ago

> Regarding the error due to a closed connection, could we retry with a new one after a failure here by simply creating a new client instance with Ok(OnlineClient::from_url(url).await?)?

Good point. I looked into this again and realized that this logic was already more or less in place: in this loop, the runner is restarted if try_get_release() returns an error. The problem was that my changes were missing the maximum retry timeout that the backoff crate provides (max_elapsed_time). Instead of defining a maximum duration for which the logic is retried, we can only specify the number of retries, but I now derive that number here, so the result is similar. With this, the retry_with_log() functions will return an error within a bounded time; that error is either caught inside the loop I linked above or, if it happens elsewhere, makes the runner stop entirely. Once stopped, they can be automatically restarted in the infrastructure.
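
For context, the control flow described above boils down to something like the following rough sketch; the function names are stand-ins, not the actual runner code:

```rust
// Outer loop: when the bounded retry inside try_get_release() gives up,
// the error is caught here and the runner simply starts over instead of
// hanging forever.
async fn runner_loop(url: &str) {
    loop {
        match try_get_release(url).await {
            Ok(release) => run_vault_client(release).await,
            Err(error) => {
                log::warn!("Error fetching executable: {error}. Restarting runner...");
            }
        }
    }
}

// Hypothetical stand-ins for the real runner functions.
async fn try_get_release(_url: &str) -> Result<String, String> {
    Ok("vault-client-release".to_string())
}

async fn run_vault_client(_release: String) {}
```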

gianfra-t commented 4 months ago

> Once stopped, they can be automatically restarted in the infrastructure.

Okay, understood. I hadn't considered this behavior, makes sense!

ebma commented 3 months ago

Over the weekend, the runner client again encountered the error.

[2024-05-13T08:43:43Z INFO  runner::runner] Error reading chain storage for release: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...

Because I had slightly changed the phrasing of the log messages, I was able to pin it down to this line, which means the runner was just 'hanging' in this loop statement. I also noticed that the call to maybe_restart_client doesn't do anything in this case, because the child process (the actual vault client binary) is still running and working fine. I tested it, and it's still able to process issue and redeem requests, and the RPC connection of the vault client is also working.

This is interesting, but my assumption is that the client's RPC connection stays healthy because of its periodic restart every 3 hours, so its connection is always fresh, whereas the runner never restarts or refreshes its RPC connection. That's why I added some logic to the runner that simply tries to create a new RPC client.
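
The idea is roughly the following sketch, assuming subxt's OnlineClient and the log crate; this is not the exact diff, and the storage query below is a hypothetical stand-in for the runner's real call:

```rust
use subxt::{OnlineClient, PolkadotConfig};

async fn read_release_with_fresh_client(
    url: &str,
    client: &mut OnlineClient<PolkadotConfig>,
) -> Result<(), subxt::Error> {
    if let Err(error) = read_chain_storage(client).await {
        log::info!("Error reading chain storage for release: {error}. Recreating RPC client...");
        // The old client's background task was terminated, so open a new
        // connection instead of retrying on the dead one.
        *client = OnlineClient::<PolkadotConfig>::from_url(url).await?;
        return read_chain_storage(client).await;
    }
    Ok(())
}

// Hypothetical stand-in for the actual storage query made by the runner.
async fn read_chain_storage(
    _client: &OnlineClient<PolkadotConfig>,
) -> Result<(), subxt::Error> {
    Ok(())
}
```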

ebma commented 3 months ago

Something was off with the iterator returned by the exponential-backoff crate: it kept returning values indefinitely. I removed that crate completely, and now we do everything manually. I also added some small tests for the retry_with_log_... functions so that we can be sure they don't retry more often than they should.
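
The kind of test meant here could look like the sketch below, building on the retry_with_log sketch from earlier in this thread and using tokio's paused clock so the intervals don't slow the test down; again, the names are assumptions, not the exact Spacewalk tests:

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use std::sync::atomic::{AtomicUsize, Ordering};

    // Paused time makes tokio::time::sleep advance instantly, so the derived
    // retry budget can be exercised without real delays (requires tokio's
    // "test-util" feature).
    #[tokio::test(start_paused = true)]
    async fn retry_with_log_stops_after_max_retries() {
        let attempts = AtomicUsize::new(0);

        let result: Result<(), &str> = retry_with_log(
            || {
                attempts.fetch_add(1, Ordering::SeqCst);
                async { Err("always fails") }
            },
            "test operation",
        )
        .await;

        assert!(result.is_err());
        // One initial attempt plus the derived number of retries.
        let max_retries = (RETRY_TIMEOUT.as_secs() / RETRY_INTERVAL.as_secs()).max(1) as usize;
        assert_eq!(attempts.load(Ordering::SeqCst), max_retries + 1);
    }
}
```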

[Screenshot 2024-05-13 at 19:27:01]

b-yap commented 3 months ago

@ebma you can squash and merge this now. I reverted to the older version, as 2024-04-18 has some problems.

Pendulum's CI is using 2024-04-18, but it's using the stable spacewalk from when it was still 2024-02-09, and there have been no problems so far.