hyperledger / cacti

Hyperledger Cacti is a new approach to the blockchain interoperability problem
https://wiki.hyperledger.org/display/cactus
Apache License 2.0
339 stars 278 forks source link

fix(connector-besu): do not crash if ledger unreachable - send HTTP 503 #3406

Open petermetz opened 1 month ago

petermetz commented 1 month ago

Description

Right now the API server will exit with a crash if the Besu connector receives a request and the backing ledger is not accessible. Instead of a full-on crash, we need to treat this as a recoverable error and send back HTTP 503 - Service Unavailable with a Retry-After header with exponential backoff where it starts from 1 second and the doubles every re-try.

The easiest way to reproduce is to follow this guide and remove the part of the command from step 2 that binds the API server to the host machine's network: --network=host \

https://github.com/hyperledger/cacti/tree/main/packages/cactus-plugin-ledger-connector-besu#testing-api-calls-with-the-container

Acceptance Criteria

  1. Test coverage included for this case.
  2. Test verifies that API server does not crash in the mentioned use-case.
jagpreetsinghsasan commented 1 month ago

Looks like Orion isn't configured correctly in the cactus-besu-all-in-one:2024-07-04-8c030ae. @ruzell22 please check if the issue persists on the latest tag as well or not. If the error still persists, we will increase the scope of this task with a potential fix for the same.

UPDATE: I will create an issue to remove tessera, orion base images from the besu aio (besu mp has it and is used for private tx)

petermetz commented 1 month ago

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After Retry-After - HTTP | MDN The Retry-After response HTTP header indicates how long the user agent should wait before making a follow-up request. There are three main cases this header is used:

Exponential back-off - the delay between retries is increasing exponentially with every retry. So first you wait 2 seconds, 4, 8, etc.

petermetz commented 1 month ago

Let's say start with the run-transaction endpoint of the besu connector and do these things:

  1. Refactor the method within the connector so that it throw the specific kind of exception, let's say you create a new exception type called BesuLedgerInaccessibleException which is a sub-class of Error and you throw this when you detect the issue in the connector that a transaction was not possible to be executed because the network connection to the ledger is broken.
  2. Then in the HTTP endpoint implementation in the handleRequest method you can write some logic in the catch block that sets the retry-after headers (linked above) but it only should do this if the exception you caught (that was thrown by the connector) is an instanceof BesuLedgerInaccessibleException so that way we know for sure that this was the problem.
petermetz commented 2 weeks ago

https://github.com/web3/web3.js/issues/6900

petermetz commented 2 weeks ago

@ruzell22 I figured it out! The issue is deep within the web3 library itself that was causing the crashes. I pushed a commit to your branch with fixes that make it work so that it doesn't crash the entire nodejs process anymore when the connection is down to the ledger.

The other issue with the code was that you were having the re-try logic in the catch block within the endpoint request handler and this was trying to modify the http headers after the response was sent which was another thing that was ALSO crashing the entire nodejs process (which is why seemingly none of my fixes worked at first that were targeted at the web3 library itself).

So now the code is like this:

  1. yarn patch to web3-eth-acconts@1.6.1 so that it doesn't crash the nodejs process
  2. reverted your changes in the request handler so that those also don't crash the nodejs process
  3. in the current state it's not returning a 503 but a 500

What I need you to do to finish the task:

  1. Make sure that when the connection is down an http 503 code is returned (this part of your change was good)
  2. Add a static retry-after header value of 5 seconds (I'm dropping the requirement to have the exponential back-off for now because we've spent too much time on this task already so the exponential back-off is something we can address later)
  3. Add test cases asserting that when the ledger connection is down the API call fails with 503 and not a generic 500.

To verify the changes you make, take a look at this snippet (that I'll also add to the cmd-api-server readme). You can use this to mount your local changes on the besu connector code into the api server container and then configure the api server to not install from the registry but instead import the local package you have on your file-system (this is what I used to verify my fixes instead of publishing to npm every time)