txpipe / oura

The tail of Cardano
https://txpipe.github.io/oura
Apache License 2.0

ConnectionReset, message: "Connection reset by peer" #256

Closed · CyberCyclone closed this issue 2 years ago

CyberCyclone commented 2 years ago

I'm getting the error below after running oura in daemon mode for a long time. I'm guessing it's not handling a peer disconnection.

[2022-04-21T19:13:01Z ERROR pallas_multiplexer] Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }
thread '<unnamed>' panicked at 'explicit panic', /cargo/registry/src/github.com-1ecc6299db9ec823/pallas-multiplexer-0.8.0/src/lib.rs:99:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2022-04-21T19:13:01Z ERROR pallas_multiplexer] Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
thread '<unnamed>' panicked at 'explicit panic', /cargo/registry/src/github.com-1ecc6299db9ec823/pallas-multiplexer-0.8.0/src/lib.rs:74:17
thread 'thread '<unnamed><unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: RecvError' panicked at '', called `Result::unw
thread 'main' panicked at 'error in pipeline thread: Any { .. }', src/bin/oura/daemon.rs:289:23
exec error: Error: Command failed: ./oura-1.3.1 daemon --config daemon.toml --cursor 911142,eb74846314f4389a12affc0c9bb664453dbb6013fcb62172
rvcas commented 2 years ago

idk if we should handle this in pallas or oura

CyberCyclone commented 2 years ago

The other thing I was thinking about, in terms of error handling in oura: is it expected that the error will eventually make it through to the sink target so it can handle it?

As in, an error event?

scarmuega commented 2 years ago

yes, this is tricky.

I usually run Oura as a Pod in a Kubernetes cluster. When the process panics, the whole Pod restarts, and I can control the retry / backoff policy using common k8s settings.

I understand that not every implementation will have an orchestration framework available to deal with these errors. One thing I'm certain of: any retry mechanism needs to be configurable by the end user, not hidden as part of the implementation. The worst scenario is a zombie process that isn't doing anything but isn't dying either.
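To make that concrete, this is roughly the shape of the Deployment I use (a minimal sketch; the image tag and config wiring are illustrative, not a reference setup). The kubelet restarts the crashed container with its built-in exponential backoff (CrashLoopBackOff: 10s, 20s, 40s, ... capped at 5 minutes):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: oura
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: oura
      template:
        metadata:
          labels:
            app: oura
        spec:
          # restartPolicy defaults to Always in a Deployment; the kubelet
          # applies CrashLoopBackOff when the container panics repeatedly
          containers:
            - name: oura
              image: ghcr.io/txpipe/oura:v1.3.1   # illustrative tag
              args: ["daemon", "--config", "/etc/oura/daemon.toml"]
              volumeMounts:
                - name: config
                  mountPath: /etc/oura
          volumes:
            - name: config
              configMap:
                name: oura-config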

Off the top of my head, I suggest:

Adding a "disconnect" event is an option too. I thought about adding these type of "flow control" events for other use cases too (like stopping the pipeline), but I believe that we should wait until v2 so that we do it without adding complexity to the current event model.

nalane commented 2 years ago

I work at MLabs, and we would love to use Oura for our projects, but this is actually a major issue that's preventing us from committing to it wholesale. I would love to contribute to solving this problem if we can identify a good way to go about it, either here or in Pallas.

scarmuega commented 2 years ago

I've actually been working on a draft version of a retry mechanism for the node-to-client source. I need to update it given some recent changes on main, but it shouldn't take long.

I would appreciate some help evaluating the solution and maybe battle-testing it a little bit.

@CyberCyclone @nalane just to prioritize accordingly, do you plan on using N2C for your implementation?

nalane commented 2 years ago

I do, personally

CyberCyclone commented 2 years ago

I'm currently using N2N as it's easier to maintain. My current implementation uses NodeJS to control Oura and pass data to the DB cluster. I'm in the process of implementing a "rapid" sync by running a dozen Oura instances connected to 10+ cardano-node servers, each populating the DB from a different block point. So for me, N2N is the best way to do this.
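The wrapper is basically a supervision loop: spawn oura, and when the child dies (like in the log above), restart it from the last cursor stored in the DB. A simplified sketch of the idea (not my exact code; the binary path, backoff numbers, and the cursor helper are placeholders):

    // supervisor.js - restart oura when the child process dies
    const { spawn } = require('child_process');

    let attempt = 0;

    function loadCursorFromDb() {
      // placeholder: in the real setup the "slot,hash" cursor is read back
      // from the DB cluster so the restart resumes where it left off
      return null;
    }

    function startOura() {
      const args = ['daemon', '--config', 'daemon.toml'];
      const cursor = loadCursorFromDb();
      if (cursor) args.push('--cursor', cursor);

      const child = spawn('./oura-1.3.1', args, { stdio: 'inherit' });

      child.on('exit', (code) => {
        // exponential backoff: 1s, 2s, 4s, ... capped at 60s (placeholder policy)
        const delay = Math.min(1000 * 2 ** attempt, 60000);
        attempt += 1;
        console.error(`oura exited with code ${code}; restarting in ${delay}ms`);
        setTimeout(startOura, delay);
      });
    }

    startOura();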

scarmuega commented 2 years ago

FYI, almost done with this feature; working draft in #332.

N2N done, still working on N2C.

scarmuega commented 1 year ago

@CyberCyclone @nalane the automatic retry is available since v1.5. Both N2N and N2C will attempt to reconnect (and handshake) if the bearer closes abruptly. The retry uses an exponential backoff delay (1s, 2s, 4s, etc).

There are two new config values that control the logic:

chainsync_max_retries: the number of retry attempts before exiting the pipeline. Defaults to 50.
chainsync_max_backoff: the max number of seconds to wait between retries. Defaults to 60.
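For reference, a sketch of the relevant daemon.toml fragment, with the defaults spelled out. The relay address is just an example, and the exact placement of the retry settings (a [source.retry_policy] table here) is an assumption on my part; double-check it against the docs:

    [source]
    type = "N2N"
    address = ["Tcp", "relays-new.cardano-mainnet.iohk.io:3001"]
    magic = "mainnet"

    # assumed placement of the new retry settings
    [source.retry_policy]
    chainsync_max_retries = 50   # default: give up after 50 attempts
    chainsync_max_backoff = 60   # default: cap the backoff delay at 60 seconds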