radicle-dev / radicle-upstream

Desktop client for Radicle.

Network Introspection #1745

Closed alexjg closed 3 years ago

alexjg commented 3 years ago

Recently we've encountered a number of problems with project replication. These have been challenging to debug, partly because we do not have an easy way of introspecting the network state. We do have logs, but these are only useful when you have a very specific hypothesis about what the problem might be, i.e. they don't lend themselves to exploring and looking for things that don't match your assumptions. Therefore I think it would be worthwhile to spend some time building tools for examining the network state.

"Network state" here refers to two things:

  1. The waiting room: what requests the system knows about and what their status is.
  2. The HyParView cluster membership: what the active and passive sets of peers are.

In both cases we are interested not just in the current state but also in the events that have occurred and how they have affected the state. For example, when debugging a buggy state transition in the waiting room I want to ask questions like "what events have occurred related to requests for <some URN>, and what was the state of the waiting room before and after those events?"
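A minimal sketch of what "events with before and after state" could look like, assuming a heavily simplified waiting room that maps each URN to a single status. All names here (`RequestStatus`, `WaitingRoomEvent`, `RecordingWaitingRoom`) are hypothetical and do not come from radicle-upstream or radicle-link; the real waiting room has more states and more data per request.

```typescript
// Hypothetical, simplified status set; the real waiting room may differ.
type RequestStatus = "created" | "requested" | "found" | "cloning" | "cloned" | "failed";

// Snapshot of the waiting room: URN -> status of the request for that URN.
type WaitingRoomState = Readonly<Record<string, RequestStatus>>;

// One logged transition, carrying the state on both sides of it.
interface WaitingRoomEvent {
  urn: string;
  kind: string;
  before: WaitingRoomState;
  after: WaitingRoomState;
}

class RecordingWaitingRoom {
  private state: WaitingRoomState = {};
  readonly log: WaitingRoomEvent[] = [];

  // Apply a transition and record the snapshots around it.
  // Snapshots are never mutated, so sharing `before` is safe.
  apply(urn: string, kind: string, next: RequestStatus): void {
    const before = this.state;
    const after: WaitingRoomState = { ...before, [urn]: next };
    this.log.push({ urn, kind, before, after });
    this.state = after;
  }

  // "What events have occurred related to requests for <some URN>?"
  eventsFor(urn: string): WaitingRoomEvent[] {
    return this.log.filter((event) => event.urn === urn);
  }
}
```

With this shape, answering the question above is a matter of calling `eventsFor(urn)` and reading `before` and `after` on each returned event.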

I suggest a sort of power user feature in upstream, accessible via a hotkey in the same manner as the Design System screen; call it the "network panel" or something. I imagine this as a screen with two panels, one for the waiting room and one for the cluster state. Each panel would have a table of events with some filters (e.g. filtering by URN or peer ID for the waiting room), and each event would have a before and after state attached to it.
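The filtering could be as simple as the following sketch, assuming each table row optionally carries a URN and a peer ID. `PanelEvent`, `EventFilter`, and `filterEvents` are made-up names for illustration, not existing upstream code.

```typescript
// Hypothetical row shape for the network panel's event table.
interface PanelEvent {
  timestamp: number;
  urn?: string;
  peerId?: string;
  description: string;
}

// A filter field left undefined means "do not filter on this field".
interface EventFilter {
  urn?: string;
  peerId?: string;
}

// Keep only the events that match every filter field that is set.
function filterEvents(events: readonly PanelEvent[], filter: EventFilter): PanelEvent[] {
  return events.filter(
    (event) =>
      (filter.urn === undefined || event.urn === filter.urn) &&
      (filter.peerId === undefined || event.peerId === filter.peerId)
  );
}
```

Combining the fields with a logical AND keeps the behaviour predictable: an empty filter shows everything, and each additional field narrows the table.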

To make these events available to upstream we will need to add additional events to the existing websocket event stream. We don't want to produce them all the time, as there are likely to be a lot of them. We also don't want a command line flag or similar to enable them, as this requires a restart, which will make it harder to observe intermittent problems. Therefore we will need a runtime configuration option which can be enabled by a toggle on the network panel.
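One way to picture the runtime toggle is an event source that drops events while disabled. This is only a sketch under assumptions: `ToggleableEventSource` is a hypothetical name, and the websocket transport is left out entirely. Events are passed as a thunk so that building a potentially large before/after snapshot is skipped when nobody is watching.

```typescript
type Listener<T> = (event: T) => void;

// Hypothetical event source whose output can be switched on and off at
// runtime, without a restart.
class ToggleableEventSource<T> {
  private enabled = false;
  private listeners: Listener<T>[] = [];

  setEnabled(enabled: boolean): void {
    this.enabled = enabled;
  }

  // Register a listener; the returned function unsubscribes it again.
  subscribe(listener: Listener<T>): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // `build` is only called while enabled, so disabled sources cost
  // almost nothing.
  emit(build: () => T): void {
    if (!this.enabled) return;
    const event = build();
    for (const listener of this.listeners) listener(event);
  }
}
```

The toggle on the network panel would then map to `setEnabled(true)` and `setEnabled(false)`, however that call ends up being transported to the process producing the events.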

xla commented 3 years ago

Great outline and the premise is clear. Couple of thoughts:

alexjg commented 3 years ago

To points 1 and 3: I definitely think there is more information we could be surfacing, and I think there are better ways of surfacing it than as a simple event log. I don't know exactly what those are though, so rather than trying to think that through now, I suggest we build a simple event log of the waiting room first and iterate as we encounter more questions. I'm new to the network stuff though, so if anyone has better suggestions than just an event log I would love to hear them.

Regarding 2, my worry was around memory usage of upstream as it's expected to be running for a long time. However, having thought about it a little more I guess we can just bound the number of events we store in upstream. In that case the toggle on the network panel would just determine whether upstream is storing events, not whether they are being produced.
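The bounded storage described here could look roughly like this. It is a sketch, not upstream code: `BoundedEventStore` and its methods are hypothetical names, and the capacity would need to be chosen based on real event sizes.

```typescript
// Hypothetical store that keeps at most `capacity` events and only
// records while the network panel toggle is on.
class BoundedEventStore<T> {
  private readonly capacity: number;
  private readonly events: T[] = [];
  private recording = false;

  constructor(capacity: number) {
    if (capacity < 1) throw new RangeError("capacity must be at least 1");
    this.capacity = capacity;
  }

  // The toggle decides whether events are stored, not whether they
  // are produced.
  setRecording(recording: boolean): void {
    this.recording = recording;
  }

  push(event: T): void {
    if (!this.recording) return;
    this.events.push(event);
    // Drop the oldest event once the bound is exceeded.
    if (this.events.length > this.capacity) this.events.shift();
  }

  // Copy of the stored events, oldest first.
  snapshot(): T[] {
    return this.events.slice();
  }
}
```

Memory use is then bounded by the capacity times the size of the largest event, regardless of how long upstream has been running. `shift()` is linear in the number of stored events, which should be fine for a small bound; a ring buffer would avoid that cost for larger ones.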

rudolfs commented 3 years ago

@alexjg can we close this now that https://github.com/radicle-dev/radicle-upstream/pull/1758 has been merged?

alexjg commented 3 years ago

Yep