usagedb: "active channels" is not up-to-date

warner commented 6 years ago

We currently sample the count of active channels on a 5-minute periodic timer, and write the results into the usagedb. The munin plugin comes along (on a 5-minute cronjob) and samples the usagedb, putting the results into a graph.

A brief connection that gets sampled will appear to last for 5 minutes. A brief connection that doesn't get sampled won't appear to exist at all.

We could make the usagedb data more accurate by proactively updating the database each time a connection is made, and again when it is dropped. We'd still need the 5-minute timer to refresh the "updated" time (so a server crash, which stops updating the connection counts but also stops updating the "updated" timestamp, doesn't make it look like those connections are still established). But we wouldn't need to sample the connection counts during that timer.

This would increase the usagedb workload: two additional UPDATEs per connection. Currently we perform one INSERT per paired connection (when two peers find each other, we write the usage record after the first of them disconnect), or one INSERT per non-paired connection (when a peer fails the handshake process, or gives up without finding their peer), plus the 5-minute timer UPDATE (which in fact is currently a DELETE plus an INSERT).

OTOH, munin is only sampling every 5 minutes anyways, so this couldn't improve the accuracy of the presented data. It'd only help if you had some other sort of monitoring tool, which samples more frequently than the 5-minute timer.

warner commented 6 years ago

There's another reason to update the database upon each connection-shutdown: the incomplete_bytes counter will double-count traffic in the 5-minute sampling window just after shutdown.

The current.incomplete_bytes row is filled with the sum of the byte counts from all active connections. When both sides of a connection terminate, the byte count is written into a new row of the usage table (usage.total_bytes), but current.incomplete_bytes doesn't get modified until the next timer cycle. If munin samples the database between these two events, we'll see the "incomplete" count be artifically higher, because the same bytes will be included in both usage.total_bytes and the stale current.incomplete_bytes. The probability of this depends upon the phase relationship between munin's 5-minute cron-driven timer (e.g. 1:00pm, 1:05pm, 1:10pm, etc) and the server's internal 5-minute timer. The worst case is when the server is launched just after munin samples, e.g. at 1:00:01pm, and this double-counting will happen for nearly every connection. Best case is when it is launched just before the sample (e.g. 1:04:59pm), where it will almost never double-count.

warner commented 6 years ago

.. except, we already do that database update. recordUsage() calls _update_stats() before doing the db.commit. But the munin graph is showing a spike just when the connection closes. There's clearly something else going on here that I haven't figured out yet.

warner commented 6 years ago

ah, ok, there are two connection termination events. Whoever disconnects first causes the usage record to be written, but the list of active connections isn't touched until the second disconnect happens. As as result, the _update_stats() (written after the first disconnect) writes out both the usage record and the incomplete_bytes data. A moment later, when (usually) the second connection is dropped, the count gets removed from the in-memory active-connections list, but that doesn't touch the database until the next timer event.

One fix would be to not write the usage data until the second connection is dropped. A weaker fix would be to call _update_stats() after both connection-drop events: that would only double-count data if the second connection lingered for a while.

I'm a bit reluctant to delay the usage record until the second write, because I think there might be clients that forget to disconnect, and those lingering connections will persist until I restart the server for some other reason, and then the usage record won't get written at all.

I guess we should change _dump_stats() to only use incomplete_bytes for connections that have both sides connected.

magic-wormhole / magic-wormhole-transit-relay

usagedb: "active channels" is not up-to-date #4