hypercore-protocol / hyperdrive-daemon

Hyperdrive, batteries included.
MIT License
156 stars · 23 forks

Conversation: Which stats should we be logging? #40

Open andrewosh opened 4 years ago

andrewosh commented 4 years ago

@zootella mentioned in the last Dat meeting that we should have a conversation about:

  1. What kind of telemetry info should we collect?
  2. How do we make sure that telemetry info is informative while not leaking any sensitive info?
  3. Should we make the collected data available, and if so, how?

We currently have very rudimentary telemetry in the daemon, but we haven't yet had a conversation about exactly what things would both be appropriate to collect and useful for future optimization.

Currently we're reporting:

  1. The hashed daemon token (to track anonymized identity across restarts).
  2. The total number of hypercores that the corestore has in memory.
  3. The total number of peers that the swarm networker is connected to.

Ideally, we want to collect things like latency numbers as well, and perhaps other network-related stats.

What do y'all think? @mafintosh @pfrazee

pfrazee commented 4 years ago

Lots to dig into here. The only observation I have right now is that anomaly stats might be useful, things like “an operation took a long time under X condition.”

zootella commented 4 years ago

Seeing the histogram of how long operations take would be really interesting @pfrazee
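A latency histogram like the one suggested above could be sketched as follows; the bucket boundaries and function names here are made up for illustration, not part of the daemon:

```javascript
// Sketch: bucket operation durations (ms) into a fixed histogram.
const BUCKETS = [10, 50, 100, 500, 1000, 5000] // upper bounds in ms

function makeHistogram () {
  // One counter per bucket, plus a final overflow bucket for very slow ops.
  return new Array(BUCKETS.length + 1).fill(0)
}

function record (histogram, durationMs) {
  let i = BUCKETS.findIndex(bound => durationMs <= bound)
  if (i === -1) i = BUCKETS.length // slower than the largest bound
  histogram[i]++
}

const h = makeHistogram()
for (const ms of [5, 42, 42, 700, 9999]) record(h, ms)
// h → [1, 2, 0, 0, 1, 0, 1]
```

Fixed buckets keep the report to a handful of counters per interval, which is also easier to aggregate across nodes than raw timings.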

Telemetry can be useful at every level of the stack, I think, including all the way at the top: at the product level, where the user clicks to complete a task, then waits and either gets or doesn't get the desired result. How long did they wait for the result to start? To complete? What percent of the time does the user cancel or exit before a success or failure?
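The product-level questions above (time to first result, time to completion, cancel rate) could be aggregated with something like this sketch; all field and function names are assumptions for illustration, not an existing API:

```javascript
// Sketch: summarize per-attempt product metrics. Each attempt records
// whether the user bailed out, and for finished attempts, how long the
// first result and the full result took.
function summarize (attempts) {
  const finished = attempts.filter(a => a.outcome !== 'cancelled')
  const cancelled = attempts.length - finished.length
  const avg = xs => xs.reduce((sum, x) => sum + x, 0) / xs.length
  return {
    attempts: attempts.length,
    cancelRate: cancelled / attempts.length,
    avgMsToFirstResult: avg(finished.map(a => a.msToFirstResult)),
    avgMsToComplete: avg(finished.map(a => a.msToComplete))
  }
}

const stats = summarize([
  { outcome: 'success', msToFirstResult: 120, msToComplete: 900 },
  { outcome: 'success', msToFirstResult: 80, msToComplete: 1100 },
  { outcome: 'cancelled' }
])
// stats.cancelRate → 1/3, stats.avgMsToFirstResult → 100
```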

We're not at all interested in the behavior of our users or details about their data (unlike all the big centralized platforms). To collect stats with minimal privacy impact, we could have nodes report:

A goal at this super high level would be to measure how well the technology works as deployed, as real people and apps use it, with their real data and imperfect consumer hardware and internet connections. It would be great to be able to track user success version to version ("Upgrade now: we're measuring dat: links load twice as reliably as before!"). Changes beyond our control on the public internet will affect these metrics too, but that's not a bad thing: if some sudden or gradual external change (at a large ISP, in a Windows update) makes things start working twice or half as well, that's something we should know about.

da2x commented 4 years ago
andrewosh commented 4 years ago

Thanks all -- at the end of the day, we decided to drop the telemetry a few weeks back when the daemon moved out of the "beta" stage. Ultimately we'd prefer it to be opt-in, but in that case collecting large-scale stats (like IPv6 addresses @da2x) wouldn't make much sense, as we wouldn't get a large-scale picture.

Instead we've opted to ensure we don't log any keys in ~/.hyperdrive/log.json, so that we can collect those as necessary to fix issues. We're also periodically logging perf-related stats now, which will be useful for debugging. We're hoping certain aggregate stats can be pulled directly from the DHT (like an approximate DHT size).
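Keeping keys out of the log file could be done with a scrubbing step along these lines; the sensitive field names here are assumptions for illustration, not the daemon's actual log schema:

```javascript
// Sketch: redact key-like fields from a log entry before it is written,
// in the spirit of keeping keys out of ~/.hyperdrive/log.json.
const SENSITIVE = new Set(['key', 'discoveryKey', 'secretKey'])

function redact (entry) {
  const out = {}
  for (const [field, value] of Object.entries(entry)) {
    out[field] = SENSITIVE.has(field) ? '[redacted]' : value
  }
  return out
}

const safe = redact({
  msg: 'replicating drive',
  key: 'deadbeef',
  peers: 4
})
// safe → { msg: 'replicating drive', key: '[redacted]', peers: 4 }
```

Scrubbing at the logging boundary means later log collection for debugging can't accidentally pick up keys, regardless of what individual call sites pass in.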

@zootella Strongly agree with your last point, that we do need a way to measure if sudden/gradual changes have negative effects. Think for now we'll just rely on users opening issues, while making sure that the log files give us the info we need. Given that, I'll update the issue title to be "Which stats should we be logging?" as the suggestions here are equally relevant.