Open andrewosh opened 4 years ago
Lots to dig into here. The only observation I have right now is that anomaly stats might be useful: things like “an operation took long under X condition.”
Seeing the histogram of how long operations take would be really interesting @pfrazee
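To make the histogram and anomaly ideas concrete, here is a minimal sketch in Python. It is not taken from the daemon: the class name, the power-of-two bucket layout, the `slow_ms` threshold, and the `conditions` fields are all assumptions chosen for illustration.

```python
from collections import Counter


class DurationHistogram:
    """Counts operation durations in power-of-two millisecond buckets."""

    def __init__(self, slow_ms=1000.0):
        self.buckets = Counter()
        self.slow_ms = slow_ms  # hypothetical threshold for "took long"
        self.anomalies = []

    def record(self, op, duration_ms, **conditions):
        # bit_length of the whole-millisecond part gives the bucket index:
        # 0 -> [0, 1) ms, 1 -> [1, 2) ms, 2 -> [2, 4) ms, 3 -> [4, 8) ms, ...
        k = int(duration_ms).bit_length()
        self.buckets[(op, k)] += 1
        if duration_ms >= self.slow_ms:
            # Keep the surrounding conditions so a slow operation can be
            # reported as "took long under X condition".
            self.anomalies.append({"op": op, "ms": duration_ms, **conditions})

    def summary(self, op):
        # Label each bucket by its exclusive upper bound in milliseconds.
        return {
            f"<{2 ** k}ms": n
            for (o, k), n in sorted(self.buckets.items())
            if o == op
        }


h = DurationHistogram(slow_ms=500)
h.record("read", 0.4)
h.record("read", 5)
h.record("read", 6)
h.record("read", 900, peers=0)
print(h.summary("read"))
print(h.anomalies)
```

Only bucket counts and the flagged slow operations would need to be logged or reported, not every individual timing.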
Telemetry can be useful at every level of the stack, I think, including all the way at the top: the product level, where the user clicks to complete a task and then waits and gets (or doesn't get) the desired result. How long did they wait for the result to start? To complete? What percent of the time does the user cancel or exit before a success or failure?
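As a rough sketch of what those product-level numbers could look like, the Python below summarizes a batch of task records into a cancel rate and two mean wait times. The `TaskTiming` fields and the outcome labels are hypothetical, not something any existing app records today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskTiming:
    started: float                        # when the user clicked (seconds)
    first_result: Optional[float] = None  # when the result started arriving
    finished: Optional[float] = None      # when the task ended
    outcome: str = "pending"              # "success", "failure", or "cancelled"


def mean(values):
    return sum(values) / len(values) if values else None


def summarize(tasks):
    """Aggregate finished tasks; no per-user or per-file detail is kept."""
    done = [t for t in tasks if t.outcome != "pending"]
    if not done:
        return None
    cancelled = sum(1 for t in done if t.outcome == "cancelled")
    waits = [t.first_result - t.started for t in done if t.first_result is not None]
    totals = [
        t.finished - t.started
        for t in done
        if t.outcome == "success" and t.finished is not None
    ]
    return {
        "tasks": len(done),
        "cancel_rate": cancelled / len(done),
        "mean_time_to_first_result": mean(waits),
        "mean_time_to_complete": mean(totals),
    }


tasks = [
    TaskTiming(0.0, 1.0, 3.0, "success"),
    TaskTiming(0.0, 2.0, 5.0, "success"),
    TaskTiming(0.0, None, 4.0, "cancelled"),
    TaskTiming(0.0, 3.0, 6.0, "failure"),
]
print(summarize(tasks))
```

Comparing these aggregates between releases is one way to track user success version to version.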
We're not at all interested in the behavior of our users or details about their data (unlike all the big centralized platforms). To collect stats with minimal privacy impact we could have nodes report:
A goal at this super high level would be to measure how well the technology works deployed, as real people and apps use it, with their real data and imperfect consumer hardware and internet connections. It would be great to be able to track user success version to version (Upgrade now: we're measuring dat: links load twice as reliably as before!) Changes beyond our control on the public internet will affect these metrics also, but that's not a bad thing: if some sudden or gradual external change (with a large ISP, with a Windows Update) makes things start working twice or half as well, that's something we should know about.
Thanks all -- at the end of the day, we decided to drop the telemetry a few weeks back when the daemon moved out of the "beta" stage. Ultimately we'd prefer it to be opt-in, but in that case collecting large-scale stats (like IPv6 addresses @da2x) wouldn't make much sense, as we wouldn't get a large-scale picture.
Instead we've opted to ensure we don't log any keys in ~/.hyperdrive/log.json, so that we can collect those as necessary to fix issues. We're also periodically logging perf-related stats now, which will be useful for debugging. We're hoping certain aggregate stats can be pulled directly from the DHT (like an approximate DHT size).
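For illustration only, here is one way key redaction could be sketched in Python. It assumes keys show up as 64-character hex strings and that each log entry is a JSON object; neither the pattern nor the `log_line` helper comes from the daemon's actual logging code.

```python
import json
import re

# Assumption: a key is rendered as exactly 64 hex characters.
HEX_KEY = re.compile(r"\b[0-9a-fA-F]{64}\b")


def redact(value):
    """Recursively replace anything that looks like a key, in strings,
    dict keys, dict values, and list items."""
    if isinstance(value, str):
        return HEX_KEY.sub("[redacted-key]", value)
    if isinstance(value, dict):
        return {redact(k): redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def log_line(entry):
    # One JSON object per line, with keys scrubbed before serialization.
    return json.dumps(redact(entry), sort_keys=True)


key = "ab" * 32  # stand-in for a real key
print(log_line({
    "msg": f"opened dat://{key}/index.html",
    "stats": {"peers": 3, "drive": key},
}))
```

Scrubbing at the point where the line is written means a user can attach the log file to an issue without first checking it for keys.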
@zootella Strongly agree with your last point, that we do need a way to measure if sudden/gradual changes have negative effects. Think for now we'll just rely on users opening issues, while making sure that the log files give us the info we need. Given that, I'll update the issue title to be "Which stats should we be logging?" as the suggestions here are equally relevant.
@zootella mentioned in the last Dat meeting that we should have a conversation about:
We currently have very rudimentary telemetry in the daemon, but we haven't yet had a conversation about exactly what things would both be appropriate to collect and useful for future optimization.
Currently we're reporting:
Ideally, we want to collect things like latency numbers as well, and perhaps other network-related stats.
What do y'all think? @mafintosh @pfrazee