tsenart opened 5 years ago
/cc @felixfbecker @ijsnow @nicksnyder @keegancsmith
This looks interesting, on the self-hosted route: https://github.com/peardeck/prometheus-user-metrics
And on the managed options: https://raygun.com/platform/real-user-monitoring
If we were to do this, I think focusing on one or two metrics would be the best way to start (e.g. hover tooltip latency, search latency).
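To make this concrete, a minimal sketch of what capturing one such metric could look like: a generic timing wrapper around an async UI operation. Names like `timed` and `MetricSink` are hypothetical, not an existing Sourcegraph API.

```typescript
// Hypothetical helper: wrap an async operation (e.g. fetching hover content)
// and report its wall-clock latency to a pluggable sink.
type MetricSink = (name: string, durationMs: number) => void

async function timed<T>(name: string, sink: MetricSink, op: () => Promise<T>): Promise<T> {
    // performance.now() is available in browsers and in Node >= 16
    const start = performance.now()
    try {
        return await op()
    } finally {
        sink(name, performance.now() - start)
    }
}
```

Usage would be something like `timed('hover.latency', send, () => fetchHover(position))`, with the sink deciding where the sample goes (site-configurable, per the point above about not reporting data back automatically).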
One challenge is that our primary focus is performance at our customers, and we generally can't automatically report data back. It would be ideal if we could capture this data in a general way and allow sites to configure where to send it (or build it into the product).
Collecting this data on sourcegraph.com is useful too since performance there impacts how non-customers perceive Sourcegraph, but it does have unique performance characteristics that don't necessarily translate to enterprise (e.g. search index is disabled).
If this performance data is sent from the browser extension, wouldn't it apply to both sourcegraph.com and private installations?
I was assuming that we were talking about our web app, but yeah, we could theoretically track hover tooltip times from the browser extension (ideally bucketed by language) across public and private code. @dadlerj would you see any problems with this type of data collection?
We explicitly never track any user activity data from the browser extension, and we make bug reporting (Sentry) opt-in:
We would need to use the same user flow for performance tracking (adding another checkbox, or maybe just making that one more general). Even something as small as referrer URLs leaking (which typically include repo names, filenames, etc) would not be okay.
If we guarantee none of that information is leaking by design, could we have this be opt-out?
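One way to guarantee it by design could be an allowlist-style payload schema, where only numeric measurements and a fixed set of low-cardinality labels can ever leave the browser. A sketch (all names hypothetical):

```typescript
// Hypothetical sanitizer: only the fields listed here are forwarded, so
// referrer URLs, repo names, and file paths cannot leak by construction.
interface PerfEvent {
    metric: string // e.g. 'hover.latency'
    durationMs: number
    language?: string // low-cardinality label, e.g. 'go'
}

// Free-form strings are dropped unless they match a known language name.
const ALLOWED_LANGUAGES = new Set(['go', 'typescript', 'python', 'java'])

function sanitize(event: PerfEvent): PerfEvent {
    return {
        metric: event.metric,
        durationMs: event.durationMs,
        language:
            event.language !== undefined && ALLOWED_LANGUAGES.has(event.language)
                ? event.language
                : undefined,
    }
}
```

The design choice here is that nothing is scrubbed after the fact; anything not explicitly allowed simply never enters the payload.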
Then I'd personally be fine with it! It'd be a product/eng team question at that point @sqs @ijsnow
I like the idea!
How will we differentiate metrics derived from the page the browser extension is running in and the browser extension itself?
It seems to me that, with extensions, much of the measured time will be spent executing extension code (3rd party) rather than our own browser extension/extension host code. Have you considered that? Should we come up with an extension that implements all features in the extension API and run benchmarks against that instead? I'm concerned that the information we get from this won't be all that useful, since it will mostly reflect the extensions.
You are right, most of the data in the hover tooltips comes from extensions.
A given hover request might get data from multiple extensions, and it would be great to track hover tooltip load time per extension.
Mechanically, do we wait for all providers to return before showing the hover tooltip, or do we re-render as results are added?
I believe we re-render as more are added.
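A sketch of what that incremental behavior plus per-extension timing could look like (illustrative only, not the actual Sourcegraph implementation): each provider is awaited independently, the tooltip re-renders whenever one resolves, and each provider's latency is recorded separately so slow extensions can be identified.

```typescript
interface HoverResult {
    provider: string
    contents: string
}

// Hypothetical names throughout: collect hover results from several
// providers, re-rendering as each one returns and recording its latency.
async function collectHovers(
    providers: Record<string, () => Promise<string>>,
    render: (results: HoverResult[]) => void,
    recordLatency: (provider: string, ms: number) => void
): Promise<HoverResult[]> {
    const results: HoverResult[] = []
    await Promise.all(
        Object.entries(providers).map(async ([name, fetch]) => {
            const start = Date.now()
            const contents = await fetch()
            recordLatency(name, Date.now() - start)
            results.push({ provider: name, contents })
            render([...results]) // re-render as each provider returns
        })
    )
    return results
}
```

With this shape, the per-provider latencies answer "which extension is slow?" while the time until the first `render` call approximates what the user actually perceives.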
Please post an update or close this with an explanation if it is no longer relevant. This will be closed automatically if there is no more activity.
It's very much still relevant Mr. Stalebot.
@felixfbecker do you still feel a need for this? I haven't seen any explicit feedback from customers about speed for extension hovers in my time here so far, so I'm not sure whether we made some infra improvements along the way since this was created two years ago.
@sourcegraph/code-intel what do you think?
@Joelkw haven't seen that piece of feedback either so far. We can always reopen if necessary. I'll tag @efritz just in case he wants to add any historical context I might be missing.
We're not currently tracking any code intel latencies in telemetry, so I don't think there's any necessity for this on our side at the moment.
Sorry, how exactly can we know what users are experiencing if we don't capture these end metrics? We can't rely on people reporting things. Many will just quit Sourcegraph in frustration and never say anything about it. I think we need this data, absolutely. And not necessarily in pings, but in our monitoring and tracing infrastructure.
@tsenart, I would definitely be curious to see the data if we started collecting it, but right now we have to keep this at a priority level of "unless we're getting active feedback it's bad or we have other reason to believe it's bad (usage dropoff, our own manual tests, etc), creating the monitoring to ensure it isn't silently bad is low priority relative to things we have active signal are valuable" at least for the web team. Not opposed to collecting this data in the future or otherwise prioritizing if that information changes.
Would it help to triage issues as belonging to the extension/extension host rather than the extension code itself (the classic battle of "this looks like a code intel problem" by virtue of code intel extensions being enabled everywhere) if we had the latency of the entire interaction as well as the latency of the extension's requests/computation as a comparison?
Just discovered this from Sentry. We may want to use this feature since we're already using Sentry (but we'd need to upgrade our plan 😞).
Background
Real User (Performance) Monitoring telemetry extends our ability to understand the experience from the user's perspective by capturing browser performance metrics such as DNS latency, network transfer latency, DOM rendering, CSS repaints, garbage collection pauses, etc. The metrics available to us on the server side are not enough.
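For illustration, several of the metrics above can be derived from the browser's standard Navigation Timing attributes. The input field names below mirror the `PerformanceNavigationTiming` interface; the derived metric names are made up for this sketch.

```typescript
// Subset of the standard PerformanceNavigationTiming attributes we need.
interface NavTiming {
    domainLookupStart: number
    domainLookupEnd: number
    requestStart: number
    responseStart: number
    responseEnd: number
    domComplete: number
}

// Derive coarse RUM metrics from a navigation timing entry.
function rumMetrics(t: NavTiming): Record<string, number> {
    return {
        dnsMs: t.domainLookupEnd - t.domainLookupStart,
        ttfbMs: t.responseStart - t.requestStart, // time to first byte
        transferMs: t.responseEnd - t.responseStart,
        domMs: t.domComplete - t.responseEnd, // parse + render
    }
}
```

In the browser this would be fed from `performance.getEntriesByType('navigation')`; a RUM service essentially automates this collection plus aggregation.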
We already have Sentry to capture errors that happen in the browser extensions. We should look into integrating a RUM service to help us establish a real user performance baseline.
Once we have visibility into this baseline and the outliers, it'd be beneficial to come up with an internal SLO (Service Level Objective) for critical user operations (e.g. hover load time p99 < 1s).
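Checking such an SLO against a batch of latency samples is straightforward; a sketch using the simple nearest-rank percentile method (in practice a monitoring backend would compute this over histograms):

```typescript
// Nearest-rank percentile: p in (0, 100], samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
    const sorted = [...samplesMs].sort((a, b) => a - b)
    const rank = Math.ceil((p / 100) * sorted.length)
    return sorted[Math.max(0, rank - 1)]
}

// True if the pth-percentile latency is under the SLO threshold,
// e.g. meetsSlo(hoverSamples, 99, 1000) for "hover p99 < 1s".
function meetsSlo(samplesMs: number[], p: number, thresholdMs: number): boolean {
    return percentile(samplesMs, p) < thresholdMs
}
```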
Slow performance, especially in the first interactions, is a big deterrent to new users adopting Sourcegraph. Having an explicit SLO to measure and track is meant to manage user satisfaction with regard to performance.