gitpod-io / gitpod

The developer platform for on-demand cloud development environments to create software faster and more securely.
https://www.gitpod.io
GNU Affero General Public License v3.0

[code-browser] observability of extension installations and search #11608

Closed: akosyakov closed this issue 2 years ago

akosyakov commented 2 years ago

The OpenVSX proxy gives us some isolation against OpenVSX incidents. Unfortunately, we don't really know to what extent. We need analytics on the errors and latencies of extension installations and searches from the perspective of a user. VS Code already provides such telemetry; we need to use the Prometheus push gateway endpoint of the supervisor to observe it. We could start by counting errors on these operations, i.e. add a gitpod_code_extension_action_count metric with the following labels:

Later we should move reporting to the IDE proxy and push all errors there as well, for analytics in GCP Error Reporting, but that is blocked on https://github.com/gitpod-io/gitpod/issues/11134 right now.
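
To make the counting idea concrete, here is a minimal sketch of a labeled counter rendered in the Prometheus text exposition format. It is only an illustration: the label names `action` and `outcome` and the class `ExtensionActionCounter` are assumptions, not the labels proposed in this issue, and the outcome values are borrowed from the failure categories discussed in this thread. It keeps counts in memory and does not talk to the supervisor's push gateway endpoint.

```python
from collections import Counter

# Metric name taken from the issue; everything else here is hypothetical.
METRIC = "gitpod_code_extension_action_count"


class ExtensionActionCounter:
    """In-memory counter keyed by (action, outcome) label values."""

    def __init__(self):
        self._counts = Counter()

    def record(self, action, outcome):
        # action: e.g. "install" or "search" (assumed label values)
        # outcome: e.g. "success", "user-failure", "gitpod-failure"
        self._counts[(action, outcome)] += 1

    def render(self):
        # One line per label combination, in Prometheus text format:
        #   name{label="value",...} count
        lines = []
        for (action, outcome), value in sorted(self._counts.items()):
            lines.append(
                f'{METRIC}{{action="{action}",outcome="{outcome}"}} {value}'
            )
        return "\n".join(lines)


counter = ExtensionActionCounter()
counter.record("install", "success")
counter.record("install", "gitpod-failure")
counter.record("search", "success")
counter.record("search", "success")
print(counter.render())
```

In a real setup the rendered samples would be pushed through the supervisor rather than printed; how that endpoint is called is not specified in this thread, so it is left out here.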

jeanp413 commented 2 years ago

This looks redundant to add in VS Code to me :thinking:

> user-failure is invalid args, missing extensions and so on

We can track this in the OpenVSX proxy when resultCount is 0, but this looks more analytics-related than observability-related.

> gitpod-failure if OpenVSX or proxy is not responsive

If OpenVSX is down, we already have these metrics and the alert. Why do we need another metric?

> bugs on our side in VS Code

Can you give some examples? I don't see a reason why we would need to modify the extension query code in VS Code. If there's an upstream change, it should be caught in our Insiders build and during smoke tests; deploying VS Code with a breaking change does not make sense.

akosyakov commented 2 years ago

The point here is that we don't actually know whether the OpenVSX proxy is helping. The new alert only indicates that there are issues with OpenVSX. We need an alert that tells us whether there are issues on our side or whether the OpenVSX proxy is coping. I reassigned this to myself, since I'm on call this week.

jeanp413 commented 2 years ago

> The point here is that we don't actually know whether the OpenVSX proxy is helping.

Isn't that what the "served responses from backup cache" graph tells us when OpenVSX is down? If the proxy weren't helping, it wouldn't be returning any responses from the backup, since every query would be a cache miss. Or do you mean you don't trust the responses from the cache :thinking:?

akosyakov commented 2 years ago

> Isn't that what the "served responses from backup cache" graph tells us when OpenVSX is down?

You may get a response for one request while another fails, so the overall installation operation fails. We would like to understand the reliability of user operations. During the last incident the graph was showing 15%, but I could search and install; it failed very rarely. For a long time I could not understand why, until we figured out that the requests were not coming from VS Code at all. A graph which shows that users can search and install extensions 99% of the time will clearly communicate impact.
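
As a rough illustration of why request-level and operation-level numbers differ, the sketch below treats each user operation as a list of request results and counts an operation as successful only if all of its requests succeeded. The function names and the sample data are made up for the example; this is not how the proxy or VS Code computes anything today.

```python
def request_success_ratio(operations):
    """Share of individual requests that succeeded, or None if there are none."""
    results = [ok for op in operations for ok in op]
    if not results:
        return None
    return sum(results) / len(results)


def operation_success_ratio(operations):
    """Share of user operations in which every request succeeded."""
    if not operations:
        return None
    # all() is True only when no request in the operation failed.
    succeeded = sum(1 for op in operations if all(op))
    return succeeded / len(operations)


# Hypothetical data: four install operations, each made of several requests.
installs = [
    [True, True, True],
    [True, False, True],  # one failed request fails the whole install
    [True, True],
    [True],
]

print("request-level:", request_success_ratio(installs))
print("operation-level:", operation_success_ratio(installs))
```

With this sample, 8 of 9 requests succeed but only 3 of 4 installs do, so the request-level graph looks healthier than what users actually experience. The reverse can also happen, as in the incident described above, when failing requests do not belong to any user operation.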