Closed akosyakov closed 2 years ago
This looks redundant to add in VS Code, to me :thinking:
user-failure is invalid args, missing extensions and so on
We can track this in the OpenVSX proxy when resultCount is 0, but this looks more analytics-related than observability-related.
gitpod-failure if OpenVSX or proxy is not responsive
If OpenVSX is down, we already have these metrics and the alert. Why do we need another metric?
bugs on our side in VS Code
Can you give some examples? I don't see a reason why we would need to modify the extension query code in VS Code. If there is some upstream change, it should be caught in our insiders build and during the smoke test; deploying VS Code with a breaking change does not make sense.
The point here is that we don't actually know that the OpenVSX proxy is helping. The new alert only indicates that there are issues with OpenVSX. We need an alert that tells us whether there are issues on our side or whether the OpenVSX proxy is coping. I reassigned this to myself, since I'm on call this week.
The point here is that we don't actually know that the OpenVSX proxy is helping.
Isn't that what the "served responses from backup cache" graph tells us when OpenVSX is down? If the proxy wasn't helping, it wouldn't be returning any responses from backup, as all queries would be cache misses. Or do you mean you don't trust the responses from the cache :thinking:?
Isn't that what the "served responses from backup cache" graph tells us when OpenVSX is down?
You may get a response for one request while another fails, so the installation operation as a whole fails. We would like to understand the reliability of user operations. During the last incident it was showing 15%, but I could search and install; failures were very rare. For a long time I could not understand why, until we figured out that the requests were not from VS Code at all. A graph showing that users can search and install extensions 99% of the time would clearly communicate impact.
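To illustrate why per-request numbers and per-operation numbers can diverge, here is a minimal sketch. The `Request` type, the `operation_id` field, and the sample data are all hypothetical and only serve the illustration; they are not taken from the proxy or from VS Code telemetry.

```python
from dataclasses import dataclass


@dataclass
class Request:
    operation_id: str  # hypothetical: groups the requests of one user action
    ok: bool           # whether this single request succeeded


def request_success_rate(requests):
    """Share of individual requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def operation_success_rate(requests):
    """Share of user operations in which every request succeeded."""
    outcome = {}
    for r in requests:
        # One failed request marks the whole operation as failed.
        outcome[r.operation_id] = outcome.get(r.operation_id, True) and r.ok
    return sum(outcome.values()) / len(outcome)


# Made-up sample: two installs and one search, two requests each.
requests = [
    Request("install-1", True), Request("install-1", True),
    Request("install-2", True), Request("install-2", False),
    Request("search-1", True), Request("search-1", True),
]
```

With this sample, five of six requests succeed, but only two of three operations do, which is the number a user actually experiences.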
The OpenVSX proxy provides us with some isolation against OpenVSX incidents. Unfortunately, we don't really know to what extent. We need analytics on errors and latencies of extension installation and search from the perspective of a user. VS Code already provides such telemetry; we need to use the Prometheus push gateway endpoint of the supervisor to observe it. We could start by counting errors on these operations, i.e. add a gitpod_code_extension_action_count metric with the following labels:
install | search
user-failure is invalid args, missing extensions and so on
gitpod-failure if OpenVSX or proxy is not responsive or bugs on our side in VS Code
Later we should move reporting to the IDE proxy and push all errors there as well for analytics in GCP error reporting, but that is blocked on https://github.com/gitpod-io/gitpod/issues/11134 right now.
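A minimal sketch of how such a labeled counter could be aggregated into a user-facing reliability number. This is plain Python for illustration, not the supervisor's or VS Code's actual code. The grouping into an action and a status label, and the `success` status value, are assumptions; the issue only names install, search, user-failure and gitpod-failure.

```python
from collections import Counter

ACTIONS = {"install", "search"}
# "success" is an assumed value; the issue only lists the two failure kinds.
STATUSES = {"success", "user-failure", "gitpod-failure"}


class ExtensionActionCounter:
    """In-memory stand-in for gitpod_code_extension_action_count."""

    def __init__(self):
        self._counts = Counter()

    def inc(self, action, status):
        # Reject unknown label values so that label cardinality stays bounded.
        if action not in ACTIONS or status not in STATUSES:
            raise ValueError(f"unknown labels: {action!r}, {status!r}")
        self._counts[(action, status)] += 1

    def get(self, action, status):
        return self._counts[(action, status)]

    def reliability(self, action):
        """success / (success + gitpod-failure), or None without data.

        user-failure is left out because it does not indicate a problem
        on our side.
        """
        ok = self.get(action, "success")
        bad = self.get(action, "gitpod-failure")
        total = ok + bad
        return None if total == 0 else ok / total
```

Leaving user-failure out of the ratio is a design choice of this sketch: invalid args or missing extensions would otherwise lower the number without any incident on the Gitpod side.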