gitpod-io / gitpod

The developer platform for on-demand cloud development environments to create software faster and more securely.
https://www.gitpod.io

`server` stops sending events to `supervisor` occasionally #7054

Closed corneliusludmann closed 2 years ago

corneliusludmann commented 2 years ago

We observed in production that from time to time server stops sending events (instance updates) to supervisor.

We see this in the following symptoms:

I was not able to reproduce this in a core dev preview env.

Possible Cause

This could be because of this change: https://github.com/gitpod-io/gitpod/pull/6748

Impact

The ports-get-stuck issue is pretty serious since it looks to our users as if the port is not ready (when in fact it is exposed). They also don't get the port URL and the buttons to make the port public. On the IDE team, we are investigating a mitigation that asks for an update after a port expose in case we don't see an update after some time (e.g. 1 minute). However, that doesn't fix the actual cause of this problem.
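
For illustration, here is a minimal sketch (in Go, since supervisor is written in Go) of what such a timeout-based fallback could look like. All names (`instanceUpdate`, `awaitPortExposure`, `refetch`) are invented for this example and are not the actual supervisor API:

```go
// Hypothetical sketch of the mitigation described above: after requesting a
// port exposure, wait for the matching instance update; if none arrives
// within a deadline, explicitly re-fetch the state instead of waiting forever.
package main

import (
	"context"
	"log"
	"time"
)

// instanceUpdate stands in for the instance-update payload supervisor
// receives from server via its subscription.
type instanceUpdate struct {
	exposedPorts map[uint32]bool
}

func awaitPortExposure(ctx context.Context, port uint32, timeout time.Duration,
	updates <-chan instanceUpdate, refetch func(context.Context) (instanceUpdate, error)) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case u := <-updates:
			if u.exposedPorts[port] {
				return nil // the expected update arrived
			}
		case <-timer.C:
			// No update seen in time: fall back to an explicit fetch so a
			// lost event does not leave the port stuck in "detecting".
			u, err := refetch(ctx)
			if err != nil {
				return err
			}
			if u.exposedPorts[port] {
				return nil
			}
			log.Printf("port %d still not exposed after explicit re-fetch", port)
			timer.Reset(timeout)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	updates := make(chan instanceUpdate) // no update ever arrives in this demo
	refetch := func(context.Context) (instanceUpdate, error) {
		// Simulate server confirming the port on an explicit fetch.
		return instanceUpdate{exposedPorts: map[uint32]bool{8080: true}}, nil
	}
	if err := awaitPortExposure(context.Background(), 8080, 2*time.Second, updates, refetch); err != nil {
		log.Fatal(err)
	}
	log.Println("port 8080 confirmed exposed via fallback fetch")
}
```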

geropl commented 2 years ago

/schedule

roboquat commented 2 years ago

@geropl: Issue scheduled in the meta team (WIP: 0)

In response to [this](https://github.com/gitpod-io/gitpod/issues/7054#issuecomment-986515394):

> /schedule

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

geropl commented 2 years ago

/assign

geropl commented 2 years ago

Talked to @corneliusludmann :

Continue investigation with one example from @corneliusludmann from this morning: brown-panther-rxvqz0uw

geropl commented 2 years ago

@corneliusludmann Looks like a workspace issue. Will ask them to look into it as I have no clue how to look into k3s tracing/logs. :slightly_smiling_face:

csweichel commented 2 years ago

This is a workspace issue indeed. Will work on a fix.

/assign

csweichel commented 2 years ago

#7079 does not fix this issue entirely. We correctly update the workspace instance status, but the port view is still stuck.

geropl commented 2 years ago

/team meta

Taking over again and trying to reproduce.

geropl commented 2 years ago

@csweichel @corneliusludmann What's a good way to reproduce? Maybe in devstaging? Started ~10 workspaces in prod and don't see it :eyes:

corneliusludmann commented 2 years ago

Unfortunately, I haven't found a way to reproduce this. :disappointed: It only happens in prod from time to time.

geropl commented 2 years ago

Ha, got lucky now! :four_leaf_clover:

geropl commented 2 years ago

From my testing it looks like - now that the workspace issue is fixed - this might be related to reconnects. I can now "fix" this by opening another port: if that succeeds, the update for the first port arrives as well, and resolves the sticky "detecting...".

I see ws-manager emits the correct event now. With honeycomb this would be instantly solvable... We have the traces now, but Jaeger cannot filter nested structs :cry:

geropl commented 2 years ago

Will pause work here and wait until we have traces in honeycomb (tomorrow).

akosyakov commented 2 years ago

> Will pause work here and wait until we have traces in honeycomb (tomorrow).

Do you think it makes sense to install a mitigation in supervisor for the future anyway?

geropl commented 2 years ago

> Do you think it makes sense to install a mitigation in supervisor for the future anyway?

Not so much for this specific case. But if there is a general solution to be more robust against API failures that'd be interesting IMO.

geropl commented 2 years ago

Using honeycomb this morning to query failing openPort calls:

@csweichel the log entry above shows `Error: 2 UNKNOWN: cannot update workspace pod: Operation cannot be fulfilled on pods "ws-0c0c12f6-e21e-44c3-8f85-91782097bf72": the object has been modified; please apply your changes to the latest version and try again`. Is the workspace fix already deployed to prod?
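
For context, the quoted error is Kubernetes' optimistic-concurrency conflict on the pod object. A common remedy (shown here only as an illustrative sketch, not necessarily what the actual ws-manager fix does) is to wrap the mutation in client-go's `retry.RetryOnConflict`, re-reading the object and re-applying the change on conflict; `updatePodAnnotation` is a made-up helper:

```go
// Illustrative only: the standard client-go pattern for the "object has been
// modified" conflict quoted above.
package wsmanagerexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func updatePodAnnotation(ctx context.Context, client kubernetes.Interface, namespace, name, key, value string) error {
	// RetryOnConflict re-runs the closure (with backoff) whenever the
	// apiserver rejects the update with a resource-version conflict.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations[key] = value
		_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```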

csweichel commented 2 years ago

> @csweichel the log entry above shows `Error: 2 UNKNOWN: cannot update workspace pod: Operation cannot be fulfilled on pods "ws-0c0c12f6-e21e-44c3-8f85-91782097bf72": the object has been modified; please apply your changes to the latest version and try again`. Is the workspace fix already deployed to prod?

Not yet, we're about to shift traffic to the new clusters today

akosyakov commented 2 years ago

> Not so much for this specific case. But if there is a general solution to be more robust against API failures that'd be interesting IMO.

We did not observe API failures in supervisor but missing events. We could try to force a sync on reconnect and/or if a port is not exposed within 1 minute. But it makes things complex. I see 2 issues:

geropl commented 2 years ago

@akosyakov I see plenty of failed openPort requests in the logs, which cause no port to be opened and never trigger an update at all. Does supervisor log those errors? Is maybe something broken w.r.t. error handling/transmission? Does it retry on error?

I will try to reproduce it now and have a concrete example with traces.

akosyakov commented 2 years ago

Yes, I think it should log them in the following places:

geropl commented 2 years ago

Reproduced with another one of my workspaces (yellow-weasel-1s0rp7u2/dd931bcb-5aaf-4a25-a514-83744ee973d2). I tried to open ports in a row until it happened (8331, 8332, ..., 8334, 8335). Port 8334 was the last working one (13:56:46.000 CET), 8335 the broken one (13:58:52.000 CET).

Findings:

The number of forwarded clients directly relates to how many clients we (think we) have. So this really is about websocket connections, how we do bookkeeping in `server`, and what we do on reconnect (e.g., not re-sending the latest update). Have to quit for today; will prepare a PR tomorrow/later tonight.
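
To make that idea concrete, here is a hedged sketch of "remember the last update and replay it on (re)connect". The real `server` component is TypeScript; Go is used only to keep all examples in this thread in one language, and every name here is invented for illustration:

```go
// Sketch: remember the last update per workspace instance and replay it to a
// client as soon as it (re)connects, so a reconnect cannot silently miss an
// event that happened while the client was away.
package main

import (
	"fmt"
	"sync"
)

type InstanceUpdate struct {
	InstanceID string
	Ports      map[uint32]string // port -> exposure URL
}

type UpdateHub struct {
	mu          sync.Mutex
	lastUpdate  map[string]InstanceUpdate                   // instanceID -> last update seen
	subscribers map[string]map[chan InstanceUpdate]struct{} // instanceID -> connected clients
}

func NewUpdateHub() *UpdateHub {
	return &UpdateHub{
		lastUpdate:  map[string]InstanceUpdate{},
		subscribers: map[string]map[chan InstanceUpdate]struct{}{},
	}
}

// Subscribe registers a (re)connected client and immediately replays the
// last known update so the client never starts from a stale view.
func (h *UpdateHub) Subscribe(instanceID string) chan InstanceUpdate {
	ch := make(chan InstanceUpdate, 8)
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subscribers[instanceID] == nil {
		h.subscribers[instanceID] = map[chan InstanceUpdate]struct{}{}
	}
	h.subscribers[instanceID][ch] = struct{}{}
	if last, ok := h.lastUpdate[instanceID]; ok {
		ch <- last // replay on (re)connect
	}
	return ch
}

// Publish records the update and fans it out to all currently connected clients.
func (h *UpdateHub) Publish(u InstanceUpdate) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastUpdate[u.InstanceID] = u
	for ch := range h.subscribers[u.InstanceID] {
		select {
		case ch <- u:
		default: // drop rather than block on a slow client
		}
	}
}

func main() {
	hub := NewUpdateHub()
	hub.Publish(InstanceUpdate{InstanceID: "i1", Ports: map[uint32]string{8335: "https://8335-example.invalid"}})
	ch := hub.Subscribe("i1")       // client (re)connects after the update happened
	fmt.Println((<-ch).Ports[8335]) // ...and still sees the latest state
}
```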

akosyakov commented 2 years ago

supervisor received no update for the last call: https://cloudlogging.app.goo.gl/MFq6zLwEcPrQoohN7

Also there are no errors/warnings logged in supervisor?

geropl commented 2 years ago

> Also there are no errors/warnings logged in supervisor?

Exactly. The traces I found are not super-helpful (because we cannot identify clients, and thus don't know for sure which one is connected or not; a good idea for the future would be to include identifying info in the User-Agent or other headers).
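
As a hedged illustration of that header idea (not what any Gitpod component actually does today): a client could identify itself when dialing the server's websocket API, so traces and logs can tell clients apart. The URL and User-Agent value below are made up; only the gorilla/websocket call itself is real:

```go
// Sketch: dial a websocket endpoint with an identifying User-Agent header.
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

func main() {
	header := http.Header{}
	// Purely illustrative value; the point is that it names the client and its version.
	header.Set("User-Agent", "gitpod/supervisor-example 0.0.1")

	conn, _, err := websocket.DefaultDialer.Dial("wss://example.invalid/api/v1", header)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	log.Println("connected with an identifying User-Agent")
}
```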

akosyakov commented 2 years ago

> a good idea for the future would be to include identifying info in the User-Agent or other headers

https://github.com/gitpod-io/gitpod/issues/7166

geropl commented 2 years ago

@akosyakov @corneliusludmann Since https://github.com/gitpod-io/ops/pull/406 has been applied to prod clusters (yesterday, 16.12) I cannot reproduce this issue. I will close this for now. It would be awesome if one of you could cross-verify nonetheless. :slightly_smiling_face:

akosyakov commented 2 years ago

@geropl Just try to start as many ports as you can and see whether it happens? @iQQBot Could you try? If it does not happen then we should revert https://github.com/gitpod-io/gitpod/pull/7083 and close https://github.com/gitpod-io/gitpod/pull/7058 + https://github.com/gitpod-io/gitpod/issues/6778

geropl commented 2 years ago

I do it with `curl lama.sh | LAMA_PORT=8331 sh` and increment the port with every call.
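
If someone wants to reproduce without lama.sh, a rough stand-in (an assumption, not the actual script) is any process that opens listeners on consecutive ports, one at a time, so each newly detected port triggers its own expose/update round trip:

```go
// Rough stand-in for the lama.sh loop above: open HTTP listeners on
// consecutive ports with a pause between them. Port range and delay are arbitrary.
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	for port := 8331; port <= 8340; port++ {
		p := port
		go func() {
			addr := fmt.Sprintf(":%d", p)
			log.Printf("listening on %s", addr)
			err := http.ListenAndServe(addr, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
				fmt.Fprintf(w, "hello from port %d\n", p)
			}))
			log.Printf("listener on %s stopped: %v", addr, err)
		}()
		// Give each port time to show up (or get stuck in "detecting")
		// before opening the next one.
		time.Sleep(30 * time.Second)
	}
	select {} // keep all listeners alive
}
```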

corneliusludmann commented 2 years ago

@geropl @akosyakov Happens to me right now again for workspace tomato-buzzard-nqdmbjxe and port 1234.

geropl commented 2 years ago

@corneliusludmann We don't have any (relevant) tracing data for tomato-buzzard-nqdmbjxe :cry:

Will try to reproduce again...

geropl commented 2 years ago

I cannot reproduce this. :confused:

During the last two weeks we have seen one kind of openPort-related error in the traces: `operation not permitted: missing update permission on workspaceInstance` (example trace).

As this has been dragging on for rather long, has resulted in a couple of fixes/improvements, but is still hard to pinpoint, I suggest the following:

@akosyakov @corneliusludmann D'accord? (please re-open if not)

akosyakov commented 2 years ago

I think @jeanp413 mentioned it yesterday.

corneliusludmann commented 2 years ago

@geropl For me, it still happens from time to time that ports get stuck in "detecting". Right now it's workspace copper-fox-anmqze3a and port 8080 that is affected.

jeanp413 commented 2 years ago

Yeah, it happened to me last week (or the week before that). Even restarting the workspace (force stop and start it again) didn't fix it; eventually it fixed itself after the workspace had been stopped for some time.

geropl commented 2 years ago

> Yeah, it happened to me last week (or the week before that)

Sorry, I don't see how I could act upon these "anecdotal" hints. :confused: I at least need a workspaceId/instanceId and a time range of ~1h to find sth.

> For me, it still happens from time to time that ports get stuck in "detecting". Right now it's workspace copper-fox-anmqze3a and port 8080 that is affected.

Besides other (maybe related?) errors regarding rate-limiter and "missing update permission" I don't see anything regarding this in either traces or logging (logs). :shrug:

Again, I don't want to play this down, nor avoid investigation or anything, but IMO we've sunk enough time into this already with manual testing and fiddling around. The only way forward here is to have a bigger dataset from traces or logs from the caller side (supervisor).

geropl commented 2 years ago

Re-opening because of new reports:

akosyakov commented 2 years ago

While working on the JetBrains Gateway plugin I also noticed that events are very often delayed and lost. That might, though, be affected by the performance of the preview environments themselves.

geropl commented 2 years ago

> very often delayed and lost

@akosyakov I think it makes sense if we meet again (soon) and come up with a strategy to pin down the root cause. The data from `server` (cf. #7550) suggests that client and server do not agree on who is connected and who is not.

jeanp413 commented 2 years ago

@geropl FYI I got this workspace https://gold-tuna-qx6w1yza.ws-us25.gitpod.io/ right now and it can't detect the running ports

geropl commented 2 years ago

@jeanp413 I can find the trace for your case, but cannot see anything wrong. :upside_down_face: I'm preparing a PR to get more insights from the client side. Also, we will get more conclusive insights to (maybe) rule out server behavior tomorrow.

geropl commented 2 years ago

Assigning to team IDE for the work on PR: https://github.com/gitpod-io/gitpod/pull/7574

Gaberoids commented 2 years ago

What's the current workaround for this?

When I used to get "8000 detecting..." I would "stop" the workspace and "open" it again. This would get port 8000 working fine.

However, now (around Jan 11th, 12th, 13th), when I stop and then open the workspace, I don't get any port opened at all, and "8000 detecting..." doesn't show anymore either.

Multiple workspaces are showing this behavior:

It seems like opening a new workspace from GitHub fixes the issue, but if there is a way to resurrect the ports in the current workspaces mentioned above, that would be great.

Thank you

geropl commented 2 years ago

> What's the current workaround for this?

@Gaberoids I used to open/close another port; this fixed it for me.

> However, now, when I stop and then open the workspace, I don't get any port opened at all,

What do you mean by "now"? "Since today", "since last week", "right now"? (Trying to pinpoint a release.) Thanks!

Gaberoids commented 2 years ago

> What's the current workaround for this?
>
> @Gaberoids I used to open/close another port; this fixed it for me.
>
> However, now, when I stop and then open the workspace, I don't get any port opened at all,
>
> What do you mean by "now"? "Since today", "since last week", "right now"? (Trying to pinpoint a release.) Thanks!

I found your workaround useful. I flipped a port to "private" and it made the pending port 8000 work just fine. This workaround is useful when there are ports to manipulate. "Now" (around Jan 11th, 12th, 13th), when I stop and then open the workspace, sometimes I don't get any port opened at all, and "8000 detecting..." doesn't show anymore either.

Since this issue has been closed with a fix, I am assuming it was not related to my problem, since my problem has not been fixed. For this reason I am using issue #7158 to continue describing it.

New finding about how to reproduce the issue: I am fairly confident that my issue is reproducible when my workspace closes for inactivity. This explains why the issue was intermittent. For example:

0- Open the workspace:

1- On the terminal, type: `python3 manage.py runserver`

2- Let the workspace shut down due to inactivity.

3- On the same browser tab where step 2 happened, click the button to reopen the workspace or reload the page.

4- Run step 1 again.

5- Unexpected: I see the port status as "8000 detecting..."; port 8000 does not work and the site cannot be opened in a new tab, OR no ports are showing in the "Remote Explorer" view (see image below). Expected: I should be able to open the site using port 8000.

...

iQQBot commented 2 years ago

@Gaberoids Your problem is related to https://github.com/gitpod-io/gitpod/pull/7621, but it hasn't been deployed to production yet. Please wait for that deployment.

iQQBot commented 2 years ago

Also, please don't post workspace URLs with share enabled in public, as this will leak your Git information.

iQQBot commented 2 years ago

@Gaberoids This commit is deployed; please try it again. Thanks!