gitpod-io / gitpod

The developer platform for on-demand cloud development environments to create software faster and more securely.
https://www.gitpod.io

`server` stops sending events to `supervisor` occasionally #7054

Closed corneliusludmann closed 2 years ago

corneliusludmann commented 2 years ago

We observed in production that from time to time server stops sending events (instance updates) to supervisor.

We see this in the following symptoms:

I was not able to reproduce this in a core dev preview env.

Possible Cause

This could be because of this change: https://github.com/gitpod-io/gitpod/pull/6748

Impact

The ports-get-stuck issue is pretty serious since it looks to our users as if the port is not ready (when in fact it is exposed). They also don't get the port URL and the buttons to make the port public. On the IDE team, we are investigating a mitigation that asks for an update after a port expose in case we don't see an update after some time (e.g. 1 minute). However, that doesn't fix the actual cause of this problem.
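
For illustration, here is a minimal sketch (in Go, since supervisor is written in Go) of what such a timeout-based fallback could look like. All names (`instanceUpdate`, `awaitPortExposure`, `refetch`) are invented for this example and are not the actual supervisor API:

```go
// Hypothetical sketch of the mitigation described above: after requesting a
// port exposure, wait for the matching instance update; if none arrives
// within a deadline, explicitly re-fetch the state instead of waiting forever.
package main

import (
	"context"
	"log"
	"time"
)

// instanceUpdate stands in for the instance-update payload supervisor
// receives from server via its subscription.
type instanceUpdate struct {
	exposedPorts map[uint32]bool
}

func awaitPortExposure(ctx context.Context, port uint32, timeout time.Duration,
	updates <-chan instanceUpdate, refetch func(context.Context) (instanceUpdate, error)) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case u := <-updates:
			if u.exposedPorts[port] {
				return nil // the expected update arrived
			}
		case <-timer.C:
			// No update seen in time: fall back to an explicit fetch so a
			// lost event does not leave the port stuck in "detecting".
			u, err := refetch(ctx)
			if err != nil {
				return err
			}
			if u.exposedPorts[port] {
				return nil
			}
			log.Printf("port %d still not exposed after explicit re-fetch", port)
			timer.Reset(timeout)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	updates := make(chan instanceUpdate) // no update ever arrives in this demo
	refetch := func(context.Context) (instanceUpdate, error) {
		// Simulate server confirming the port on an explicit fetch.
		return instanceUpdate{exposedPorts: map[uint32]bool{8080: true}}, nil
	}
	if err := awaitPortExposure(context.Background(), 8080, 2*time.Second, updates, refetch); err != nil {
		log.Fatal(err)
	}
	log.Println("port 8080 confirmed exposed via fallback fetch")
}
```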

geropl commented 2 years ago

/schedule

roboquat commented 2 years ago

@geropl: Issue scheduled in the meta team (WIP: 0)

In response to [this](https://github.com/gitpod-io/gitpod/issues/7054#issuecomment-986515394):

> /schedule

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

geropl commented 2 years ago

/assign

geropl commented 2 years ago

Talked to @corneliusludmann :

Continue investigation with one example from @corneliusludmann from this morning: brown-panther-rxvqz0uw

geropl commented 2 years ago

@corneliusludmann Looks like a workspace issue. Will ask them to look into it as I have no clue how to look into k3s tracing/logs. :slightly_smiling_face:

csweichel commented 2 years ago

This is a workspace issue indeed. Will work on a fix.

/assign

csweichel commented 2 years ago

#7079 does not fix this issue entirely. We correctly update the workspace instance status, but the port view is still stuck.

geropl commented 2 years ago

/team meta

Taking over again and trying to reproduce.

geropl commented 2 years ago

@csweichel @corneliusludmann What's a good way to reproduce? Maybe in devstaging? Started ~10 workspaces in prod and don't see it :eyes:

corneliusludmann commented 2 years ago

Unfortunately, I haven't found a way to reproduce this. :disappointed: It only happens in prod from time to time.

geropl commented 2 years ago

Ha, got lucky now! :four_leaf_clover:

geropl commented 2 years ago

From my testing it looks like - now that the workspace issue is fixed - this might be related to reconnects. I can now "fix" this by opening another port: if that succeeds, the update for the first port arrives as well, and resolves the sticky "detecting...".

I see ws-manager emits the correct event now. With honeycomb this would be instantly solvable... We have the traces now, but Jaeger cannot filter nested structs :cry:

geropl commented 2 years ago

Will pause work here and wait until we have traces in honeycomb (tomorrow).

akosyakov commented 2 years ago

> Will pause work here and wait until we have traces in honeycomb (tomorrow).

Do you think it makes sense to install a mitigation in supervisor for the future anyway?

geropl commented 2 years ago

> Do you think it makes sense to install a mitigation in supervisor for the future anyway?

Not so much for this specific case. But if there is a general solution to be more robust against API failures that'd be interesting IMO.

geropl commented 2 years ago

Using honeycomb this morning to query failing openPort calls:

@csweichel the log entry above shows `Error: 2 UNKNOWN: cannot update workspace pod: Operation cannot be fulfilled on pods "ws-0c0c12f6-e21e-44c3-8f85-91782097bf72": the object has been modified; please apply your changes to the latest version and try again`. Is the workspace fix already deployed to prod?
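
For context, the quoted error is Kubernetes' optimistic-concurrency conflict on the pod object. A common remedy (shown here only as an illustrative sketch, not necessarily what the actual ws-manager fix does) is to wrap the mutation in client-go's `retry.RetryOnConflict`, re-reading the object and re-applying the change on conflict; `updatePodAnnotation` is a made-up helper:

```go
// Illustrative only: the standard client-go pattern for the "object has been
// modified" conflict quoted above.
package wsmanagerexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func updatePodAnnotation(ctx context.Context, client kubernetes.Interface, namespace, name, key, value string) error {
	// RetryOnConflict re-runs the closure (with backoff) whenever the
	// apiserver rejects the update with a resource-version conflict.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations[key] = value
		_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```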

csweichel commented 2 years ago

> @csweichel the log entry above shows `Error: 2 UNKNOWN: cannot update workspace pod: Operation cannot be fulfilled on pods "ws-0c0c12f6-e21e-44c3-8f85-91782097bf72": the object has been modified; please apply your changes to the latest version and try again`. Is the workspace fix already deployed to prod?

Not yet, we're about to shift traffic to the new clusters today

akosyakov commented 2 years ago

> Not so much for this specific case. But if there is a general solution to be more robust against API failures that'd be interesting IMO.

We did not observe API failures in supervisor but missing events. We could try to force a sync on reconnect and/or if a port is not exposed within 1 minute. But it makes things complex. I see 2 issues:

geropl commented 2 years ago

@akosyakov I see plenty of failed openPort requests in the logs, which cause no port to be opened and never trigger an update at all. Does supervisor log those errors? Is maybe something broken w.r.t. error handling/transmission? Does it retry on error?

I will try to reproduce it now and have a concrete example with traces.

akosyakov commented 2 years ago

Yes, I think it should log them in the following places:

geropl commented 2 years ago

Reproduced with another one of my workspaces (yellow-weasel-1s0rp7u2/dd931bcb-5aaf-4a25-a514-83744ee973d2). I tried to open ports in a row until it happened (8331, 8332, ..., 8334, 8335). Port 8334 was the last working one (13:56:46.000 CET), 8335 the broken one (13:58:52.000 CET).

Findings:

The number of forwarded clients directly relates to how many clients we (think we) have. So this really is about websocket connections, how we do bookkeeping in `server`, and what we do on reconnect (e.g., not re-sending the latest update). Have to quit for today; will prepare a PR tomorrow/later tonight.
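
To make that idea concrete, here is a hedged sketch of "remember the last update and replay it on (re)connect". The real `server` component is TypeScript; Go is used only to keep all examples in this thread in one language, and every name here is invented for illustration:

```go
// Sketch: remember the last update per workspace instance and replay it to a
// client as soon as it (re)connects, so a reconnect cannot silently miss an
// event that happened while the client was away.
package main

import (
	"fmt"
	"sync"
)

type InstanceUpdate struct {
	InstanceID string
	Ports      map[uint32]string // port -> exposure URL
}

type UpdateHub struct {
	mu          sync.Mutex
	lastUpdate  map[string]InstanceUpdate                   // instanceID -> last update seen
	subscribers map[string]map[chan InstanceUpdate]struct{} // instanceID -> connected clients
}

func NewUpdateHub() *UpdateHub {
	return &UpdateHub{
		lastUpdate:  map[string]InstanceUpdate{},
		subscribers: map[string]map[chan InstanceUpdate]struct{}{},
	}
}

// Subscribe registers a (re)connected client and immediately replays the
// last known update so the client never starts from a stale view.
func (h *UpdateHub) Subscribe(instanceID string) chan InstanceUpdate {
	ch := make(chan InstanceUpdate, 8)
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subscribers[instanceID] == nil {
		h.subscribers[instanceID] = map[chan InstanceUpdate]struct{}{}
	}
	h.subscribers[instanceID][ch] = struct{}{}
	if last, ok := h.lastUpdate[instanceID]; ok {
		ch <- last // replay on (re)connect
	}
	return ch
}

// Publish records the update and fans it out to all currently connected clients.
func (h *UpdateHub) Publish(u InstanceUpdate) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastUpdate[u.InstanceID] = u
	for ch := range h.subscribers[u.InstanceID] {
		select {
		case ch <- u:
		default: // drop rather than block on a slow client
		}
	}
}

func main() {
	hub := NewUpdateHub()
	hub.Publish(InstanceUpdate{InstanceID: "i1", Ports: map[uint32]string{8335: "https://8335-example.invalid"}})
	ch := hub.Subscribe("i1")       // client (re)connects after the update happened
	fmt.Println((<-ch).Ports[8335]) // ...and still sees the latest state
}
```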

akosyakov commented 2 years ago

supervisor received no update for the last call: https://cloudlogging.app.goo.gl/MFq6zLwEcPrQoohN7

Also there are no errors/warnings logged in supervisor?

geropl commented 2 years ago

> Also there are no errors/warnings logged in supervisor?

Exactly. The traces I found are not super-helpful (because we cannot identify clients, and thus don't know for sure which one is connected or not; a good idea for the future would be to include identifying info in the User-Agent or other headers).
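
As a hedged illustration of that header idea (not what any Gitpod component actually does today): a client could identify itself when dialing the server's websocket API, so traces and logs can tell clients apart. The URL and User-Agent value below are made up; only the gorilla/websocket call itself is real:

```go
// Sketch: dial a websocket endpoint with an identifying User-Agent header.
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

func main() {
	header := http.Header{}
	// Purely illustrative value; the point is that it names the client and its version.
	header.Set("User-Agent", "gitpod/supervisor-example 0.0.1")

	conn, _, err := websocket.DefaultDialer.Dial("wss://example.invalid/api/v1", header)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	log.Println("connected with an identifying User-Agent")
}
```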

akosyakov commented 2 years ago

> a good idea for the future would be to include identifying info in the User-Agent or other headers

https://github.com/gitpod-io/gitpod/issues/7166

geropl commented 2 years ago

@akosyakov @corneliusludmann Since https://github.com/gitpod-io/ops/pull/406 has been applied to prod clusters (yesterday, 16.12) I cannot reproduce this issue. I will close this for now. It would be awesome if one of you could cross-verify nonetheless. :slightly_smiling_face:

akosyakov commented 2 years ago

@geropl Just try to start as many ports as you can and see whether it happens? @iQQBot Could you try? If it does not happen then we should revert https://github.com/gitpod-io/gitpod/pull/7083 and close https://github.com/gitpod-io/gitpod/pull/7058 + https://github.com/gitpod-io/gitpod/issues/6778

geropl commented 2 years ago

I do it with `curl lama.sh | LAMA_PORT=8331 sh` and increment the port with every call.
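
If someone wants to reproduce without lama.sh, a rough stand-in (an assumption, not the actual script) is any process that opens listeners on consecutive ports, one at a time, so each newly detected port triggers its own expose/update round trip:

```go
// Rough stand-in for the lama.sh loop above: open HTTP listeners on
// consecutive ports with a pause between them. Port range and delay are arbitrary.
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	for port := 8331; port <= 8340; port++ {
		p := port
		go func() {
			addr := fmt.Sprintf(":%d", p)
			log.Printf("listening on %s", addr)
			err := http.ListenAndServe(addr, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
				fmt.Fprintf(w, "hello from port %d\n", p)
			}))
			log.Printf("listener on %s stopped: %v", addr, err)
		}()
		// Give each port time to show up (or get stuck in "detecting")
		// before opening the next one.
		time.Sleep(30 * time.Second)
	}
	select {} // keep all listeners alive
}
```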

corneliusludmann commented 2 years ago

@geropl @akosyakov Happens to me right now again for workspace tomato-buzzard-nqdmbjxe and port 1234.

geropl commented 2 years ago

@corneliusludmann We don't have any (relevant) tracing data for tomato-buzzard-nqdmbjxe :cry:

Will try to reproduce again...

geropl commented 2 years ago

I cannot reproduce this. :confused:

During the last two weeks we have seen one kind of openPort-related error in the traces: `operation not permitted: missing update permission on workspaceInstance` (example trace).

As this has been dragging on for rather long, has resulted in a couple of fixes/improvements, but is still hard to pinpoint, I suggest the following:

@akosyakov @corneliusludmann D'accord? (please re-open if not)

akosyakov commented 2 years ago

I think @jeanp413 mentioned it yesterday.

corneliusludmann commented 2 years ago

@geropl For me, it still happens from time to time that ports get stuck in "detecting". Right now it's workspace copper-fox-anmqze3a and port 8080 that is affected.

jeanp413 commented 2 years ago

Yeah, it happened to me last week (or the week before that). Even restarting the workspace (force stop and start it again) didn't fix it; eventually it fixed itself after the workspace had been stopped for some time.

geropl commented 2 years ago

> Yeah, it happened to me last week (or the week before that)

Sorry, I don't see how I could act upon these "anecdotal" hints. :confused: I at least need a workspaceId/instanceId and a time range of ~1h to find sth.

> For me, it still happens from time to time that ports get stuck in "detecting". Right now it's workspace copper-fox-anmqze3a and port 8080 that is affected.

Besides other (maybe related?) errors regarding rate-limiter and "missing update permission" I don't see anything regarding this in either traces or logging (logs). :shrug:

Again, I don't want to play this down, nor avoid investigation or anything, but IMO we've sunk enough time into this already with manual testing and fiddling around. The only way forward here is to have a bigger dataset from traces or logs from the caller side (supervisor).

geropl commented 2 years ago

Re-opening because of new reports:

akosyakov commented 2 years ago

While working on the JetBrains Gateway plugin I also noticed that events are very often delayed and lost. That might, though, be affected by the performance of the preview environments themselves.

geropl commented 2 years ago

> very often delayed and lost

@akosyakov I think it makes sense if we meet again (soon) and come up with a strategy to pin down the root cause. The data from `server` (cf. #7550) suggests that client and server do not agree on who is connected and who is not.

jeanp413 commented 2 years ago

@geropl FYI I got this workspace https://gold-tuna-qx6w1yza.ws-us25.gitpod.io/ right now and it can't detect the running ports

geropl commented 2 years ago

@jeanp413 I can find the trace for your case, but cannot see anything wrong. :upside_down_face: I'm preparing a PR to get more insights from the client side. Also, we will get more conclusive insights to (maybe) rule out server behavior tomorrow.

geropl commented 2 years ago

Assigning to team IDE for the work on PR: https://github.com/gitpod-io/gitpod/pull/7574

Gaberoids commented 2 years ago

What's the current workaround for this?

When I used to get "8000 detecting..." I would "stop" the workspace and "open" it again. This would get port 8000 working fine.

However, now (around Jan 11th, 12th, 13th), when I stop and then open the workspace, I don't get any port opened at all, and "8000 detecting..." doesn't show anymore either.

Multiple workspaces are showing this behavior:

It seems like opening a new workspace from GitHub fixes the issue, but if there is a way to resurrect the ports in the current workspaces mentioned above, that would be great.

Thank you

geropl commented 2 years ago

> What's the current workaround for this?

@Gaberoids I used to open/close another port; this fixed it for me.

> However, now, when I stop and then open the workspace, I don't get any port opened at all,

What do you mean by "now"? "Since today", "since last week", "right now"? (Trying to pinpoint a release.) Thanks!

Gaberoids commented 2 years ago

> What's the current workaround for this?
>
> @Gaberoids I used to open/close another port; this fixed it for me.
>
> However, now, when I stop and then open the workspace, I don't get any port opened at all,
>
> What do you mean by "now"? "Since today", "since last week", "right now"? (Trying to pinpoint a release.) Thanks!

I found your workaround useful. I flipped a port to "private" and it made the pending port 8000 work just fine. This workaround is useful when there are ports to manipulate. "Now" (around Jan 11th, 12th, 13th), when I stop and then open the workspace, sometimes I don't get any port opened at all, and "8000 detecting..." doesn't show anymore either.

Since this issue has been closed with a fix, I am assuming it was not related to my problem, since my problem has not been fixed. For this reason I am using issue #7158 to continue describing it.

New finding about how to reproduce the issue: I am fairly confident that my issue is reproducible when my workspace closes for inactivity. This explains why the issue was intermittent. For example:

0- Open the workspace:

1- On the terminal, type: `python3 manage.py runserver`

2- Let the workspace shut down due to inactivity.

3- On the same browser tab where step 2 happened, click the button to reopen the workspace or reload the page.

4- Run step 1 again.

5- Unexpected: I see the port status as "8000 detecting..."; port 8000 does not work and the site cannot be opened in a new tab, OR no ports are showing in the "Remote Explorer" view (see image below). Expected: I should be able to open the site using port 8000.

...

iQQBot commented 2 years ago

@Gaberoids Your problem is related to https://github.com/gitpod-io/gitpod/pull/7621, but it hasn't been deployed to production yet. Please wait for that deployment.

iQQBot commented 2 years ago

Also, please don't post workspace URLs with share enabled in public, as this will leak your Git information.

iQQBot commented 2 years ago

@Gaberoids This commit is deployed; please try it again. Thanks!