/schedule
@geropl: Issue scheduled in the meta team (WIP: 0)
/assign
Talked to @corneliusludmann:
- `server` high event loop lag is related (?)
- Continue investigation with one example from @corneliusludmann from this morning: `brown-panther-rxvqz0uw`
This is a workspace issue indeed. Will work on a fix.
/assign
/team meta
Taking over again and trying to reproduce
@csweichel @corneliusludmann What's a good way to reproduce? Maybe in devstaging? Started ~10 workspaces in prod and don't see it :eyes:
Unfortunately, I haven't found a way to reproduce this. :disappointed: It only happens in prod from time to time.
Ha, got lucky now! :four_leaf_clover:
From my testing it looks like - now that the workspace issue is fixed - this might be related to reconnects. I can now "fix" this by opening another port: if that succeeds, the update for the first port arrives as well, and resolves the sticky "detecting...".
I see `ws-manager` emits the correct event now. With honeycomb this would be instantly solvable... We have the traces now, but Jaeger cannot filter nested structs :cry:
Will pause work here and wait until we have traces in honeycomb (tomorrow).
Do you think it makes sense to install a mitigation in supervisor for the future anyway?
Not so much for this specific case. But if there is a general solution to be more robust against API failures that'd be interesting IMO.
Using honeycomb this morning to query failing `openPort` calls:
- `cannot update workspace pod`: https://console.cloud.google.com/logs/query;cursorTimestamp=2021-12-09T08:41:59.730Z;query=resource.type%3D%22k8s_container%22%0A%22Request%20openPort%22%0A%22amethyst-buzzard-3yo2yacz%22%0Atimestamp%3D%222021-12-09T08:41:59.730Z%22%0AinsertId%3D%221wihjy3p4ztswwba%22;summaryFields=jsonPayload%252Fpayload%252Fargs%252F0:false:32:beginning;timeRange=PT1H?project=gitpod-191109
- `Error: Unknown workspace manager "us01"`: https://console.cloud.google.com/logs/query;cursorTimestamp=2021-12-09T08:26:55.640Z;query=resource.type%3D%22k8s_container%22%0A%22Request%20openPort%22%0A-jsonPayload.error:%22Error:%202%20UNKNOWN:%20cannot%20update%20workspace%20pod%22%0Atimestamp%3D%222021-12-09T08:26:40.948Z%22%0AinsertId%3D%22fze56yti0qupvr0p%22;summaryFields=jsonPayload%252Fpayload%252Fargs%252F0:false:32:beginning;timeRange=PT1H?project=gitpod-191109
Looks like it could be related to this bug: https://github.com/gitpod-io/ops/pull/406 /cc @princerachit
@csweichel the log entry above shows `Error: 2 UNKNOWN: cannot update workspace pod: Operation cannot be fulfilled on pods "ws-0c0c12f6-e21e-44c3-8f85-91782097bf72": the object has been modified; please apply your changes to the latest version and try again`. Is the workspace fix already deployed to prod?
Not yet, we're about to shift traffic to the new clusters today
> Not so much for this specific case. But if there is a general solution to be more robust against API failures that'd be interesting IMO.
We did not observe API failures in the supervisor but missing events. We could try to force sync on reconnect and/or if a port is not exposed within 1 min. But it makes things complex. I see 2 issues:
@akosyakov I see plenty of failed `openPort` requests in the logs, which cause no port to be opened and never trigger an update at all. Does supervisor log those errors? Is maybe something broken wrt error handling/transmission? Does it retry on error?
I will try to reproduce now and have a concrete example with traces.
Yes, I think it should log them in the following places:
Reproduced with another one of my workspaces (`yellow-weasel-1s0rp7u2`/`dd931bcb-5aaf-4a25-a514-83744ee973d2`). I tried to open ports in a row until it happened (8331, 8332, ..., 8334, 8335). Port 8334 was the last working one (13:56:46.000 CET), 8335 the broken one (13:58:52.000 CET):
Findings:
- `openPort` errors for that workspace in that timerange: logs
- `supervisor` received no update for the last call: https://cloudlogging.app.goo.gl/MFq6zLwEcPrQoohN7
- Number of forwarded clients directly relates to how many clients we (think) we have.

So this really is about websocket connections, how we do bookkeeping in `server` and what we do on reconnect (e.g., not re-sending the latest update). Have to quit for today, will prepare a PR tomorrow/later tonight.
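If the bookkeeping hypothesis above is right, one fix direction is for `server` to replay the latest known instance update whenever a client (re)connects, instead of assuming the client already has it. The actual `server` code is TypeScript (`index.ts`); the following is only a minimal Go sketch of that idea, and all names (`Hub`, `Register`, `Broadcast`, `InstanceUpdate`) are made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// InstanceUpdate is a stand-in for the instance status payload (hypothetical).
type InstanceUpdate struct {
	InstanceID string
	Ports      []int
}

// Hub tracks per-instance subscribers and the last update that was broadcast.
// All type and method names here are illustrative, not the real server code.
type Hub struct {
	mu          sync.Mutex
	subscribers map[string][]chan InstanceUpdate // instanceID -> client channels
	lastUpdate  map[string]*InstanceUpdate       // instanceID -> latest known update
}

func NewHub() *Hub {
	return &Hub{
		subscribers: map[string][]chan InstanceUpdate{},
		lastUpdate:  map[string]*InstanceUpdate{},
	}
}

// Register adds a (re)connecting client and immediately replays the latest
// known update, so a client that missed events while disconnected does not
// stay stuck waiting for one. The channel should be buffered.
func (h *Hub) Register(instanceID string, ch chan InstanceUpdate) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.subscribers[instanceID] = append(h.subscribers[instanceID], ch)
	if last := h.lastUpdate[instanceID]; last != nil {
		select {
		case ch <- *last: // re-send latest update on (re)connect
		default:
		}
	}
}

// Broadcast records the update and fans it out to all current subscribers.
func (h *Hub) Broadcast(u InstanceUpdate) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastUpdate[u.InstanceID] = &u
	for _, ch := range h.subscribers[u.InstanceID] {
		select {
		case ch <- u:
		default: // a slow client misses this event but would catch up on reconnect
		}
	}
}

func main() {
	h := NewHub()
	// The update for the port arrives while no client is listening...
	h.Broadcast(InstanceUpdate{InstanceID: "i1", Ports: []int{8335}})

	// ...but a client that (re)connects afterwards still gets the latest state.
	ch := make(chan InstanceUpdate, 1)
	h.Register("i1", ch)
	fmt.Println(<-ch) // prints: {i1 [8335]}
}
```

The key line is the replay in `Register`; everything else is ordinary pub/sub bookkeeping.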
Also there are no errors/warnings logged in supervisor?
Exactly. The traces I found are not super-helpful (because we cannot identify clients, and thus don't know for sure which one is connected or not: a good idea for the future would be to include client info in `User-Agent` or other headers).
@akosyakov @corneliusludmann Since https://github.com/gitpod-io/ops/pull/406 has been applied to prod clusters (yesterday, 16.12) I cannot reproduce this issue. I will close this for now. It would be awesome if one of you could cross-verify nonetheless. :slightly_smiling_face:
@geropl Just try to start as many ports as you can and see whether it happens? @iQQBot Could you try? If it does not happen then we should revert https://github.com/gitpod-io/gitpod/pull/7083 and close https://github.com/gitpod-io/gitpod/pull/7058 + https://github.com/gitpod-io/gitpod/issues/6778
I do it with `curl lama.sh | LAMA_PORT=8331 sh` and increment the port with every call.
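For anyone without lama.sh handy, here is a purely illustrative Go helper (not part of the repo) that does roughly the same: it listens on a run of sequential ports with a pause in between, so each newly detected port can be checked for the stuck "detecting" state.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Open listeners on sequential ports, one every 30s, so each shows up as a
	// newly detected port in the workspace and can be checked individually.
	for port := 8331; port <= 8340; port++ {
		l, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
		if err != nil {
			fmt.Println("could not listen on port", port, "-", err)
			continue
		}
		defer l.Close() // keep every port open until the program is stopped
		fmt.Printf("listening on %d - check whether it gets past \"detecting\"\n", port)
		time.Sleep(30 * time.Second)
	}
	select {} // block forever so the ports stay open
}
```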
@geropl @akosyakov Happens to me right now again for workspace `tomato-buzzard-nqdmbjxe` and port 1234.
@corneliusludmann We don't have any (relevant) tracing data for `tomato-buzzard-nqdmbjxe` :cry:
Will try to reproduce again...
I cannot reproduce this. :confused:
During the past two weeks we have seen one kind of `openPort`-related error in the traces: `operation not permitted: missing update permission on workspaceInstance` (example trace)
As this is dragging on for rather long, has resulted in a couple of fixes/improvements, but is still hard to pinpoint, I suggest the following:
- `supervisor` to print a specific message exactly for this case (already done?)

@akosyakov @corneliusludmann D'accord? (please re-open if not)
I think @jeanp413 mentioned it yesterday.
@geropl For me, it still happens from time to time that ports get stuck in detecting. Right now it's workspace `copper-fox-anmqze3a` and port 8080 that is affected.
Yeah, it happened to me last week (or the week before that). Even restarting the workspace (force stop and start the workspace again) didn't fix it; eventually it fixed itself after the workspace had been stopped for some time.
Sorry, I don't see how I could act upon these "anecdotal" hints. :confused: I at least need a workspaceId/instanceId and a time range of ~1h to find something.
> For me, it still happens from time to time that ports get stuck in detecting. Right now it's workspace copper-fox-anmqze3a and port 8080 that is affected.
Besides other (maybe related?) errors regarding rate-limiter and "missing update permission" I don't see anything regarding this in either traces or logging (logs). :shrug:
Again, I don't want to play this down, nor avoid investigation or anything, but IMO we have sunk enough time into this already with manual testing and fiddling around. The only way forward here is to have a bigger dataset from traces or logs from the caller side (`supervisor`).
Re-opening because of new reports:
While working on the JetBrains Gateway plugin I also noticed that events are very often delayed and lost. Though that may be affected by the performance of preview envs themselves.
@akosyakov I think it makes sense if we meet again (soon) and come up with a strategy to pin down the root cause. The data from server (cmp. #7550) suggests that client and server do not agree upon who is connected and who is not.
@geropl FYI I got this workspace https://gold-tuna-qx6w1yza.ws-us25.gitpod.io/ right now and it can't detect the running ports
@jeanp413 I can find the trace for your case, but cannot see anything wrong. :upside_down_face: I'm preparing a PR to get more insights from the client side. Also, we'll get more conclusive insights to (maybe) rule out `server` behavior tomorrow.
Assigning to team IDE for the work on PR: https://github.com/gitpod-io/gitpod/pull/7574
What's the current workaround for this?
When I used to get "8000 detecting.." I used to "stop" the workspace and "open" it again. This would get the 8000 port working fine for me.
However, now (around Jan 11th, 12th, 13th), when I stop and then open the workspace, I don't get any port opened at all, and "8000 detecting.." doesn't show anymore either.
Multiple workspaces are having this behavior:
It seems like opening a new workspace from GitHub fixes the issue, but if there is a way to resurrect the ports in the current workspaces mentioned above, it would be great.
Thank you
> What's the current workaround for this?
@Gaberoids I used to open/close another port, this fixed it for me.
> However, now, when I stop then open the workspace, I don't get any port opened at all,
What do you mean by "now"? "Since today", "since last week", "right now"? (trying to pinpoint a release) Thx!
I found your workaround useful. I flipped a port to "private" and it made the pending 8000 port change to working just fine. This workaround is useful when there are ports to manipulate. "Now" (around Jan 11th, 12th, 13th), when I stop and then open the workspace, sometimes I don't get any port opened at all, and "8000 detecting.." doesn't show anymore either.
Since this case has been closed with a fix, I am assuming it was not related to my problem, as my problem has not been fixed. For this reason I am using issue #7158 to continue describing the issue.
New finding about how to reproduce the issue: I am fairly confident that my issue is reproducible when my workspace closes for inactivity. This explains why the issue was intermittent. For example:
0- Open workspace:
1- On the terminal type: `python3 manage.py runserver`
2- Let the workspace shut down due to inactivity.
3- On the same browser tab where step 2 happened, click the button to reopen the workspace or reload the page.
4- Run step 1 again.
5- Unexpected: I see the port status as "8000 detecting..", port 8000 does not work, and the site cannot be opened in a new tab. OR no ports are showing in the "Remote explorer" section (see image below). Expected: I should be able to open the site using port 8000.
...
@Gaberoids Your problem is related to https://github.com/gitpod-io/gitpod/pull/7621, but it hasn't been deployed to production yet. Please wait for this deploy.
Also, please don't post workspace URLs with share enabled in public, as this will leak your Git information.
@Gaberoids this commit is deployed, please try it again. Thanks!
We observed in production that from time to time `server` stops sending events (instance updates) to `supervisor`. We see this in the following symptoms:
- Ports get stuck in the `detecting` state because we don't get an instance update in `expose-ports.go` (issue #6778). Once this happens in a workspace, it does not recover anymore and all subsequently exposed ports are affected. These workspaces seem to be always also affected by the following symptom.
- A second symptom in `index.ts` (issue #6931)

I was not able to reproduce this in a core dev preview env.
Possible Cause
This could be because of this change: https://github.com/gitpod-io/gitpod/pull/6748
Impact
The ports get stuck issue is pretty serious since it looks to our users as if the port is not ready (but in fact, it is exposed). They also don't get the port URL and the buttons to make the port public. On the IDE team, we are investigating a mitigation that asks for an update after port expose in case we don't see an update after some time (e.g. 1 minute). However, that doesn't fix the actual cause of this problem.
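For reference, a minimal sketch of the mitigation described above: after exposing a port, wait for the instance update, and if none arrives within ~1 minute, actively ask for the current status instead of staying in "detecting" forever. The names used here (`awaitExposure`, `fetchStatus`, `updates`, `PortStatus`) are hypothetical; this is not the actual code in `expose-ports.go`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// PortStatus is a stand-in for the port state supervisor tracks (hypothetical).
type PortStatus struct {
	Port    int
	Exposed bool
}

// awaitExposure waits for an instance update confirming that the port is
// exposed. If no such update arrives within timeout, it forces a sync by
// asking for the current status instead of waiting forever.
func awaitExposure(ctx context.Context, port int, updates <-chan PortStatus,
	fetchStatus func(ctx context.Context, port int) (PortStatus, error),
	timeout time.Duration) (PortStatus, error) {

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	for {
		select {
		case <-ctx.Done():
			return PortStatus{}, ctx.Err()
		case s := <-updates:
			if s.Port == port && s.Exposed {
				return s, nil // the regular path: an instance update arrived
			}
		case <-timer.C:
			// Mitigation: no update in time, ask for the status explicitly.
			s, err := fetchStatus(ctx, port)
			if err != nil {
				return PortStatus{}, err
			}
			if !s.Exposed {
				return PortStatus{}, errors.New("port still not exposed after forced sync")
			}
			return s, nil
		}
	}
}

func main() {
	// Simulate the buggy case: the update for port 8335 never arrives,
	// but a direct status query shows the port is in fact exposed.
	updates := make(chan PortStatus)
	fetch := func(ctx context.Context, port int) (PortStatus, error) {
		return PortStatus{Port: port, Exposed: true}, nil
	}

	s, err := awaitExposure(context.Background(), 8335, updates, fetch, 2*time.Second)
	fmt.Println(s, err) // prints {8335 true} <nil> after ~2s instead of hanging
}
```

As noted above, this only papers over missed events; the root cause (server not sending the update) still needs fixing.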