Change https://go.dev/cl/534976 mentions this issue: main.star: add openbsd-ppc64, linux-riscv64, freebsd-riscv64 builders
Thanks. Here's the resulting certificate: openbsd-ppc64-n2vi-1697128325.cert.txt.
I've mailed CLs to define your new builder in LUCI and will comment once that's done.
Thank you; I confirm that using the cert I get a plausible-looking luci_machine_tokend/token.json.
I have not read the code yet to diagnose this; leaving assigned to me.
2023/10/13 18:29:39 Bootstrapping the swarming bot with certificate authentication
2023/10/13 18:29:39 retrieving the luci-machine-token from the token file
2023/10/13 18:29:39 Downloading the swarming bot
2023/10/13 18:29:39 Starting the swarming bot /home/swarming/.swarming/swarming_bot.zip
72354 2023-10-13 18:29:47.331 E: ts_mon monitoring is disabled because the endpoint provided is invalid or not supported:
72354 2023-10-13 18:29:48.890 E: Request to https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake failed with HTTP status code 403: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
72354 2023-10-13 18:29:48.891 E: Failed to contact for handshake, retrying in 0 sec...
I don't see anything in the code or logs here that helps me diagnose this. It just looks like the server didn't like the token.json that had been refreshed just a minute before.
Maybe someone there can check the server-side LUCI logs? I'm unable to reassign to dmitshur; I hope someone there sees this.
Thanks for the update.
I recall there was a similar looking error in https://github.com/golang/go/issues/61666#issuecomment-1706534225. We'll take a look.
In case it helps... I set both -token-file-path on the bootstrapswarm command line and also LUCI_MACHINE_TOKEN in the environment. The logs don't indicate any trouble reading the token.json file, though they're not very explicit.
I appreciate that there have been serious security flaws in the past caused by too-detailed error messages. But I'd venture that it is safe for LUCI to say more than "403".
I recognize I'm a guinea pig for the Go LUCI stuff, so happy to give you a login on t.n2vi.com if you would find it easier to debug directly or hop on a video call with screen sharing.
Finally, I recognize I'm a newcomer to Go Builders. So it could well be user error here.
Thanks for your patience as we work through this and smooth out the builder onboarding process.
I set both -token-file-path on the bootstrapswarm command line and also LUCI_MACHINE_TOKEN in the environment.
To confirm, are both of them set to the same value, which is the file path location of the token.json file? If you don't mind experimenting on your side, you can check if anything is different if you leave LUCI_MACHINE_TOKEN unset and instead rely on the default location for your OS (/var/lib/luci_machine_tokend/token.json, I believe).
We'll keep looking into this on our side. Though next week we might be somewhat occupied by a team event, so please expect some delays. Thanks again.
Yes, both are set to the same value, /home/luci/luci_machine_tokend/token.json. (My OS doesn't have /var/lib, and anyway I'm not a fan of leaving cleartext credentials in obscure corners of the filesystem.)
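Concretely, the combination being described looks roughly like this (a sketch; the -hostname flag is assumed from the builder wiki rather than quoted from the actual command line, and any other flags are omitted):

    # both the flag and the env var point at the same, non-default token location
    export LUCI_MACHINE_TOKEN=/home/luci/luci_machine_tokend/token.json
    bootstrapswarm -token-file-path "$LUCI_MACHINE_TOKEN" -hostname openbsd-ppc64-n2vi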
This morning I've retried the same invocation of bootstrapswarm as before and don't get the 403 Client Error. So maybe there was just a transient issue.
Happy to set this effort on the shelf for a week or two; enjoy the team event!
CC @golang/release.
Over the last week I tried running swarm a few more times with no problems, so whatever issue I saw before indeed seems transient. I never saw swarm do any actual work, presumably because some server-side table still has my machine in the old-builder state rather than the new-builder one. Fine by me.
I'll have limited ability to work on it from November 8 - 20, but happy to work on it during the next few days if you're waiting on me.
The builder is currently in a "Quarantined—Had 6 consecutive BOT_DIED tasks" state. @n2vi Can you please restart the swarming bot on your side and see if that's enough to get it out of that state?
We've applied changes on our side (e.g., CL 546715) that should help avoid this repeating, but it's possible more work will be needed. Let's see what happens after you restart it next time. Thanks.
Restarted.
There is a message from swarming_bot's urllib3 that it only supports OpenSSL but this system is built against LibreSSL. From the GitHub issue cited, I think this is merely a warning, but if you're otherwise stuck I can investigate deeper.
Or if there is something else I can do to help, just ask.
FWIW, on my other openbsd-ppc64 machine, I've succeeded in compiling go1.22-devel from tip without change. Both machines are running what OpenBSD calls -current, i.e. compiled from source as of a few days ago.
Thanks. I saw it came back up in idle state. I gave it some work, and I see it failed with:
Could not resolve version infra/tools/cipd/openbsd-ppc64:git_revision:ec494f363fdfd8cdd5926baad4508d562b7353d4: no such package: infra/tools/cipd/openbsd-ppc64
That's useful information and on us to fix. Specifically, we need to set things up for CIPD packages to be built for the openbsd/ppc64 platform. (That will involve something somewhat similar to crrev.com/c/5086069, with the caveat that it can't be done using e.g. Go 1.21.0 since it doesn't support openbsd/ppc64 yet, whereas Go 1.22.0 will work.)
We'll update this issue once that's done.
Any news on this?
t.n2vi.net aka host-openbsd-ppc64-n2vi had been getting kernel panics from the gopher buildlet stream so, amidst power outages and other troubles, I brought software up to date in an effort to debug. Go 1.22 is running fine here.
Currently the buildlet fails to compile and while I work on that I thought I'd check in parallel on LUCI progress. If we can just cut over to the new system maybe I don't need to debug the old one?
Thanks for checking in.
I've looked into this, and as things stand now, there is an expected additional delay after the public Go 1.22.0 release before it's available for the CIPD package building pipeline to use. This delay might decrease in the future, and we might be able to work around it, but not yet.
We'll update this issue after the CIPD packages are ready. Thanks.
@dmitshur any update on this given the hard deadline for May 17th?
@n2vi Can you try again?
I'm traveling, but will try remotely soon.
Just to confirm: you'd like both the old builder and the new swarming running in parallel for now, correct?
Killed off the old processes running as swarm, rebuilt golang.org/x/build/cmd/bootstrapswarm@latest, and tried to restart but got an error message about flag -token-file-path provided but not defined.
In the morning I'll start digging into what has changed in the API since October.
As long as you still have LUCI_MACHINE_TOKEN set as needed in the environment, you should be able to drop the -token-file-path flag. It used to be required to set both to the same value, and the flag was removed in CL 548955 in favor of the env var.
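So after that change the invocation reduces to roughly this (again a sketch; -hostname shown for illustration, other flags unchanged):

    # post-CL 548955: the token path comes only from the environment
    export LUCI_MACHINE_TOKEN=/home/luci/luci_machine_tokend/token.json
    bootstrapswarm -hostname openbsd-ppc64-n2vi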
Thanks. Restarted without error messages.
Is there any work remaining before we close this issue?
Looking at the build history at https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-openbsd-ppc64?limit=200, it seems it hasn't had a chance to run with the latest changes. @n2vi Can you try restarting it again? Thanks.
That link tells me "You don't have the permission to view the machine pool", so I'm not sure I have the ability to test much.
The most recent output from bootstrapswarm was on May 4: "Downloading the swarming bot", then "status code 401".
I don't see any change from git in bootstrapswarm.go since February. Which are the "latest changes" you're referring to?
When I restart bootstrapswarm, I see the usual startup messages. If there is some dashboard I can access or other instructions I should follow, please let me know.
Sorry, I was referring to the "Ended Builds" section of that page, which you should be able to see.
Thanks for restarting the bot. I see a new recent build b8748292982685499841, and it failed to run "cipd ensure".
Errors:
failed to resolve golang/bootstrap-go/openbsd-ppc64@1.21.0 (line 9): no such package: golang/bootstrap-go/openbsd-ppc64
The problem is that we haven't placed a bootstrap for openbsd/ppc64 yet (because it's a new port). This should be quick for us to fix. I'll work on it.
Change https://go.dev/cl/584975 mentions this issue: main.star: use go1.22.0 as bootstrap for openbsd/ppc64
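(Once that CL lands and the package is uploaded, one hedged way to confirm the bootstrap resolves, from any machine with the cipd client installed, is a command like the sketch below; the 1.22.0 tag mirrors the bootstrap version named in the CL.)

    # sketch: check that the bootstrap package now exists for this platform
    cipd resolve golang/bootstrap-go/openbsd-ppc64 -version 1.22.0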
Here is the latest message from the swarming bot on openbsd-ppc64-n2vi:
28106 2024-05-13 02:23:34.925 E: Unable to open given url, https://chromium-swarm.appspot.com/swarming/api/v1/bot/poll, after 1 attempts or 240 timeout.
500 Server Error: Internal Server Error for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/poll
----------
Alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Content-length: 228
Content-type: text/html; charset=UTF-8
Date: Mon, 13 May 2024 02:23:34 GMT
Server: Google Frontend
500 Internal Server Error
The server has either erred or is incapable of performing the requested operation.
----------
28106 2024-05-13 02:23:34.925 E: Swarming poll error: Failed to contact server
Does that error happen only occasionally or does it persist even after restarts?
If the former, please try it again now that CL 584975 has landed. It should resolve the error we saw in https://github.com/golang/go/issues/63480#issuecomment-2105269618 on Friday.
Ok, killed the bootstrapswarm and python processes, restarted with the usual (benign?) startup messages at 15:57:05 UTC, but see nothing more yet from the bootstrapswarm output. The Builders dashboard for INFRA says "Task did not start, no resource". I don't know if we're waiting for something to restart on a server somewhere or what.
I feel like I'm just stumbling around in the dark here. It might be you only need to point me to some doc that expands on "try it again now". The full LUCI documentation from chromium.org is a bit much to digest since I don't have much server side insight.
I see that the builder is now working on the build https://ci.chromium.org/b/8748038859820104945, which is looking promising. The "Task did not start, no resource" entries you saw are slightly older, from before you restarted it.
That build got pretty far, up to "ok runtime 66.123s" with tests passing. But it seems to have failed due to running out of memory afterwards. It also took about 11 min to run make.bash, and 28 min to run (partial) tests. Other openbsd builders have slow multipliers set; it seems we should do something similar here if this performance is working as intended.
Change https://go.dev/cl/585217 mentions this issue: main.star: delete openbsd/ppc64 builders on release-branch.go1.21
Sure, set the multiplier if you want. Power9 processors are not inherently slow, but there's no telling what layers of inefficiency there may be in all this. The same 8-core machine is running bootstrapswarm and the gopher buildlet; to me, the load average of 14 to 18 does seem like a lot.
We're not actually out of RAM; the machine has 23GB free at the moment. But there may be some ulimits that need to be adjusted somewhere. I did configure the login class to give these processes more than OpenBSD's default small values, but I'm not sure how much you need.
Thanks. I think you should let the LUCI version of the builder run for some time, and when it seems stable, feel free to stop the coordinator instance on your side to free up the resources. The only reason to keep the coordinator instance is if you're not quite ready to switch yet, but it needs to happen at some point since the coordinator will be going away.
I'll update CL 585217 to give it a timeout scale for now, especially since it's running builds for both LUCI and coordinator, and we can adjust it later on as it becomes more clear what the optimal value is.
As of 18:10 UTC, rebooted openbsd-ppc64-n2vi with datasize-max=8192M for swarming. If 8GB of RAM is not enough we have other problems. Did not restart gopher buildlet yet. Let's see how high it ramps up with nothing but swarming.
This eventually panic'd the kernel with an allocation failure. Restarting now (20:22 UTC) to see how reproducible this is.
(But the tests are not automatically restarting. The "Retry Build" button on the Builder Dashboard is grayed out for me; perhaps someone there can kick it?)
Still wasn't seeing anything running, so killed off the python and bootstrapswarm processes and restarted. They just report status code 401 and immediately exit. I'm not illuminated by looking at the Ended Builds list either.
Thanks for working on this.
I'm not illuminated by looking at the Ended Builds list either.
I failed to realize this sooner, but our configuration intends to make it possible for you to see the machine pool (see "poolViewer" granted to group "all" here).
I believe it currently requires you to sign in (any account will work), then you can view the contents of the "Machine Pool" links such as https://chromium-swarm.appspot.com/botlist?f=cipd_platform%3Aopenbsd-ppc64&f=pool%3Aluci.golang.shared-workers. You should see the openbsd-ppc64-n2vi bot listed there.
And clicking on the bot name will take you to https://chromium-swarm.appspot.com/bot?id=openbsd-ppc64-n2vi where you'll find more information about its current state from LUCI's perspective. Apologies about the additional overhead at this time to get to this information.
Since you've done some restarts, it might help to confirm that the luci_machine_tokend process is still working as described in step 4 of https://go.dev/wiki/DashboardBuilders#how-to-set-up-a-builder-1, and that the token file it writes has new content, which is propagated to bootstrapswarm.
If that isn't where the problem is, is there more information included in the status code 401 message, beyond "Downloading the swarming bot" and "status code 401"? Also, is there more useful context in the local swarming bot log?
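Something like this on the builder would be a quick sanity check (a sketch; the token path is the non-default one you mentioned):

    # token.json should be rewritten periodically by luci_machine_tokend,
    # so a recent mtime is the simplest sign that step 4 is still working
    ls -l /home/luci/luci_machine_tokend/token.json
    # and confirm bootstrapswarm is still running (it reads LUCI_MACHINE_TOKEN at startup)
    pgrep -lf bootstrapswarm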
sign in (any account will work)
Thanks, that was a crucial clue. I'd tried signing in before, but was put off by the "grant write access to all your git repositories" warning. Doing it with a less powerful account is fine.
Now that the bot is getting work again we'll see if we can reproduce the pagedaemon kernel panic. Not that I'm a kernel developer by any means, but gotta learn sometime! I recognize that this is a sufficiently unusual platform and workload that it is not inconceivable that we step on a new corner case.
No kernel crashes yet, just running all the way to Failure. :)
I'm still trying to understand more about the build output, in particular what "resource temporarily unavailable" means specifically. Is it running into a user process limit for forking? The login.conf here sets maxproc-max=256, maxproc-cur=128. Do the tests need more processes than that?
One probably unrelated item caught my eye: /var/log/secure reports
May 15 20:48:40 t doas: command not permitted for swarming: chmod 0777 /home/swarming/.swarming/w/ir/x/t/go-build3552889220
All those files are already owned by user "swarming" so why would the software be trying to become root? I do recall seeing (and being horrified by) all.bash trying to become root. That's when I switched to only running make.bash on most of my machines. It is ok here on t.n2vi.net=openbsd-ppc64-n2vi for you to be root if you have to; I'm assuming arbitrarily bad stuff may happen when running a builder machine. Just let me know if you really need it.
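(If it turns out the build really does need that doas call, I could scope a rule to just that one command, something like the sketch below in /etc/doas.conf; but whether it's needed at all is the question.)

    # hypothetical /etc/doas.conf line: let the swarming user run only chmod as root
    permit nopass swarming as root cmd chmod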
Overnight, we captured another kernel panic that closely resembles the earlier one. I'll get back to you when I make progress on this; it may be quite a while. LUCI appropriately marks me as offline for the duration.
status update; no need to respond...
Found a recent patch to openbsd powerpc64 pagedaemon pmac.c that may be relevant, so upgraded t.n2vi.net from -stable to -snapshot.
Now the previously-ok luci_machine_tokend dumps core with a pinsyscalls error on the console, so rebuilt with the nineteen-line install sequence from https://pkg.go.dev/go.chromium.org/luci and a freshly compiled go1.22.3. This now seems to be generating a new token.json ok.
Rebuilt and restarted bootstrapswarm. The LUCI Builders dashboard shows the machine now as Idle; based on past experience, in an hour or two it will actually start delivering work without further attention. I'll periodically monitor to be sure that happens, and then over the next couple days we'll see if the kernel panic re-occurs.
I do suspect we're stepping on a pagedaemon bug that occasionally crashes the machine, but it is getting LUCI work enough done that perhaps Gophers can make their own independent progress while I pursue the OpenBSD issue.
Restarted swarm with twice the process ulimit. Let's see if that reduces the number of fork/exec failures. [for the record: maxproc-max=512, maxproc-cur=256 suffices]
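For the record, the login class changes across these comments add up to roughly the following in /etc/login.conf (a sketch; the class name is illustrative, the values are the ones reported above, and the swarming user must be assigned to this class):

    # login class for the swarming user with the limits that sufficed here
    swarming:\
        :datasize-max=8192M:\
        :maxproc-max=512:\
        :maxproc-cur=256:\
        :tc=default:

Rebuild /etc/login.conf.db with cap_mkdb if you use the db form; the new limits take effect at the next login.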
No recent kernel crashes.
My builder machine is fine, no crashes, but I see that the dashboard thinks it is offline. Here is a tail -50 of nohup.out. I believe the ball is back in your court...
Traceback (most recent call last):
File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 1831, in rbe_poll
self._rbe_session = remote_client.RBESession(
File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 901, in __init__
resp = remote.rbe_create_session(dimensions, bot_version,
File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 587, in rbe_create_session
raise RBEServerError('Failed to create RBE session, see bot logs')
bot_code.remote_client_errors.RBEServerError: Failed to create RBE session, see bot logs
38455 2024-05-22 18:26:05.889 E: Unable to open given url, https://chromium-swarm.appspot.com/swarming/api/v1/bot/rbe/session/create, after 1 attempts or 240 timeout.
429 Client Error: Too Many Requests for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/rbe/session/create
----------
Alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Content-length: 284
Content-type: text/plain; charset=utf-8
Date: Wed, 22 May 2024 18:26:05 GMT
Server: Google Frontend
rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Create Bot Session requests per project' and limit 'Create Bot Session requests per project per minute per region' of service 'remotebuildexecution.googleapis.com' for consumer 'project_number:575346572923'.
----------
38455 2024-05-22 18:26:05.889 E: Failed to open RBE Session: Failed to create RBE session, see bot logs
Traceback (most recent call last):
File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 1831, in rbe_poll
self._rbe_session = remote_client.RBESession(
File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 901, in __init__
resp = remote.rbe_create_session(dimensions, bot_version,
File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 587, in rbe_create_session
raise RBEServerError('Failed to create RBE session, see bot logs')
bot_code.remote_client_errors.RBEServerError: Failed to create RBE session, see bot logs
38455 2024-05-22 18:26:09.744 E: Unable to open given url, https://chromium-swarm.appspot.com/swarming/api/v1/bot/rbe/session/create, after 1 attempts or 240 timeout.
429 Client Error: Too Many Requests for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/rbe/session/create
----------
Alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Content-length: 284
Content-type: text/plain; charset=utf-8
Date: Wed, 22 May 2024 18:26:09 GMT
Server: Google Frontend
rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Create Bot Session requests per project' and limit 'Create Bot Session requests per project per minute per region' of service 'remotebuildexecution.googleapis.com' for consumer 'project_number:575346572923'.
----------
38455 2024-05-22 18:26:09.744 E: Failed to open RBE Session: Failed to create RBE session, see bot logs
Traceback (most recent call last):
File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/bot_main.py", line 1831, in rbe_poll
self._rbe_session = remote_client.RBESession(
File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 901, in __init__
resp = remote.rbe_create_session(dimensions, bot_version,
File "/home/swarming/.swarming/swarming_bot.1.zip/bot_code/remote_client.py", line 587, in rbe_create_session
raise RBEServerError('Failed to create RBE session, see bot logs')
bot_code.remote_client_errors.RBEServerError: Failed to create RBE session, see bot logs
The error message above includes "quota exceeded". It seems to have been temporary. Looking at https://ci.chromium.org/ui/p/golang/g/port-openbsd-ppc64/builders, the builder seems to be stable and passing in the main Go repo and all golang.org/x repos. Congratulations on reaching this point!
Would you like to remove its known issue as the next step?
We got another pagedaemon kernel crash last night. I'm glad we're getting substantial test runs done, but we're not out of the woods yet.
I see a "context deadline exceeded" failure in the latest build. Not sure how to interpret that, but FYI as part of debugging the kernel crashes I've changed some kernel memory barriers that possibly slow page mapping changes a bit. I don't expect any large impact on system speed overall, but I'm unsure.
Following the instructions at Dashboard builders:
hostname openbsd-ppc64-n2vi
CSR is attached after renaming, since GitHub doesn't seem to allow attaching a file with the name openbsd-ppc64-n2vi.csr that you asked for.
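For anyone following the same wiki steps, generating the key and CSR comes down to something like this (a sketch; the wiki's exact openssl parameters may differ, but the CN must match the hostname above):

    # generate a private key and a CSR whose CN is the builder hostname
    openssl req -new -newkey rsa:4096 -nodes \
        -keyout openbsd-ppc64-n2vi.key \
        -out openbsd-ppc64-n2vi.csr \
        -subj "/CN=openbsd-ppc64-n2vi"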