Open efahl opened 2 weeks ago
@aparcar Does my analysis above make any sense? I just tried again, and the metadata is now updated to 26629, but ASU's imagebuilder remains back on 26608 as above...
Error: Received incorrect version r26608-b2a84c86e3 (requested r26629-7a9b1f7f25)
The strangest thing just occurred: I requested a custom build with ASU (luci-app-attendedsysupgrade); it said it had a newer build available, namely 26620. I requested it, but it came back with the old version stuck at 26593 (as I expected, due to this bug). I proceeded to upgrade to this so-called 26593 version, at least that's what ASU said it was.
After it flashed and rebooted, I checked the version in LuCI and, lo and behold, it has actually flashed the latest version 26620. What the heck is going on?
Note: ath79/generic unit WR1043ND v3
luci-app-attendedsysupgrade
There's your culprit in this case. Everything else (auc, owut, firmware selector) specifies the version_code in the build blob, which the ASU builder verifies is the same for the request and the imagebuilder (https://github.com/openwrt/asu/blob/main/asu/build.py#L72). If you leave it out, then you can sometimes get a build to go through, even when the metadata is different from the actual imagebuilder.
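The guard that produces the "Received incorrect version" error boils down to something like the following (a simplified sketch, not the actual build.py code; the function name and blob shape are illustrative):

```python
def check_version(request_blob: dict, imagebuilder_version: str) -> None:
    """Simplified sketch of the ASU builder's version guard (illustrative,
    not the real build.py). If the client supplied version_code, it must
    match the imagebuilder the container actually contains."""
    requested = request_blob.get("version_code")
    if requested and requested != imagebuilder_version:
        raise ValueError(
            f"Received incorrect version {imagebuilder_version} "
            f"(requested {requested})"
        )
    # If version_code is omitted, no check happens and the build proceeds
    # with whatever imagebuilder the server happens to have on hand.
```

Which is why omitting version_code lets stale builds slip through silently instead of failing fast.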
I experimented with an option in owut to turn off the version_code spec, but decided it was more robust to keep it and try to fix up the builds, because half (or more) of the time it didn't work anyway and just wasted build cycles (you got a different error later, like "package version mismatch" or some such thing).
Right now (2024-06-15 18:40Z) the ASU server is ahead of the downloads site:
snapshots x86/64 generic
r26634-7d9b9762c9 ASU Server
r26633-22f92cce22 Build Server
But the ASU builder is still stuck at 26608
Error: Received incorrect version r26608-b2a84c86e3 (requested r26633-22f92cce22)
My interpretation is that the webhooks are not flushing the old imagebuilders (and are firing at the start of the build, not at its completion???).
WNDRMAC v2 (ath79/generic) - I tried to build a new custom firmware with the firmware-selector, it said 26620 is available, but after build it still came back with the old 26593 from 5 days ago.
Then I checked with ASU (luci-app-attendedsysupgrade) and it said 26620 was available, I requested it, came back with 26593, I even downloaded it from the ASU interface to my PC and I confirm it's 26593 indeed. Then I proceeded to ask ASU to actually flash it and, as you might have guessed, it's now on 26620. Very strange.
This is a mess.
Also, since ath79 was bumped from kernel 6.1 to 6.6 yesterday, it doesn't even build at all anymore.
As far as I understand, it is a similar problem and is "under construction": https://forum.openwrt.org/t/auc-sysupgrade-does-not-work-for-me/196782/39
There's your culprit in this case. Everything else (auc, owut, firmware selector) specifies the version_code in the build blob, which the ASU builder verifies is the same for the request and the imagebuilder (https://github.com/openwrt/asu/blob/main/asu/build.py#L72). If you leave it out, then you can sometimes get a build to go through, even when the metadata is different from the actual imagebuilder.
In the meantime, before this problem gets fixed, I've created https://github.com/openwrt/luci/pull/7163 to add the check to luci-app-attendedsysupgrade so it behaves in the same way as auc.
My guess is that this issue has multiple layers. First, the Docker containers download from downloads.openwrt.org, which is behind a CDN. Even though the caches are cleared on every fresh target upload, there may be a timing issue where Docker containers start downloading imagebuilders while the cache is not yet fully cleared (a matter of a few seconds, I think).
Second, ASU caches successful builds for 7 days. To identify the "same" build request and allow this kind of caching, a bunch of parameters are taken into account, including version_code. Since LuCI lacked that parameter, build requests from up to 7 days ago would be returned. Thanks to @dannil this issue should be fixed for LuCI.
Right now this isn't happening for the firmware selector, so please feel free to implement it there, too.
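The caching behaviour can be pictured as hashing the request parameters into a cache key (a toy sketch; the real ASU server's key derivation differs in detail):

```python
import hashlib
import json

def request_hash(req: dict) -> str:
    """Toy illustration of deriving a build-request cache key: serialize
    the relevant parameters deterministically and hash them. If
    version_code is absent from req, two requests made days apart hash
    identically, so the older cached build gets returned."""
    canonical = json.dumps(req, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

old = request_hash({"profile": "generic", "packages": ["luci"]})
new = request_hash({"profile": "generic", "packages": ["luci"],
                    "version_code": "r26620-6b03050192"})
# Including version_code changes the key, so a stale cached build
# from before the revision bump can no longer be matched.
```

This is also why the 'add "#" to uci-defaults' trick mentioned later in this thread works: any change to the request parameters produces a different key and busts the cache.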
pm_image.remove(force=True) <<< Remove it before reporting error, or maybe 'reload'???
Using podman.images.pull() updates the container whenever something new is available; no need to remove it locally.
As you say in German, all good things come in threes: The GitHub token ran out over the weekend causing the webhook to never run any container updates.
The GitHub token ran out over the weekend causing the webhook to never run any container updates.
GAH! That explains a lot.
I'm still seeing the same issue. Ran two builds on different x86 machines, about a half hour apart.
Received incorrect version r26657-531b3f667c (requested r26670-5f9fb964c3)
ASU and downloads profiles match at version 26670, and the image has been updated since the previous run, but it is now stuck back at 26657.
Builds at https://sysupgrade.openwrt.org/api/v1/build/eaf666a98b390bda3f8896a8b4000ee0 https://sysupgrade.openwrt.org/api/v1/build/eecd821426ce6f223f079b925029be2c if that helps...
Ok, just tried again with the old 'add "#" to uci-defaults' trick to force a new hash, and it worked (i.e., got a valid image and installed it), so the pull is working as expected.
If I leave off the uci-defaults, I get the same 'incorrect version' as before, from the cached build that failed... It still seems like that remove might alleviate these "in-between" errors somewhat, at least for that specific error. (This assumes pm_image.remove works the way I think it does, which is completely untested.)
Explain this: Error: Received incorrect version r26674-29b98a0192 (requested r26620-6b03050192)
I actually requested 26674, not 26620 - that's the current version installed.
requested 26674, not 26620
What upgrade client did you use that gave this?
This typically happens when the sysupgrade metadata lags behind the build, which is the issue that Paul's webhook implementation addressed.
requested 26674, not 26620
What upgrade client did you use that gave this?
Latest luci-app-attendedsysupgrade 24.168.60561~8860ca0
requested 26674, not 26620
What upgrade client did you use that gave this?
Latest luci-app-attendedsysupgrade 24.168.60561~8860ca0
We may have borked the check in LuCI ASU with my change in https://github.com/openwrt/luci/pull/7163, see https://github.com/openwrt/luci/pull/7163#issuecomment-2176678529. For now use auc or the firmware selector to upgrade; I have a fix coming as soon as I get confirmation of my suspicions.
Just for your info: I have just upgraded from 23.05-SNAPSHOT r23901 to 23.05-SNAPSHOT r23902-50148a40d2 using luci-i18n-attendedsysupgrade-pl (git-24.167.26541-c74a06c) with the corresponding luci-app-attendedsysupgrade (git-23.339.51123-138595a), and to my surprise, it worked fine. The detected "current" and "new offered" versions were correct.
according luci-app-attendedsysupgrade (git-23.339.51123-138595a)
That would be as expected, as that version doesn't contain the version_code fix, so it works most of the time. If no version_code is supplied, then neither LuCI ASU nor the ASU build server does any checks, and you get whatever the build server happens to have on hand.
So, you mean I was just lucky this time? It wasn't working well for previous tries, yesterday and before. Is it a random process?
It's not so much luck, but rather LuCI ASU did the right thing like 99% of the time.
It all depends on the state of the ASU server's cached builds, and how they correspond to the profile metadata in those two files I linked in the opening comment (actually the ones for your specific platform).
If the version_code is not specified, as before @dannil's fix to LuCI ASU, then you get whatever happens to be the latest build (assuming there's nothing matching in the ASU server's cache), or the cached imagebuilder if there is.
This works almost all the time, because the number of upgrades people do for a specific platform isn't very big. But if someone did an upgrade in that grey period between when the imagebuilder is published and when the metadata is updated to correspond, then things would be out of whack for anyone else using the same platform, until the next build came out and the builders/metadata got updated.
The window used to be "too long", sometimes days, causing lots of issues, and that's where aparcar's "webhooks" patch came from: it reduces the window to seconds instead of minutes, hours, or sometimes "forever".
The missing version_code in LuCI ASU is also the reason that some people would try auc, get an error, then try LuCI ASU and it "would work". Well, it worked in that you got an upgrade, but it wasn't the latest; it was often the build before the latest in these specific cases.
Thank you efahl. That explains the cases when I had to remove or add packages to get the LuCI upgrade to succeed. I will monitor the situation and let you know.
We may have borked the check in LuCI ASU with my change in openwrt/luci#7163, see openwrt/luci#7163 (comment). For now use auc or the firmware selector to upgrade; I have a fix coming as soon as I get confirmation of my suspicions.
@dannil LuCI ASU is definitely broken now. I've just tried to use it and it errors out with:
Error building the firmware image
Server response: Error: Received incorrect version r26693-368441254e (requested r26679-70088a7e4c)
even though the requested version was 26693 (the new version), not 26679 (currently installed).
https://github.com/openwrt/luci/pull/7168. I will still try to figure out a way to solve the issue with LuCI ASU as well, so the version_code parameter is sent to the API. However, the solution became quite a bit more involved, since LuCI ASU also allows you to sysupgrade to the currently installed version if you check the box for advanced mode, and specifying version_code for an older revision doesn't work on snapshots (the server errors out); for auc and the firmware selector this isn't a use case, so we didn't hit it there.
specifying version_code to an older revision doesn't work on snapshots
Older snapshots isn't a use case for any upgrade tool. The only way to get a non-current snapshot is to build from source, since the snapshot imagebuilders and packages aren't archived, just deleted.
All of the following give me the proper value for upgrading to current for that version; just stick that in the request blob's revision_code and it should work:
https://sysupgrade.openwrt.org/api/v1/revision/22.03.6/x86/64
https://sysupgrade.openwrt.org/api/v1/revision/23.05.3/x86/64
https://sysupgrade.openwrt.org/api/v1/revision/23.05-SNAPSHOT/x86/64
https://sysupgrade.openwrt.org/api/v1/revision/SNAPSHOT/x86/64
(There are still some versions that have not updated properly since the webhooks commit, I've been watching 22.03-SNAPSHOT, which is stuck at 20311 on ASU server, but should be 20315 according to downloads, so avoid that in your testing.)
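Putting that together, a client could pre-fetch the current revision from one of those endpoints and pin it in the build request (a sketch; the "revision" field name in the endpoint's JSON response is an assumption based on this thread, not verified against the API docs):

```python
import json
import urllib.request

ASU = "https://sysupgrade.openwrt.org"

def current_revision(version: str, target: str) -> str:
    """Fetch the revision the ASU server currently advertises for a
    version/target pair. The response field name ("revision") is an
    assumption, not taken from official API documentation."""
    url = f"{ASU}/api/v1/revision/{version}/{target}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["revision"]

def build_request(version: str, target: str, profile: str,
                  packages: list, revision: str) -> dict:
    # Pinning version_code makes the server fail fast with
    # "Received incorrect version ..." instead of silently returning
    # an image built from a stale imagebuilder.
    return {
        "version": version,
        "target": target,
        "profile": profile,
        "packages": packages,
        "version_code": revision,
    }
```

Typical use would be `build_request("SNAPSHOT", "x86/64", "generic", ["luci"], current_revision("SNAPSHOT", "x86/64"))`.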
cc: @lorand-horvath
ASU build has been generating these errors on x86/64 snapshot for a day or so, several similar errors being reported on the forum.
Since the "janitor: drop" commit, the metadata updates rapidly, and both https://sysupgrade.openwrt.org/json/v1/snapshots/targets/x86/64/generic.json and https://downloads.openwrt.org/snapshots/targets/x86/64/profiles.json contain the correct 26620 version_code, so it seems that the previous source of this issue has been resolved.
This time it appears to me that the podman image is out of sync, behind the current downloads imagebuilder. I couldn't find anything in update.py that would invalidate or update the podman.images collection. But in build.py we know something is wrong. Just a wild, untested idea: when this specific error is detected, maybe just delete the podman image, so the next try does a fresh podman.images.pull? (At least I'm assuming that's how pull works.) In build.py:
Refs:
https://podman-py.readthedocs.io/en/stable/podman.domain.images_manager.html#podman.domain.images_manager.ImagesManager.pull
https://podman-py.readthedocs.io/en/stable/podman.domain.images.html#podman.domain.images.Image.remove