openwrt / asu

An image on demand server for OpenWrt based distributions
https://sysupgrade.openwrt.org
GNU General Public License v2.0
301 stars 78 forks source link

build.py: podman imagebuilder out of sync #853

Open efahl opened 2 weeks ago

efahl commented 2 weeks ago

cc: @lorand-horvath

ASU build has been generating these errors on x86/64 snapshot for a day or so, several similar errors being reported on the forum.

Received incorrect version r26608-b2a84c86e3 (requested r26620-6b03050192)

Since the "janitor: drop" commit the metadata updates rapidly, and both https://sysupgrade.openwrt.org/json/v1/snapshots/targets/x86/64/generic.json and https://downloads.openwrt.org/snapshots/targets/x86/64/profiles.json contain the correct 26620 version_code, so it seems that the previous source of this issue has been resolved.

This time it appears to me to be the podman image is out of sync, behind the current downloads imagebuilder. I couldn't find anything in update.py that would invalidate or update the podman.images collection.

But in build.py we know something is wrong. Just a wild, untested idea, when this specific error is detected, maybe just delete the podman image, so the next try does a fresh podman.images.pull? (At least I'm assuming that's how pull works.) In build.py:

    log.info(f"Pulling {image}...")
    pm_image = podman.images.pull(image)        <<< Save the returned image...
    log.info(f"Pulling {image}... done")

...

    if "version_code" in req:
        if version_code != req.get("version_code"):
            pm_image.remove(force=True)          <<< Remove it before reporting error, or maybe 'reload'???
            report_error(

Refs: https://podman-py.readthedocs.io/en/stable/podman.domain.images_manager.html#podman.domain.images_manager.ImagesManager.pull https://podman-py.readthedocs.io/en/stable/podman.domain.images.html#podman.domain.images.Image.remove

efahl commented 2 weeks ago

@aparcar Does my analysis above make any sense? I just tried again, and the metadata is now updated to 26629, but ASU's imagebuilder remains back on 26608 as above...

Error: Received incorrect version r26608-b2a84c86e3 (requested r26629-7a9b1f7f25)

lorand-horvath commented 2 weeks ago

The strangest thing just occurred: I've requested a custom build with ASU (luci-app-attendedsysupgrade), it said it had a newer build available, namely 26620. I requested it, but it came back with the old version stuck at 26593 (as I expected due to this bug). I proceeded to upgrade to this so-called 26593 version, at least that's what ASU said it was.

After it flashed and rebooted, I check the version in LuCI and low and behold, it has actually flashed the latest version 26620. What the heck is going on?

Note: ath79/generic unit WR1043ND v3

efahl commented 2 weeks ago

luci-app-attendedsysupgrade

There's your culprit in this case. Everything else (auc, owut, firmware selector) specifies the version_code in the build blob, which the ASU builder verifies request and imagebuilder as being the same (https://github.com/openwrt/asu/blob/main/asu/build.py#L72). If you leave it out, then you can sometimes get a build to go through, even when the metadata is different than the actual imagebuilder.

I experimented with an option in owut to turn off the version code spec, but decided it was more robust to keep it and to try to fix up the builds, because half (or more) of the time it didn't work and just wasted build cycles (got a different error later, like "package version mismatch" or some such thing).

efahl commented 2 weeks ago

Right now (2024-06-15 18:40Z) the ASU server is ahead of the downloads site:

snapshots x86/64 generic
  r26634-7d9b9762c9  ASU Server
  r26633-22f92cce22  Build Server

But the ASU builder is still stuck at 26608

Error: Received incorrect version r26608-b2a84c86e3 (requested r26633-22f92cce22)

My interpretation is that the web hooks are not flushing the old imagebuilders (and are firing off at the start of the build, not its completion???).

lorand-horvath commented 2 weeks ago

WNDRMAC v2 (ath79/generic) - I tried to build a new custom firmware with the firmware-selector, it said 26620 is available, but after build it still came back with the old 26593 from 5 days ago.

Then I checked with ASU (luci-app-attendedsysupgrade) and it said 26620 was available, I requested it, came back with 26593, I even downloaded it from the ASU interface to my PC and I confirm it's 26593 indeed. Then I proceeded to ask ASU to actually flash it and, as you might have guessed, it's now on 26620. Very strange.

This is a mess.

Also, since ath79 was bumped from kernel 6.1 to 6.6 yesterday, it doesn't even build at all anymore.

pko76 commented 2 weeks ago

As far as I understand it is a similar problem and is "under construction" https://forum.openwrt.org/t/auc-sysupgrade-does-not-work-for-me/196782/39

dannil commented 2 weeks ago

There's your culprit in this case. Everything else (auc, owut, firmware selector) specifies the version_code in the build blob, which the ASU builder verifies request and imagebuilder as being the same (https://github.com/openwrt/asu/blob/main/asu/build.py#L72). If you leave it out, then you can sometimes get a build to go through, even when the metadata is different than the actual imagebuilder.

In the meantime before this problem get fixed, I've created https://github.com/openwrt/luci/pull/7163 to add the check to luci-app-attendedsysupgrade so it behaves in the same way as auc.

aparcar commented 2 weeks ago

My guess is that this issue got multiple layers. First the Docker containers would download from downloads.openwrt.org which is behind a CDN. Even though the cache are cleared on every fresh target upload, there may be a timing issue where Docker containers start downloading ImageBuilders while not yet fully cleared (a matter of a few seconds I think).

Second is that ASU caches successful builds for 7 days. The identify the "same" build request and allow this kind of caching, a bund of parameters are put into account, including version_code. Since the LuCI lacked that parameter, build requests from up to 7 days ago would be returned. Thanks to @dannil this issue should be fixed for LuCI.

Right now this isn't happening for the firmware selector, so please feel free to implement it there, too.

aparcar commented 2 weeks ago
        pm_image.remove(force=True)          <<< Remove it before reporting error, or maybe 'reload'???

Using podman.images.pull() update the container whenever something new is available, no need to remove it locally

aparcar commented 2 weeks ago

As you say in German, all good things are three: The GitHub token ran out over the weekend causing the webhook to never run any container updates.

efahl commented 2 weeks ago

The GitHub token ran out over the weekend causing the webhook to never run any container updates.

GAH! That explains a lot.

efahl commented 2 weeks ago

I'm still seeing the same issue. Ran two builds on different x86 machines, about a half hour apart.

Received incorrect version r26657-531b3f667c (requested r26670-5f9fb964c3)

ASU and downloads profiles match with the 26670 version, but the image is updated from previous, but now stuck back at 26657

Builds at https://sysupgrade.openwrt.org/api/v1/build/eaf666a98b390bda3f8896a8b4000ee0 https://sysupgrade.openwrt.org/api/v1/build/eecd821426ce6f223f079b925029be2c if that helps...

efahl commented 2 weeks ago

Ok, just tried again with the old 'add "#" to uci-defaults' trick for force a new hash, and it worked (i.e., got a valid image and installed it), so the pull is working as expected.

If I leave off the uci-defaults, I get the same 'incorrect version' as before, from the cached build that failed... Still seems like that remove might alleviate these "in-between" errors somewhat, at least for that specific error. (This is assuming pm_image.remove works like I'm assuming, which is completely untested.)

lorand-horvath commented 1 week ago

Explain this: Error: Received incorrect version r26674-29b98a0192 (requested r26620-6b03050192) I actually requested 26674, not 26620 - that's the current version installed.

efahl commented 1 week ago

requested 26674, not 26620

What upgrade client did you use that gave this?

This typically happens when the sysupgrade metadata lags behind the build, which is the issue that Paul's webhook implementation addressed.

lorand-horvath commented 1 week ago

requested 26674, not 26620

What upgrade client did you use that gave this?

Latest luci-app-attendedsysupgrade 24.168.60561~8860ca0

dannil commented 1 week ago

requested 26674, not 26620

What upgrade client did you use that gave this?

Latest luci-app-attendedsysupgrade 24.168.60561~8860ca0

We may have borked the check in LuCI ASU with my change in https://github.com/openwrt/luci/pull/7163, see https://github.com/openwrt/luci/pull/7163#issuecomment-2176678529. For now use auc or the firmware selector to upgrade, I have a fix coming as long as I get confirmation of my suspicions.

pko76 commented 1 week ago

Just for your info. I have just upgraded from 23.05-SNAPSHOT r23901 to 23.05-SNAPSHOT r23902-50148a40d2 using luci-i18n-attendedsysupgrade-pl (git-24.167.26541-c74a06c) or according luci-app-attendedsysupgrade (git-23.339.51123-138595a), and to my surprise, it worked fine. The detected "current" and "new offered" version were correct.

efahl commented 1 week ago

according luci-app-attendedsysupgrade (git-23.339.51123-138595a)

That would be as expected, as that version doesn't contain the version_code fix, so works most of the time. If no version_code is supplied, then neither LuCI ASU nor ASU build server do any checks and you get whatever the build server happens to have on hand.

pko76 commented 1 week ago

So, you mean I was just lucky this time? It wasn't working well for previous tries, yesterday and before. Is it a random process?

efahl commented 1 week ago

It's not so much luck, but rather LuCI ASU did the right thing like 99% of the time.

It all depends on the state of the ASU server's cached builds, and how they correspond to the profile metadata in those two files I linked in the opening comment (actually the ones for your specific platform).

If the version_code is not specified, as before @dannil's fix to LuCI ASU, then you get whatever happens to be the latest build, assuming there's nothing matching in the ASU server's cache, or the cached imagebuilder if there is.

This works almost all the time, because the number of upgrades people do for a specific platform isn't very big, but if someone did an upgrade in that grey period between when the imagebuilder is published and the metadata is updated to correspond, then things would be out of whack for anyone else using the same platform, up to the time the next build came out and the builders/metadata got updated.

The window used to be "too long" and sometimes would be days, causing lots of issues, and that's where aparcar's "webhooks" patch came from: it reduces the window to seconds instead of minutes, hours or sometimes "forever".

The missing version_code in LuCI ASU is also the reason that some people would try auc, get an error, then try LuCI ASU and it "would work". Well, it worked in that you got an upgrade, but it wasn't the latest, it was often the build before the latest for these specific cases.

pko76 commented 1 week ago

Thank you efahl. It justifies the cases, when I had to remove or add any packages to succeed with the LuCi upgrade. I will monitor the situation and let you know.

lorand-horvath commented 1 week ago

We may have borked the check in LuCI ASU with my change in openwrt/luci#7163, see openwrt/luci#7163 (comment). For now use auc or the firmware selector to upgrade, I have a fix coming as long as I get confirmation of my suspicions.

@dannil LuCI ASU is definitely broken now. I've just tried to use it and it errors out with:

Error building the firmware image
Server response: Error: Received incorrect version r26693-368441254e (requested r26679-70088a7e4c)

even though the requested version was 26693 (the new version), not 26679 (currently installed).

dannil commented 1 week ago

We may have borked the check in LuCI ASU with my change in openwrt/luci#7163, see openwrt/luci#7163 (comment). For now use auc or the firmware selector to upgrade, I have a fix coming as long as I get confirmation of my suspicions.

@dannil LuCI ASU is definitely broken now. I've just tried to use it and it errors out with:

Error building the firmware image
Server response: Error: Received incorrect version r26693-368441254e (requested r26679-70088a7e4c)

even though the requested version was 26693 (the new version), not 26679 (currently installed).

https://github.com/openwrt/luci/pull/7168. I will still try to figure out a way to solve the issue with LuCI ASU as well so the version_code parameter is sent to the API, however the solution became quite more involved since LuCI ASU allows you to sysupgrade to the currently installed version as well if you check the box for advanced mode, and specifying version_code to an older revision doesn't work on snapshots (the server errors out); for auc and the firmware selector this isn't a use case so we didn't hit it there.

efahl commented 1 week ago

specifying version_code to an older revision doesn't work on snapshots

Older snapshots isn't a use case for any upgrade tool. The only way to get a non-current snapshot is to build from source, since the snapshot imagebuilders and packages aren't archived, just deleted.

All of the following give me the proper value for upgrading to current for that version, just stick that in the request blob.revision_code and it should work:

https://sysupgrade.openwrt.org/api/v1/revision/22.03.6/x86/64 https://sysupgrade.openwrt.org/api/v1/revision/23.05.3/x86/64 https://sysupgrade.openwrt.org/api/v1/revision/23.05-SNAPSHOT/x86/64 https://sysupgrade.openwrt.org/api/v1/revision/SNAPSHOT/x86/64

(There are still some versions that have not updated properly since the webhooks commit, I've been watching 22.03-SNAPSHOT, which is stuck at 20311 on ASU server, but should be 20315 according to downloads, so avoid that in your testing.)