arvo: wipe pending state when changing source

zalberico commented 1 year ago

Be pointed at a pre-release moon for OTAs
Get an OTA that requires a kelvin change
See the UI in landscape about applications blocking upgrade
Decide you want to switch to regular release
Update your source for %base with |ota ~samzod
Observe that you still have the new version pending and the UI still shows the applications blocking upgrade warning
This is true for any desks with pending state (not just base)

If a user switches the source of their desks we should wipe any pending installs.

Today we don't do this, and it causes confusion.

zalberico commented 1 year ago

This is an important one to fix because it enables rolling back bad app releases. CC: @jalehman, @belisarius222, @philipcmonk

This happened with the OTA today:

Gora and Citadel signaled support for 415, but didn't confirm that it worked on 415 (we should be more explicit that testing is required going forward and not optional).
They shipped 415 versions of their apps that went out and were pending on user's ships
We updated base and the binary for release today
As part of that users attempted to install to update, but it crashed because of gora and citadel
Gora and Citadel can't quickly revert 415 support because we don't clear pending state (so their revert doesn't do anything to stop this).

belisarius222 commented 1 year ago

Huh, yeah I guess |uninstall and then |install doesn't cancel the original Clay request, so if it goes through first, it'll run before the later one. Being able to cancel a merge would certainly be better than not being able to.

I'm not sure how much that feature would help to address the issue in question, though, since a lot of people will finish downloading and installing that first commit before the publisher has a chance to revert it or the subscribers have a chance to cancel it. As Urbit's speed improves, that will get closer to instant, and then this feature would help even less.

Maybe I'm missing something here, but this feels to me more like a problem that we won't ever be able to eradicate. If an app pushes out a bad update, fundamentally there's not much the OS can do to prevent that -- the best it can do in the general case is to reduce the fallout from having installed the bad app.

So another way of looking at this issue is to examine what problems the broken Gora and Citadel apps caused, to see if those could have been mitigated. Was the problem just that the update crashed? Aborting the update seems like ideal behavior to me, not a problem -- it's better not to enter a bad state, but to roll back instead. I thought I saw the original commenter say there was a crash loop though, which sounds like a bug to me. An update should either succeed or fail, not spin (an infinite loop in a single event could happen, but a ctrl-c or event timeout should reliably break the loop).

Improving error reporting might also be a worthwhile mitigation approach. If the system, say, pushed a notification to the user that read "%gora crashed during update", with buttons "abort update" and "suspend %gora and retry update", with an expandable stack trace, that sounds nice to me. macOS has a "some apps prevented shutdown" dialog that's a lot like this.

My understanding is that if the user had a pending update to gora for a future kelvin, and then the gora devs push out a fix for that kelvin before the user's kernel updates, Clay will replace its pending commit on the gora desk with the later one. This seems like the right behavior to me, since it ensures that if the gora devs push out a fix before the network-wide kelvin release, they can rest assured that when the release does go out, their users will try to update to the fixed code rather than the old broken code.

On Tue, Jan 31 2023 at 3:44 PM, Zach Alberico < @.*** > wrote:

This is an important one to fix because it enables rolling back bad app releases. CC: @jalehman ( https://github.com/jalehman ) , @belisarius222 ( https://github.com/belisarius222 ) , @philipcmonk ( https://github.com/philipcmonk )

This happened with the OTA today:

Gora and Citadel signaled support for 415, but didn't confirm that it worked on 415 (we should be more explicit that testing is required going forward and not optional).

They shipped 415 versions of their apps that went out and were pending on user's ships

We updated base and the binary for release today

As part of that users attempted to install to update, but it crashed because of gora and citadel

Gora and Citadel can't quickly revert 415 support because we don't clear pending state (so their revert doesn't do anything to stop this).

— Reply to this email directly, view it on GitHub ( https://github.com/urbit/urbit/issues/6209#issuecomment-1411109599 ) , or unsubscribe ( https://github.com/notifications/unsubscribe-auth/AAGVR5KZFGL7RNZNYBQTQL3WVGBSLANCNFSM6AAAAAAT7KYQLI ). You are receiving this because you were mentioned. Message ID: <urbit/urbit/issues/6209/1411109599 @ github. com>

zalberico commented 1 year ago

It's primarily an issue with breaking changes and new Kelvin versions.

Most users are blocked on the new runtime, but ignoring that detail (since one day we want that to not require lots of manual effort) - if gora ships a bad version pre-release then I want them to be able to ship a fix pre-release that replaces the bad one.

Currently because of this issue, that's not possible and that's a problem.

Edit:

My understanding is that if the user had a pending update to gora for a future kelvin, and then the gora devs push out a fix for that kelvin before the user's kernel updates, Clay will replace its pending commit on the gora desk with the later one.

If that's true then maybe I'm confused about the gora case.

I think there's still a problem though, if you want to switch from pre-release to release, you should be able to but currently you can't because the pending pre-release version doesn't clear. That was my original issue when I wrote this. It sounds like the gora case isn't a case of that though.

zalberico commented 1 year ago

For clarity the gora issue seen here was:

User has pending 415 update for gora and base
User installs 1.18 runtime
User runs |bump or suspends and updates via UI
User observes base upgrade fail because of gora and retry in a loop

I'm not sure exactly what the gora issue was, but @Fang- knows. Automatically suspending if possible would be good, but I'm not sure that is possible given the nature of the issue? I'm not sure why base didn't upgrade first independently of gora, but probably I don't understand the specifics of the gora issue enough to reason about, I'll defer to @Fang- on those.

The original issue was you can't switch from a pre-release source to a non-pre-release source and have a pending update from the pre-release source get wiped.

zalberico commented 1 year ago

Another user just hit this too, but they also had citadel so I'm not sure if that was why (I had them suspend both and then it worked).

It might be that gora didn't get the update to it's pending update either (or it was citadel), still - there's an issue with this.

philipcmonk commented 1 year ago

A few things going on here:

The retry loop is likely not infinite. I forget the details, but you often end up with several "wick" events (maybe one per desk?), so it'll try several times to upgrade. When one succeeds, the others quickly no-op, but if they fail they just try again. This is a bug.
I think we don't want to suspend a running app unless the user has affirmatively requested it, either by doing it manually or clicking "suspend apps and update".
"Cancelling a merge" is not the right way to think about it, it's just a commit-in-waiting. Everything's done except applying the commit, and so it's technically easy to just delete the commit-in-waiting whenever we want.
I do think it's a good idea to clear it when you change sources; otherwise, if your new source doesn't have a commit for that future kelvin, and then you try to upgrade to it, you'll be surprised that you use a commit from your old source.
There's a debug command to clear a pending update, which could be useful for some people: |pass [%c %stir %stay %desk ~] (replace %desk)
This is true:

My understanding is that if the user had a pending update to gora for a future kelvin, and then the gora devs push out a fix for that kelvin before the user's kernel updates, Clay will replace its pending commit on the gora desk with the later one.

zalberico commented 1 year ago

I created #6285 which is a related, but separate issue.

zalberico commented 1 year ago

Something is going on here that doesn't make sense: https://pastebin.com/1aENHTaE

Why is there a pending update for 416 when 415 is installed? Is this a visualization bug (pending doesn't show multi-kernal versions properly or something?)

%groups
  /sys/kelvin:      [%zuse 417] [%zuse 416] [%zuse 415]
  base hash:        0vs.00vi9.h6fga.bf0tg.u1ifm.pqkb9.uihiu.h1vg2.td5qn.cfcak.n5mvl
  %cz hash:         0vs.00vi9.h6fga.bf0tg.u1ifm.pqkb9.uihiu.h1vg2.td5qn.cfcak.n5mvl
  app status:       running
  force on:         ~
  force off:        ~
  publishing ship:  [~ ~sogryp-dister-dozzod-dozzod]
  updates:          remote
  source ship:      ~sogryp-dister-dozzod-dozzod
  source desk:      %groups
  source aeon:      11
  kids desk:        ~
  pending updates:  ~[[%zuse 416]]

urbit / urbit

arvo: wipe pending state when changing source #6209