urbit / urbit

An operating function
https://urbit.org
MIT License
3.43k stars 359 forks source link

arvo: wipe pending state when changing source #6209

Open zalberico opened 1 year ago

zalberico commented 1 year ago

If a user switches the source of their desks we should wipe any pending installs.

Today we don't do this, and it causes confusion.

zalberico commented 1 year ago

This is an important one to fix because it enables rolling back bad app releases. CC: @jalehman, @belisarius222, @philipcmonk

This happened with the OTA today:

belisarius222 commented 1 year ago

Huh, yeah I guess |uninstall and then |install doesn't cancel the original Clay request, so if it goes through first, it'll run before the later one. Being able to cancel a merge would certainly be better than not being able to.

I'm not sure how much that feature would help to address the issue in question, though, since a lot of people will finish downloading and installing that first commit before the publisher has a chance to revert it or the subscribers have a chance to cancel it. As Urbit's speed improves, that will get closer to instant, and then this feature would help even less.

Maybe I'm missing something here, but this feels to me more like a problem that we won't ever be able to eradicate. If an app pushes out a bad update, fundamentally there's not much the OS can do to prevent that -- the best it can do in the general case is to reduce the fallout from having installed the bad app.

So another way of looking at this issue is to examine what problems the broken Gora and Citadel apps caused, to see if those could have been mitigated. Was the problem just that the update crashed? Aborting the update seems like ideal behavior to me, not a problem -- it's better not to enter a bad state, but to roll back instead. I thought I saw the original commenter say there was a crash loop though, which sounds like a bug to me. An update should either succeed or fail, not spin (an infinite loop in a single event could happen, but a ctrl-c or event timeout should reliably break the loop).

Improving error reporting might also be a worthwhile mitigation approach. If the system, say, pushed a notification to the user that read "%gora crashed during update", with buttons "abort update" and "suspend %gora and retry update", with an expandable stack trace, that sounds nice to me. macOS has a "some apps prevented shutdown" dialog that's a lot like this.

My understanding is that if the user had a pending update to gora for a future kelvin, and then the gora devs push out a fix for that kelvin before the user's kernel updates, Clay will replace its pending commit on the gora desk with the later one. This seems like the right behavior to me, since it ensures that if the gora devs push out a fix before the network-wide kelvin release, they can rest assured that when the release does go out, their users will try to update to the fixed code rather than the old broken code.

On Tue, Jan 31 2023 at 3:44 PM, Zach Alberico < @.*** > wrote:

This is an important one to fix because it enables rolling back bad app releases. CC: @jalehman ( https://github.com/jalehman ) , @belisarius222 ( https://github.com/belisarius222 ) , @philipcmonk ( https://github.com/philipcmonk )

This happened with the OTA today:

  • Gora and Citadel signaled support for 415, but didn't confirm that it worked on 415 (we should be more explicit that testing is required going forward and not optional).
  • They shipped 415 versions of their apps that went out and were pending on user's ships
  • We updated base and the binary for release today
  • As part of that users attempted to install to update, but it crashed because of gora and citadel
  • Gora and Citadel can't quickly revert 415 support because we don't clear pending state (so their revert doesn't do anything to stop this).

— Reply to this email directly, view it on GitHub ( https://github.com/urbit/urbit/issues/6209#issuecomment-1411109599 ) , or unsubscribe ( https://github.com/notifications/unsubscribe-auth/AAGVR5KZFGL7RNZNYBQTQL3WVGBSLANCNFSM6AAAAAAT7KYQLI ). You are receiving this because you were mentioned. Message ID: <urbit/urbit/issues/6209/1411109599 @ github. com>

zalberico commented 1 year ago

It's primarily an issue with breaking changes and new Kelvin versions.

Most users are blocked on the new runtime, but ignoring that detail (since one day we want that to not require lots of manual effort) - if gora ships a bad version pre-release then I want them to be able to ship a fix pre-release that replaces the bad one.

Currently because of this issue, that's not possible and that's a problem.

Edit:

My understanding is that if the user had a pending update to gora for a future kelvin, and then the gora devs push out a fix for that kelvin before the user's kernel updates, Clay will replace its pending commit on the gora desk with the later one.

If that's true then maybe I'm confused about the gora case.

I think there's still a problem though, if you want to switch from pre-release to release, you should be able to but currently you can't because the pending pre-release version doesn't clear. That was my original issue when I wrote this. It sounds like the gora case isn't a case of that though.

zalberico commented 1 year ago

For clarity the gora issue seen here was:

I'm not sure exactly what the gora issue was, but @Fang- knows. Automatically suspending if possible would be good, but I'm not sure that is possible given the nature of the issue? I'm not sure why base didn't upgrade first independently of gora, but probably I don't understand the specifics of the gora issue enough to reason about, I'll defer to @Fang- on those.

The original issue was you can't switch from a pre-release source to a non-pre-release source and have a pending update from the pre-release source get wiped.

zalberico commented 1 year ago

Another user just hit this too, but they also had citadel so I'm not sure if that was why (I had them suspend both and then it worked).

It might be that gora didn't get the update to it's pending update either (or it was citadel), still - there's an issue with this.

philipcmonk commented 1 year ago

A few things going on here:

zalberico commented 1 year ago

I created #6285 which is a related, but separate issue.

zalberico commented 1 year ago

Something is going on here that doesn't make sense: https://pastebin.com/1aENHTaE

Why is there a pending update for 416 when 415 is installed? Is this a visualization bug (pending doesn't show multi-kernal versions properly or something?)

%groups
  /sys/kelvin:      [%zuse 417] [%zuse 416] [%zuse 415]
  base hash:        0vs.00vi9.h6fga.bf0tg.u1ifm.pqkb9.uihiu.h1vg2.td5qn.cfcak.n5mvl
  %cz hash:         0vs.00vi9.h6fga.bf0tg.u1ifm.pqkb9.uihiu.h1vg2.td5qn.cfcak.n5mvl
  app status:       running
  force on:         ~
  force off:        ~
  publishing ship:  [~ ~sogryp-dister-dozzod-dozzod]
  updates:          remote
  source ship:      ~sogryp-dister-dozzod-dozzod
  source desk:      %groups
  source aeon:      11
  kids desk:        ~
  pending updates:  ~[[%zuse 416]]