pop-os / upgrade

Utility for upgrading Pop!_OS and its recovery partition to new releases.
GNU General Public License v3.0
95 stars 29 forks source link

Resumable downloads & cancelability improvements #261

Closed mmstick closed 2 years ago

mmstick commented 2 years ago

Utilizes our async-fetcher crate for fetching the recovery ISO. You can test it by running

pop-upgrade recovery upgrade from-release

Then Ctrl + C somewhere in-between the download to cancel the process.

Running the same command will resume where it left off.

Refactors some of the daemon service to utilize and support tokio. Long term dbus and dbus-crossroads should be replaced with zbus, and the service rearranged to be more async-friendly.

jacobgkau commented 2 years ago

1d85ded failed to build on all platforms. Also fails to build locally.

error[E0107]: this struct takes 0 lifetime arguments but 1 lifetime argument was supplied
   --> src/main.rs:193:26
    |
193 | async fn main_(matches: &ArgMatches<'_>) -> anyhow::Result<()> {
    |                          ^^^^^^^^^^---- help: remove these generics
    |                          |
    |                          expected 0 lifetime arguments
    |
note: struct defined here, with 0 lifetime parameters
   --> /home/system7/upgrade/vendor/clap/src/parse/matches/arg_matches.rs:71:12
    |
71  | pub struct ArgMatches {
    |            ^^^^^^^^^^

error[E0581]: return type references lifetime `'_`, which is not constrained by the fn input types
   --> src/main.rs:193:45
    |
193 | async fn main_(matches: &ArgMatches<'_>) -> anyhow::Result<()> {
    |                                             ^^^^^^^^^^^^^^^^^^

Some errors have detailed explanations: E0107, E0581.
For more information about an error, try `rustc --explain E0107`.
warning: `pop-upgrade` (bin "pop-upgrade") generated 20 warnings
error: could not compile `pop-upgrade` due to 2 previous errors; 20 warnings emitted
make[2]: *** [Makefile:75: target/release/pop-upgrade] Error 101
make[2]: Leaving directory '/home/system7/upgrade'
make[1]: *** [debian/rules:18: override_dh_auto_build] Error 2
make[1]: Leaving directory '/home/system7/upgrade'
make: *** [debian/rules:7: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
mmstick commented 2 years ago

Should be fixed now.

jacobgkau commented 2 years ago

If I start the upgrade from a terminal, cancel it with Ctrl-C, then open the GUI and try to download it there, I get a process has been cancelled error:

Screenshot from 2022-02-07 12-22-16

It seems to work the second time I try after having cancelled, but not the first. (Also noticed while testing that the CLI progress incorrectly goes to 100%/fully fetched after cancelling partway through, as is visible in the screenshot.)

Resuming does seem to work, so far.

jacobgkau commented 2 years ago

On 5b0240f, both of the described issues are resolved. Resuming from cancelled downloads is still working.

While testing behavior when switching network connections (e.g. plugging in/unplugging Ethernet while also connected to WiFi), I saw that sometimes (after the ~second connection change and onwards) the reported progress is doubled when the download starts progressing again. In the GUI, this results in impossible progress of greater than 100% being reported:

Screenshot from 2022-02-07 13-30-55

In the CLI, the download seemed to stop when the incorrect progress number reached 100%, followed by errors in journalctl about the recovery files failing to extract:

Screenshot from 2022-02-07 13-34-26

Powering off and then attempting to access Recovery with the spacebar shows that it was rendered unbootable by that error.

After the errors (before powering off), I also had three of what seem to be the bad recovery image showing up in Nautilus:

Screenshot from 2022-02-07 13-35-10

mmstick commented 2 years ago

Automatic retries aren't resetting progress. Will fix.

mmstick commented 2 years ago

The progress should now get properly reset when a retry is being performed. The the result of the recovery fetch task wasn't being returned, which caused it to believe that the task was Ok.

jacobgkau commented 2 years ago

On a103309 with the CLI, the number fetched is now going back down to 0 when I cancel with Ctrl-C. That's less of a misrepresentation than jumping to 100%, but still inaccurate and could be annoying if someone's trying to troubleshoot.

jacobgkau commented 2 years ago

Progress reporting appears to be working correctly, or at least much better, now. The CLI keeps its current number when I cancel with Ctrl-C.

After canceling/resuming and switching network connections a few times, I eventually got a failed to download ISO error message after unplugging Ethernet:

Screenshot from 2022-02-08 09-56-33

I clicked Update again, and the download resumed from where it was (near the end), but once it finished, the checksum was invalid:

Screenshot from 2022-02-08 09-57-25

After that, when I click Update again, it seems to just try and verify what's already downloaded again (which fails again):

Screenshot from 2022-02-08 09-59-57

Screenshot from 2022-02-08 10-00-03

We might need to delete the iso if the checksum verification fails? (As for why it was corrupted in the first place, it seemed like the progress might have been jumping slightly after some network change retries, but the amount was so small that it could have also just been e.g. the last bit of progress from before the disconnection not being reported until it's reconnected, so I can't say for certain.)

jacobgkau commented 2 years ago

The first time I tried to update recovery on 0ce6eab, I got the "failed to verify" error again, but the second time, it started downloading from scratch, so it looks like it removes after failure to verify now. I guess it makes sense that we don't want to automatically retry indefinitely, so this behavior looks good.

When resuming after cancelling, I am sometimes seeing a large delay/hang when I try to resume the download:

Screenshot from 2022-02-08 11-07-45

When I tried to Ctrl-C from that the first time it happened, I continued seeing MessageCall log messages for a minute, although it did eventually exit:

Screenshot from 2022-02-08 11-08-19

On trying again, the upgrade resumed as I would expect; the next time I resumed, it hung again and didn't exit even after waiting many minutes after pressing Ctrl-C. Every time I pressed Ctrl-C, I saw another cancelling a process which is in progress log message, but it never actually exited.

The hang seems to always happen the first time I resume after switching network connection, e.g. to recreate:

  1. Start the download.
  2. Switch network connection (plug in/unplug Ethernet).
  3. Wait for the download to resume and start progressing again.
  4. Cancel manually with Ctrl-C.
  5. Try to resume by running the command again.
  6. Observe hang.

The one where it never exited was on WiFi, so I'm wondering if network connection/speed has something to do with it. The last one that I recreated on Ethernet exited immediately when I pressed Ctrl-C, with a different error message in the logs:

Screenshot from 2022-02-08 11-28-36

I had never seen a hang at this point until 0ce6eab, so it could be related to the version of async-fetcher that was changed with that commit.

mmstick commented 2 years ago

The change in that commit is that it flushes the file when canceled. Perhaps I should have it flush more often, or do the flush in the background.

jacobgkau commented 2 years ago

Still able to create a hang using the steps listed before on d225885.

mmstick commented 2 years ago

Testing out a fix but I think it's working. Will update if it does.

jacobgkau commented 2 years ago

The hang described in https://github.com/pop-os/upgrade/pull/261#issuecomment-1032933185 is no longer occurring! :tada:

On a0c6cd2, I am now unable to cancel with Ctrl-C in the terminal when downloading over Ethernet (seeing the progress go up by about 100 MiB/s). After attempting to cancel, the progress continues going up until I unplug the Ethernet cable, at which point it finally fails/cancels with recovery upgrade aborted: failed to download iso. This does not happen on WiFi; when I press Ctrl-C on WiFi, the download stops with that message after just a few seconds, without having to disconnect.

On WiFi, the progress still goes up at least one more time after I start the cancel operation. I assume the issue on Ethernet has something to do with the much higher speed (whatever mechanism is supposed to cancel not having time to run/complete.)

mmstick commented 2 years ago

I think you're right about the higher speed not giving time to yield to the next future. I'll add some yield calls and see if that helps.

mmstick commented 2 years ago

I'm on a decent 16 MiB/s Ethernet connection right now and can cancel as it's downloading, and while the network is disconnected.

jacobgkau commented 2 years ago

I am able to cancel on Ethernet again on 7437eec, but the freeze from https://github.com/pop-os/upgrade/pull/261#issuecomment-1032933185 is now happening again.

mmstick commented 2 years ago

I'll ping you when I see all of these instances resolved.

mmstick commented 2 years ago

Okay, I'm not seeing any hangs when canceling during network loss, when canceling after networking has been restored, or when resuming after either of those conditions.

jacobgkau commented 2 years ago

Okay, I'm not seeing any hangs when canceling during network loss, when canceling after networking has been restored, or when resuming after either of those conditions.

I know that wasn't a ping, but I'm still seeing the hang on fbdad21... after upgrading pop-upgrade, I rebooted, confirmed the pop-upgrade version, started download on WiFi, cancelled, resumed on WiFi, plugged in Ethernet, cancelled, and attempted to resume.

jacobgkau commented 2 years ago

In case it's useful, when it gets into this "hung" state, if I try to cancel with Ctrl-C, it still waits a while, but it does seem to exit eventually-- and much quicker on Ethernet than WiFi. I get the failed to find build messages as shown in the last screenshot of https://github.com/pop-os/upgrade/pull/261#issuecomment-1032933185 when it finally does exit after cancelling from a hang.

mmstick commented 2 years ago

That is a good clue

mmstick commented 2 years ago

@jacobgkau Added a 1 second timeout for a stalled connection for the code that checks for new releases. And changed the way that cancels are performed, with a method that's much quicker about canceling. The pop-upgrade client was also only checking for status updates every 3 seconds, so I've changed that to 0.5 seconds so it appears to hang for much less time.

jacobgkau commented 2 years ago

On a14f6a7, I get an inaccurate "refueled and ready to go" message on cancelling (with a missing line break):

Screenshot from 2022-02-10 10-00-03

mmstick commented 2 years ago

@jacobgkau Cancelations will now return errors and show them correctly in the client.

jacobgkau commented 2 years ago

Looks better now: Screenshot from 2022-02-10 11-42-18

CLI:

GUI:

I now had two problems near the end of testing:

mmstick commented 2 years ago

It is peculiar since I can't trigger it with async-fetcher's own utility with an interrupt. I suspect it's an issue somewhere in how it cleans up after a cancelation. I can continue looking at that area to see if there's a way to improve the shutdown of the future that is concatenating to that file.

mmstick commented 2 years ago

This change may help. Canceling the fetcher shouldn't abruptly cancel writes to the ISO that are mid-write. The task will complete the write that it is currently performing, and flush the file before exiting.

jacobgkau commented 2 years ago

Still got a corrupted iso with the same test procedure. Interestingly, (in addition to the earlier timeout after unplugging Ethernet after the window had been closed/opened), I got a timeout this time just before the download finished, even though I hadn't changed network connection immediately before this timeout: Screenshot from 2022-02-10 13-05-07

When I clicked Update again, it almost immediately started verifying the image, but the checksum didn't match: Screenshot from 2022-02-10 13-05-19

mmstick commented 2 years ago

Maybe there's a bug in the HTTP client I'm using with its own internal timeouts. I'll try disabling the low_speed_timeout configuration I've set.

jacobgkau commented 2 years ago

It seems like 1eb9567 has prevented the download from resuming after a network connection change-- starting on WiFi and plugging in Ethernet, the progress stops and doesn't start going up again.

mmstick commented 2 years ago

Network changes do seem to give libcurl grief. I've enabled async-fetcher's own timeout configuration with a 5s timeout on a read from the HTTP server that doesn't progress.

jacobgkau commented 2 years ago
Failing to compile:
``` Compiling async-fetcher v0.8.0 (https://github.com/pop-os/async-fetcher#c190e193) Compiling apt-cmd v0.4.0 (https://github.com/pop-os/apt-cmd/#e4484b30) Compiling hyper-tls v0.5.0 Compiling reqwest v0.11.9 warning: `std::sync::MutexGuard` held across a suspend point, but should not be --> src/daemon/mod.rs:703:17 | 703 | let mut lock = cr.lock().unwrap(); | ^^^^^^^^ ... 719 | self_upgrade(&packages).await; | ----------------------------- the value is held across this suspend point | = note: `#[warn(must_not_suspend)]` on by default note: holding a MutexGuard across suspend points can cause deadlocks, delays, and cause Futures to not implement `Send` --> src/daemon/mod.rs:703:17 | 703 | let mut lock = cr.lock().unwrap(); | ^^^^^^^^ help: consider using a block (`{ ... }`) to shrink the value's scope, ending before the suspend point --> src/daemon/mod.rs:703:17 | 703 | let mut lock = cr.lock().unwrap(); | ^^^^^^^^ error[E0308]: mismatched types --> src/release/mod.rs:197:53 | 197 | let (fetcher, mut events) = PackageFetcher::new(fetcher).concurrent(CONCURRENT_FETCHES).fetch( | ^^^^^^^ expected struct `Fetcher`, found struct `Arc` | = note: expected struct `Fetcher` found struct `Arc>` For more information about this error, try `rustc --explain E0308`. warning: `pop-upgrade` (lib) generated 1 warning error: could not compile `pop-upgrade` due to previous error; 1 warning emitted warning: build failed, waiting for other jobs to finish... ```
jacobgkau commented 2 years ago

Still able to recreate the timeout after closing/opening Settings during download and then plugging in Ethernet and unplugging it again, and the checksum was invalid again-- back to how things were in https://github.com/pop-os/upgrade/pull/261#issuecomment-1035442187.

When trying to recreate it a second time on the same boot (planning to skip the last set of network switches to avoid the timeout), I ended up having the progress in the GUI stall on WiFi after unplugging Ethernet; after closing and re-opening Settings, I had the green Upgrade button again, and clicking it made the button go away but didn't show the progress again: Screenshot from 2022-02-10 16-02-22

To try and narrow down the corruption issue, I've tried downloads in the CLI with only manual cancel/resumes and only automatic network reconnections (the latter gave me one failed to download ISO error immediately after unplugging Ethernet, which I had to manually resume from); both of those downloads completed and the checksum was valid.

Next, I rebooted, opened the GUI, and tried one with just network disconnects/reconnects. I was also able to recreate an immediate timeout after one of the network switches, and had another immediate timeout when the download was about to finish, but no checksum failure after clicking the Upgrade button again after both of those.

Rebooted again, started the upgrade in the GUI on Ethernet, unplugged & waited for resume, closed Settings, opened it again, plugged Ethernet back in & let it finish-- got an immediate timeout at the end of the download, but no checksum mismatch.

@mmstick Are you able to recreate the timeout at the end of the download with the below steps?

  1. Start upgrade in GUI on Ethernet.
  2. Unplug & wait for resume via WiFi.
  3. Close Settings, open it again.
  4. Plug Ethernet back in, wait for download to finish (observe immediate timeout, but finishes successfully after clicking Upgrade again.)

And/or are you able to recreate the checksum mismatch with the below steps?

  1. Start download in CLI on WiFi.
  2. Cancel with Ctrl-C, then resume.
  3. Plug in Ethernet, wait for progress to start going up again.
  4. Cancel with Ctrl-C, then resume.
  5. Unplug Ethernet, wait for progress to start going up again.
  6. Cancel with Ctrl-C, then resume.
  7. Cancel with Ctrl-C again.
  8. Open the GUI, click Update.
  9. Plug in Ethernet, wait for progress to start going up again.
  10. Unplug Ethernet, wait for progress to start going up again.
  11. Close/re-open Settings.
  12. Plug in Ethernet again, wait for download to finish (observe immediate timeout, then MD5 mismatch after clicking Upgrade again.)

Seems like there are at least two problems: "timeouts" sometimes occurring in far less than 5 seconds (sometimes right at the end of the download), and the longer sequence resulting in a corrupted download for whatever reason (plus whatever GUI quirk that was which cased a non-working Update button.)

mmstick commented 2 years ago

I've made a few changes that have resolved the issues.

So I'm able to dynamically enable and disable Ethernet and WiFi and cancel the download while still getting a successful checksum validation on completion.

jacobgkau commented 2 years ago

Recovery after network connection change is now usually much faster than before.

CLI:

GUI:

When following the above testing procedure, I re-opened the GUI window, plugged in Ethernet, waited for the download to mostly finish, and then unplugged Ethernet; the download stalled at that point.

Screenshot from 2022-02-11 10-06-50

After closing and re-opening Settings, I had the same issue as in https://github.com/pop-os/upgrade/pull/261#issuecomment-1035643199 where re-opening Settings showed the Upgrade button again, and clicking it didn't start/resume the download.

Screenshot from 2022-02-11 10-08-18 Screenshot from 2022-02-11 10-08-21

After manually restarting the pop-upgrade service, the download resumed in that GUI window without having to click anything else; however, the checksum was mismatched when it finished:

Screenshot from 2022-02-11 10-11-16

Maybe whatever caused that last stall-out/hang meant that the file didn't get a chance to be flushed.

mmstick commented 2 years ago

I think the GUI may be hanging on the release check since it's also using libcurl and there's no network connection detection code in pop-upgrade. I can import the functionality here to catch that.

jacobgkau commented 2 years ago

24f2329 hung when plugging Ethernet back in after re-opening the GUI (one step before where it hung in https://github.com/pop-os/upgrade/pull/261#issuecomment-1036429756, but similar in that it was a network reconnect after re-opening); still had the issue with the Update button disappearing without doing anything, still had a checksum mismatch.

mmstick commented 2 years ago

I have something that works to resolve the GTK client issues. Which required gracefully handling service termination from systemd and altering the order that DBus events and DBus status are set to give time for the client to receive the last results before the service is terminated. As well as switching to a different tool for async cancelation that can both signal a cancelation and wait for things to finish canceling that have requested to delay the shutdown.

mmstick commented 2 years ago

Found issue with package updates. Calling pop-upgrade release update, canceling, then redoing it blocks the daemon.

mmstick commented 2 years ago

Issue resolved

jacobgkau commented 2 years ago

Cancelling on the CLI is back to saying failed to download ISO instead of process has been cancelled, is that expected? It was more clear when it said "cancelled."

Screenshot from 2022-02-16 16-05-09

mmstick commented 2 years ago

Error message is now

Recovery upgrade status: recovery upgrade aborted: fetching from https://pop-iso.sfo2.cdn.digitaloceanspaces.com/21.10/amd64/nvidia/6/pop-os_21.10_amd64_nvidia_6.iso failed: task was canceled
jacobgkau commented 2 years ago

CLI:

GUI:

After the above testing procedure (same one as the latter described in https://github.com/pop-os/upgrade/pull/261#issuecomment-1035643199), I still get a bad checksum:

Screenshot from 2022-02-17 10-03-34

jacobgkau commented 2 years ago

In the CLI, it also looks like the final download progress doesn't get reported before the output moves on to calculating the checksum (I was confused why it was checksumming when the download wasn't finished yet):

Screenshot from 2022-02-17 10-09-34

(Ignore the ^[[15~^[[17~ in the output, I just hit the wrong key while trying to printscreen.)

jacobgkau commented 2 years ago
d970e96 is failing to build: ``` Downloaded 336 crates (54.9 MB) in 2.97s (largest was `windows` at 12.7 MB) error: failed to sync Caused by: found duplicate version of package `async-fetcher v0.8.0` vendored from two sources: source 1: registry `crates-io` source 2: https://github.com/pop-os/async-fetcher#1d5adf54 echo 'directory = "vendor"' >> .cargo/config tar pcf vendor.tar vendor rm -rf vendor make[2]: Leaving directory '/home/system76/upgrade' make[1]: Leaving directory '/home/system76/upgrade' dh_clean debian/rules build dh build --with=systemd dh_update_autotools_config dh_autoreconf dh_auto_configure debian/rules override_dh_auto_build make[1]: Entering directory '/home/system76/upgrade' env CARGO_HOME="$(pwd)/target/cargo" make prefix=/usr make[2]: Entering directory '/home/system76/upgrade' rm -rf vendor; tar pxf vendor.tar cargo build "--release" "--frozen" "--offline" error: failed to resolve patches for `https://github.com/rust-lang/crates.io-index` Caused by: failed to load source for dependency `sysfs-class` Caused by: Unable to update https://github.com/pop-os/sysfs-class#ab63e7f6 Caused by: can't checkout from 'https://github.com/pop-os/sysfs-class': you are in the offline mode (--offline) make[2]: *** [Makefile:75: target/release/pop-upgrade] Error 101 make[2]: Leaving directory '/home/system76/upgrade' make[1]: *** [debian/rules:18: override_dh_auto_build] Error 2 make[1]: Leaving directory '/home/system76/upgrade' make: *** [debian/rules:7: build] Error 2 dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2 ```
jacobgkau commented 2 years ago

CLI:

GUI:

fcec0f2 hung when attempting to switch network connection for the second time, from Ethernet to WiFi. Once it hung, Ctrl-C did not stop the process either.

Screenshot from 2022-02-21 13-23-09

After killing pop-upgrade and restarting it, clearing cache, and trying again, I was able to get farther, but hung the first time I tried switching from Ethernet to WiFi in the GUI:

CLI:

GUI:

Screenshot from 2022-02-21 13-30-23

Closing and re-opening the GUI at this point gave me the green Update button again, and upon clicking it, I get no progress again:

Screenshot from 2022-02-21 13-30-46

mmstick commented 2 years ago

That's a big regression for such a small change. Guess I'll revert it.

mmstick commented 2 years ago

Maybe this'll help. Added a shutdown to the concatenator task

jacobgkau commented 2 years ago

CLI:

GUI:

269c98c fixed the hangs, but still getting the checksum mismatch.

Screenshot from 2022-02-21 14-40-38

mmstick commented 2 years ago

I have only one idea for the mismatch. I'm thinking it's possible that a redownload begins to happen while the last fetch is still in the process of handling a shutdown. So I'll figure out a way to prohibit a resume until all spawned tasks are finished.