mmstick closed this 2 years ago
TimedOut errors may have been the cause. It was giving me issues where network changes caused a hang on a retry to a dead interface.
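The pattern described here is bounding each fetch attempt with its own timeout, so a retry against an interface that has gone away fails quickly instead of hanging. A minimal sketch of that pattern, assuming tokio and reqwest purely for illustration (neither is confirmed by this comment):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Illustrative only: each attempt is bounded by a hard timeout so that a retry
// against a dead interface errors out quickly instead of hanging forever.
async fn fetch_with_timeout(url: &str) -> Result<Vec<u8>, String> {
    const ATTEMPT_TIMEOUT: Duration = Duration::from_secs(15);
    const MAX_ATTEMPTS: u32 = 3;

    for attempt in 1..=MAX_ATTEMPTS {
        // One hypothetical attempt: a plain GET of the body.
        let try_fetch = async {
            let response = reqwest::get(url).await.map_err(|e| e.to_string())?;
            let bytes = response.bytes().await.map_err(|e| e.to_string())?;
            Ok::<Vec<u8>, String>(bytes.to_vec())
        };

        match timeout(ATTEMPT_TIMEOUT, try_fetch).await {
            Ok(Ok(bytes)) => return Ok(bytes),
            Ok(Err(err)) => eprintln!("attempt {attempt} failed: {err}"),
            // The attempt hung (e.g. the interface went away) and was cut off.
            Err(_elapsed) => eprintln!("attempt {attempt} timed out"),
        }
    }

    Err(format!("all {MAX_ATTEMPTS} attempts failed for {url}"))
}
```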
I've had two successful recovery upgrades with multiple disconnects and reconnects between WiFi and Ethernet, and combinations of connecting to both simultaneously and unplugging the other.
My testing still got a checksum mismatch:
Steps were the same ones I've been using:
Still getting successful recovery upgrades with these exact steps, but I'll keep trying.
Still getting a checksum error on cb4cd49. I will try some different hardware and see if I can narrow it down to anything. Let me know if there's any way I can get more info about what might be happening. Here's the log from my most recent test (using Dropbox because GitHub won't let me attach right now): https://www.dropbox.com/s/vi5gcp4888r17l0/upgrade.txt?dl=0
How fast is your download? I guess it's possible for a race condition if the network is fast enough to download a part before a network-interrupted fetch could shut down.
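The race being guessed at would be a new fetch starting to write parts before the interrupted one has fully shut down. One way to rule that out is to make any restart await the old task first; the sketch below uses tokio and tokio-util, and the names are hypothetical rather than taken from the code:

```rust
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

// Hypothetical guard: a restart must cancel the previous fetch task *and* await
// it, so its in-flight part writes are finished or abandoned before any new
// request can write to the same file.
struct FetchGuard {
    cancel: CancellationToken,
    task: JoinHandle<()>,
}

impl FetchGuard {
    async fn shut_down(self) {
        self.cancel.cancel();
        // Without this await, a fast enough network could let the next fetch
        // start writing while the old one is still flushing -- the suspected race.
        let _ = self.task.await;
    }
}
```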
Logs are fine. Behaving the same as my systems, besides the checksum mismatch.
> How fast is your download? I guess it's possible for a race condition if the network is fast enough to download a part before a network-interrupted fetch could shut down.
I'm seeing ~80-100 MB/s on Ethernet, and ~5-8 MB/s on WiFi.
I've been testing on the darp7 with a USB-C Ethernet adapter, and confirmed the same issue occurs with the built-in Ethernet jack.
I hope this works. I added some logs so we can compare when a fetch has flushed against when a subsequent fetch begins, and pruned some more of the shutdown canceling. Testing it now.
Two successful recovery upgrades with the latest update, although my Ethernet maxes out at 16 MB/s. There's some very verbose logging from async-fetcher showing the order in which parts are fetched, concatenated, and flushed.
Here is a log from ec62690 ending with the checksum mismatch: versbose-upgrade.txt
According to the logs, there are no synchronization issues anywhere. Everything is systematically performing all of its concurrent operations in the same order. The recovery ISO completes a successful part concatenation and flush on each cancelation, timeout, and network connection change before any subsequent request is started. Looking at other areas then...
I tried the same darp7 at the factory (significantly faster WiFi than mine, but slightly slower Ethernet), and it still had the checksum error.
I also tried with a gaze16 at the factory, and it did not have the checksum mismatch issue. Both machines had NVMe SSDs (the gaze16 had a Western Digital drive, while the darp7 has a Samsung drive.) I'll try to see if I can identify the difference on the darp7 causing it to fail more easily (hoping it's not just hardware.)
I'll push a small update with some potential fixes in a minute.
I can see that 1f9053a used larger part sizes (resumes tended to start farther back from the current position than they did before), but I still got the checksum error on darp7. Log in case it's useful: 1f9053a-darp7.txt
I plan to swap the darp7's SSD out with another system to check if it's the darp7's other hardware or the SSD/software that's causing the difference between it and other systems, then try to narrow the software/configuration down from there.
We could switch to a single connection download for the ISO and see if that works.
Ok, I recreated the checksum issue on the first try on lemp10 with a fresh 21.10 installation on a different SSD. lemp10 and darp7 are very similar internally, and the SSD was also a Samsung (darp7 has 980 Pro, lemp10 has 970 Evo Plus), but that at least rules out any software config on the darp7. (The darp7 install got its primary keyboard layout for cryptsetup/GDM switched to a different language a while back, so I wanted to confirm that wasn't somehow causing this.)
That's a bummer. Switching to one connection in case that improves stability
I have to make some changes, since single-connection downloads haven't been tested yet.
Had two successful recovery upgrades with a single connection now. There are fewer moving parts here, so it should have a higher chance of success in the face of interruptions and cancelations.
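For reference, the shape of a single-connection resume is small: check how much of the ISO is already on disk, then ask the server for the rest with one Range request and append it. The sketch below uses reqwest's blocking API purely for illustration; it is not necessarily the client in use at this commit:

```rust
use std::fs::OpenOptions;
use std::io::{copy, Seek, SeekFrom};

// Sketch only: one request, one writer, resuming from whatever is already on disk.
fn resume_download(url: &str, dest: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut file = OpenOptions::new().create(true).read(true).write(true).open(dest)?;
    let already = file.seek(SeekFrom::End(0))?; // bytes we already have

    let mut response = reqwest::blocking::Client::new()
        .get(url)
        .header(reqwest::header::RANGE, format!("bytes={already}-"))
        .send()?
        .error_for_status()?;

    // A robust version would confirm a 206 Partial Content reply before appending.
    copy(&mut response, &mut file)?; // append the remainder; no part ordering to get wrong
    file.sync_all()?;
    Ok(())
}
```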
Still got the checksum error on darp7 with 253dcb5. 253dcb5-darp7.txt
Perhaps we need a sync to disk before we get the length of the file on each retry.
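Roughly the idea, as a sketch (the real write path will differ): make sure everything written so far has actually reached the disk before the file length is used to build the next resume request.

```rust
use std::fs::File;
use std::io::Write;

// Illustrative only: sync before measuring, so the length used for the next
// resume request reflects what has actually landed on disk.
fn flush_then_measure(file: &mut File, part: &[u8]) -> std::io::Result<u64> {
    file.write_all(part)?;
    file.sync_all()?; // fsync: push the page cache out to the device
    Ok(file.metadata()?.len())
}
```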
No change: 7f10aee-darp7.txt
I'm out of ideas for now
Just spitballing here, but would logging the checksum after each concat, and/or logging the checksum of each part downloaded, possibly help show where the corruption is happening? Assuming the chunks being downloaded are a constant size, the checksum at a given progress point should match between systems where this is and isn't happening.
Will try that
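A sketch of what per-part checksum logging could look like; the md5 and log crates are assumptions here, and any digest would do. The point is one log line per part, emitted before concatenation, so two runs can be diffed chunk by chunk:

```rust
// Sketch only: names and crates are illustrative, not taken from pop-upgrade.
// The format roughly mirrors the CONCAT lines that show up in the later logs.
fn log_part_checksum(dest: &str, part_index: usize, part: &[u8]) {
    let digest = md5::compute(part);
    log::debug!("CONCAT {}:{} {:x} {} bytes", dest, part_index, digest, part.len());
}
```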
Here are logs with checksums from b6d88a1: b6d88a1-darp7.txt
There are some issues with the transition at the moment. I'll let you know when it's ready again.
Okay. The transition from using the isahc HTTP client to reqwest is ready for testing.
Still got the checksum error on 154dd2f: 154dd2f-darp7.txt
Comparing this log with the last one, I see that the concat checksums matched up to 1D88AAC5E6D859B7E5A1091728A7C34D, then differed after the fourth (last) manual cancel on the earlier run. After that, the next concat from the earlier run was 7969B161E66449732D6A10C0BD7B2DDD, which is not present in the later run. That last cancel would have been resumed from the GUI (with everything before then being performed on the CLI), which is the only unique thing about it I can think of compared with the earlier cancels.
Have you noticed any cancels where the resumed byte range had a gap or overlap?
get_many.rs:54: GET "/var/cache/pop-upgrade/recovery.iso" bytes=3137339392-3138715647
Seems to have resumed correctly
GET "/var/cache/pop-upgrade/recovery.iso" bytes=872415232-889192447
lib.rs:291: inner_request fetch result is Err(Canceled)
GET "/var/cache/pop-upgrade/recovery.iso" bytes=872415232-889192447
8c20466 still getting wrong checksum: 8c20466-darp7.txt
Failure on 052f3e4: 052f3e4-darp7.txt
Comparing 052f3e4 to 154dd2f, the first 52 chunks (up to ED8A84F04F1806F4CB9C6D7E15C62EEB) had the same checksum, then 052f3e4 got B33EFDB19456767A71C58CA24CE6B93E, which was not present in 154dd2f (so is presumably incorrect.) This one happened after the second-to-last cancel/resume, which would not have been after switching to the GUI, but just Ctrl-C and then running the command again in the terminal.
Interestingly, the four chunks 052f3e4 got during that period all mismatched the other log, but then after resuming from the GUI, it got A2BC4F171FB20360F3301B6CACF78A80, which is present in the other log again. From there on, the rest of the chunks have consistent checksums with the earlier attempt, but the final checksum was mismatched. (Not sure if logging the checksum of the actual final file before or after concats would add any additional value over just checksumming the parts themselves.)
Here's a log of my own from another successful upgrade. I've increased the amount of data logged.
Failure log from 5499263: 5499263-darp7.txt
Our checksums on the concat lines now differ as early as the first concat:
Mar 02 16:37:43 pop-os pop-upgrade[933]: [DEBUG] get_many.rs:55: RANGE /var/cache/pop-upgrade/recovery.iso:0 - bytes=0-16777215
Mar 02 16:37:43 pop-os pop-upgrade[933]: [DEBUG] get_many.rs:55: RANGE /var/cache/pop-upgrade/recovery.iso:1 - bytes=16777216-33554431
Mar 02 16:37:51 pop-os pop-upgrade[933]: [DEBUG] concatenator.rs:44: CONCAT /var/cache/pop-upgrade/recovery.iso:0 52D54746144D366C200EA41BBF48A724 16777216 bytes
[DEBUG] get_many.rs:55: RANGE /var/cache/pop-upgrade/recovery.iso:0 - bytes=33554432-50331647
[DEBUG] get_many.rs:55: RANGE /var/cache/pop-upgrade/recovery.iso:1 - bytes=50331648-67108863
[DEBUG] concatenator.rs:44: CONCAT /var/cache/pop-upgrade/recovery.iso:0 BE7163C1E37763F6084BF76015ED2958 16777216 bytes
Based on the ranges shown, it looks like yours was already partway done before the log started? I don't see the first few checksums from your log anywhere in mine, although the last few do match (which suggests to me that it's still the individual pieces' checksums being logged.)
Yeah the first few parts were already fetched
Yeah, this is strange. The fetched ranges from the darp7 are wrong for many of the fetches. At random points in between, the checksums differ, and sometimes they match.
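For what it's worth, with the fixed 16 MiB part size visible in the RANGE lines above, the expected byte range of any chunk is pure arithmetic, which is what makes a wrong range stand out when two logs are compared:

```rust
// 16 MiB parts, as seen in the RANGE log lines (bytes=0-16777215, 16777216-33554431, ...).
const PART_SIZE: u64 = 16 * 1024 * 1024;

/// Expected inclusive byte range of chunk `k` of the ISO (the final chunk would be shorter).
fn expected_range(k: u64) -> (u64, u64) {
    let start = k * PART_SIZE;
    (start, start + PART_SIZE - 1)
}

// e.g. expected_range(2) == (33554432, 50331647), matching the third RANGE line quoted above.
```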
Updated logs (full): gaze15.txt
Failure log from 29526df: 29526df-darp7.txt
Failure log from 750ddc3: 750ddc3-darp7.txt
Closing this for now
Utilizes our async-fetcher crate for fetching the recovery ISO. You can test it by running
Then press Ctrl + C somewhere in the middle of the download to cancel the process. Running the same command will resume where it left off.
Refactors some of the daemon service to utilize and support tokio. Long term, dbus and dbus-crossroads should be replaced with zbus, and the service rearranged to be more async-friendly.