mmstick closed this 2 years ago
TimedOut errors may have been the cause. It was giving me issues where network changes caused a hang on a retry to a dead interface.
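The pattern described here is bounding each fetch attempt with its own timeout, so a retry against an interface that has gone away fails quickly instead of hanging. A minimal sketch of that pattern, assuming tokio and reqwest purely for illustration (neither is confirmed by this comment):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Illustrative only: each attempt is bounded by a hard timeout so that a retry
// against a dead interface errors out quickly instead of hanging forever.
async fn fetch_with_timeout(url: &str) -> Result<Vec<u8>, String> {
    const ATTEMPT_TIMEOUT: Duration = Duration::from_secs(15);
    const MAX_ATTEMPTS: u32 = 3;

    for attempt in 1..=MAX_ATTEMPTS {
        // One hypothetical attempt: a plain GET of the body.
        let try_fetch = async {
            let response = reqwest::get(url).await.map_err(|e| e.to_string())?;
            let bytes = response.bytes().await.map_err(|e| e.to_string())?;
            Ok::<Vec<u8>, String>(bytes.to_vec())
        };

        match timeout(ATTEMPT_TIMEOUT, try_fetch).await {
            Ok(Ok(bytes)) => return Ok(bytes),
            Ok(Err(err)) => eprintln!("attempt {attempt} failed: {err}"),
            // The attempt hung (e.g. the interface went away) and was cut off.
            Err(_elapsed) => eprintln!("attempt {attempt} timed out"),
        }
    }

    Err(format!("all {MAX_ATTEMPTS} attempts failed for {url}"))
}
```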
I've had two successful recovery upgrades with multiple disconnects and reconnects between WiFi and Ethernet, and combinations of connecting to both simultaneously and unplugging the other.
My testing still got a checksum mismatch:
Steps were the same ones I've been using:
Still getting successful recovery upgrades with these exact steps, but I'll keep trying.
Still getting a checksum error on cb4cd49. I will try some different hardware and see if I can narrow it down to anything. Let me know if there's any way I can get more info about what might be happening. Here's the log from my most recent test (using Dropbox because GitHub won't let me attach right now): https://www.dropbox.com/s/vi5gcp4888r17l0/upgrade.txt?dl=0
How fast is your download? I guess it's possible for a race condition if the network is fast enough to download a part before a network-interrupted fetch could shut down.
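The race being guessed at would be a new fetch starting to write parts before the interrupted one has fully shut down. One way to rule that out is to make any restart await the old task first; the sketch below uses tokio and tokio-util, and the names are hypothetical rather than taken from the code:

```rust
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

// Hypothetical guard: a restart must cancel the previous fetch task *and* await
// it, so its in-flight part writes are finished or abandoned before any new
// request can write to the same file.
struct FetchGuard {
    cancel: CancellationToken,
    task: JoinHandle<()>,
}

impl FetchGuard {
    async fn shut_down(self) {
        self.cancel.cancel();
        // Without this await, a fast enough network could let the next fetch
        // start writing while the old one is still flushing -- the suspected race.
        let _ = self.task.await;
    }
}
```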
Logs are fine. Behaving the same as my systems, besides the checksum mismatch.
> How fast is your download? I guess it's possible for a race condition if the network is fast enough to download a part before a network-interrupted fetch could shut down.
I'm seeing ~80-100 MB/s on Ethernet, and ~5-8 MB/s on WiFi.
I've been testing on the darp7 with a USB-C Ethernet adapter, and confirmed the same issue occurs with the built-in Ethernet jack.
I hope this works. I added some logs so we can compare when a fetch has flushed against when a subsequent fetch begins, and pruned some more of the shutdown canceling. Testing it now.
Two successful recovery upgrades with the latest update, although my Ethernet maxes out at 16 MB/s. There's some very verbose logging from async-fetcher showing the order in which parts are fetched, concatenated, and flushed.
Here is a log from ec62690 ending with the checksum mismatch: versbose-upgrade.txt
According to the logs, there are no synchronization issues anywhere. Everything is systematically performing all of its concurrent operations in the same order. The recovery ISO completes a successful part concatenation and flush on each cancelation, timeout, and network connection change before any subsequent request is started. Looking at other areas then...
I tried the same darp7 at the factory (significantly faster WiFi than mine, but slightly slower Ethernet), and it still had the checksum error.
I also tried with a gaze16 at the factory, and it did not have the checksum mismatch issue. Both machines had NVMe SSDs (the gaze16 had a Western Digital drive, while the darp7 has a Samsung drive.) I'll try to see if I can identify the difference on the darp7 causing it to fail more easily (hoping it's not just hardware.)
I'll push a small update with some potential fixes in a minute.
I can see that 1f9053a used larger part sizes (resumes tended to start farther back from the current position than they did before), but I still got the checksum error on darp7. Log in case it's useful: 1f9053a-darp7.txt
I plan to swap the darp7's SSD out with another system to check if it's the darp7's other hardware or the SSD/software that's causing the difference between it and other systems, then try to narrow the software/configuration down from there.
We could switch to a single connection download for the ISO and see if that works.
Ok, I recreated the checksum issue on the first try on lemp10 with a fresh 21.10 installation on a different SSD. lemp10 and darp7 are very similar internally, and the SSD was also a Samsung (darp7 has 980 Pro, lemp10 has 970 Evo Plus), but that at least rules out any software config on the darp7. (The darp7 install got its primary keyboard layout for cryptsetup/GDM switched to a different language a while back, so I wanted to confirm that wasn't somehow causing this.)
That's a bummer. Switching to one connection in case that improves stability
I have to make some changes, since single-connection downloads haven't been tested yet.
Had two successful recovery upgrades with a single connection now. There are fewer moving parts here, so it should have a higher chance of success in the face of interruptions and cancelations.
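For reference, the shape of a single-connection resume is small: check how much of the ISO is already on disk, then ask the server for the rest with one Range request and append it. The sketch below uses reqwest's blocking API purely for illustration; it is not necessarily the client in use at this commit:

```rust
use std::fs::OpenOptions;
use std::io::{copy, Seek, SeekFrom};

// Sketch only: one request, one writer, resuming from whatever is already on disk.
fn resume_download(url: &str, dest: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut file = OpenOptions::new().create(true).read(true).write(true).open(dest)?;
    let already = file.seek(SeekFrom::End(0))?; // bytes we already have

    let mut response = reqwest::blocking::Client::new()
        .get(url)
        .header(reqwest::header::RANGE, format!("bytes={already}-"))
        .send()?
        .error_for_status()?;

    // A robust version would confirm a 206 Partial Content reply before appending.
    copy(&mut response, &mut file)?; // append the remainder; no part ordering to get wrong
    file.sync_all()?;
    Ok(())
}
```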
Still got the checksum error on darp7 with 253dcb5. 253dcb5-darp7.txt
Perhaps we need a sync to disk before we get the length of the file on each retry.
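Roughly the idea, as a sketch (the real write path will differ): make sure everything written so far has actually reached the disk before the file length is used to build the next resume request.

```rust
use std::fs::File;
use std::io::Write;

// Illustrative only: sync before measuring, so the length used for the next
// resume request reflects what has actually landed on disk.
fn flush_then_measure(file: &mut File, part: &[u8]) -> std::io::Result<u64> {
    file.write_all(part)?;
    file.sync_all()?; // fsync: push the page cache out to the device
    Ok(file.metadata()?.len())
}
```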
No change: 7f10aee-darp7.txt
I'm out of ideas for now
Just spitballing here, but would logging the checksum after each concat, and/or logging the checksum of each part downloaded, possibly help show where the corruption is happening? Assuming the chunks being downloaded are a constant size, the checksum at a given progress point should match between systems where this is and isn't happening.
Will try that
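A sketch of what per-part checksum logging could look like; the md5 and log crates are assumptions here, and any digest would do. The point is one log line per part, emitted before concatenation, so two runs can be diffed chunk by chunk:

```rust
// Sketch only: names and crates are illustrative, not taken from pop-upgrade.
// The format roughly mirrors the CONCAT lines that show up in the later logs.
fn log_part_checksum(dest: &str, part_index: usize, part: &[u8]) {
    let digest = md5::compute(part);
    log::debug!("CONCAT {}:{} {:x} {} bytes", dest, part_index, digest, part.len());
}
```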
Here are logs with checksums from b6d88a1: b6d88a1-darp7.txt
There are some issues with the transition at the moment. I'll let you know when it's ready again.
Okay. The transition from using the isahc HTTP client to reqwest is ready for testing.
Still got the checksum error on 154dd2f: 154dd2f-darp7.txt
Comparing this log with the last one, I see that the concat checksums matched up to 1D88AAC5E6D859B7E5A1091728A7C34D, then differed after the fourth (last) manual cancel on the earlier run. After that, the next concat from the earlier run was 7969B161E66449732D6A10C0BD7B2DDD, which is not present in the later run. That last cancel would have been resumed from the GUI (with everything before then being performed on the CLI), which is the only unique thing about it I can think of compared with the earlier cancels.
Have you noticed any cancels where the resumed byte range had a gap or overlap?
get_many.rs:54: GET "/var/cache/pop-upgrade/recovery.iso" bytes=3137339392-3138715647
Seems to have resumed correctly
GET "/var/cache/pop-upgrade/recovery.iso" bytes=872415232-889192447
lib.rs:291: inner_request fetch result is Err(Canceled)
GET "/var/cache/pop-upgrade/recovery.iso" bytes=872415232-889192447
8c20466 still getting wrong checksum: 8c20466-darp7.txt
Failure on 052f3e4: 052f3e4-darp7.txt
Comparing 052f3e4 to 154dd2f, the first 52 chunks (up to ED8A84F04F1806F4CB9C6D7E15C62EEB) had the same checksum, then 052f3e4 got B33EFDB19456767A71C58CA24CE6B93E, which was not present in 154dd2f (so is presumably incorrect.) This one happened after the second-to-last cancel/resume, which would not have been after switching to the GUI, but just Ctrl-C and then running the command again in the terminal.
Interestingly, the four chunks 052f3e4 got during that period all mismatched the other log, but then after resuming from the GUI, it got A2BC4F171FB20360F3301B6CACF78A80, which is present in the other log again. From there on, the rest of the chunks have consistent checksums with the earlier attempt, but the final checksum was mismatched. (Not sure if logging the checksum of the actual final file before or after concats would add any additional value over just checksumming the parts themselves.)
Here's a log of my own from another successful upgrade. I've increased the amount of data logged.
Failure log from 5499263: 5499263-darp7.txt
Our checksums on the concat lines now differ as early as the first concat:
Mar 02 16:37:43 pop-os pop-upgrade[933]: [DEBUG] get_many.rs:55: RANGE /var/cache/pop-upgrade/recovery.iso:0 - bytes=0-16777215
Mar 02 16:37:43 pop-os pop-upgrade[933]: [DEBUG] get_many.rs:55: RANGE /var/cache/pop-upgrade/recovery.iso:1 - bytes=16777216-33554431
Mar 02 16:37:51 pop-os pop-upgrade[933]: [DEBUG] concatenator.rs:44: CONCAT /var/cache/pop-upgrade/recovery.iso:0 52D54746144D366C200EA41BBF48A724 16777216 bytes
[DEBUG] get_many.rs:55: RANGE /var/cache/pop-upgrade/recovery.iso:0 - bytes=33554432-50331647
[DEBUG] get_many.rs:55: RANGE /var/cache/pop-upgrade/recovery.iso:1 - bytes=50331648-67108863
[DEBUG] concatenator.rs:44: CONCAT /var/cache/pop-upgrade/recovery.iso:0 BE7163C1E37763F6084BF76015ED2958 16777216 bytes
Based on the ranges shown, it looks like yours was already partway done before the log started? I don't see the first few checksums from your log anywhere in mine, although the last few do match (which suggests to me that it's still the individual pieces' checksums being logged.)
Yeah the first few parts were already fetched
Yeah, this is strange. The fetched ranges from the darp7 are wrong for many of the fetches. At random points in between, the checksums differ, and sometimes they match.
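For what it's worth, with the fixed 16 MiB part size visible in the RANGE lines above, the expected byte range of any chunk is pure arithmetic, which is what makes a wrong range stand out when two logs are compared:

```rust
// 16 MiB parts, as seen in the RANGE log lines (bytes=0-16777215, 16777216-33554431, ...).
const PART_SIZE: u64 = 16 * 1024 * 1024;

/// Expected inclusive byte range of chunk `k` of the ISO (the final chunk would be shorter).
fn expected_range(k: u64) -> (u64, u64) {
    let start = k * PART_SIZE;
    (start, start + PART_SIZE - 1)
}

// e.g. expected_range(2) == (33554432, 50331647), matching the third RANGE line quoted above.
```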
Updated logs (full): gaze15.txt
Failure log from 29526df: 29526df-darp7.txt
Failure log from 750ddc3: 750ddc3-darp7.txt
Closing this for now
Utilizes our async-fetcher crate for fetching the recovery ISO. You can test it by running
Then press Ctrl + C somewhere in the middle of the download to cancel the process. Running the same command will resume where it left off.
Refactors some of the daemon service to utilize and support tokio. Long term, dbus and dbus-crossroads should be replaced with zbus, and the service rearranged to be more async-friendly.