Closed mmstick closed 2 years ago
1d85ded failed to build on all platforms. Also fails to build locally.
error[E0107]: this struct takes 0 lifetime arguments but 1 lifetime argument was supplied
--> src/main.rs:193:26
|
193 | async fn main_(matches: &ArgMatches<'_>) -> anyhow::Result<()> {
| ^^^^^^^^^^---- help: remove these generics
| |
| expected 0 lifetime arguments
|
note: struct defined here, with 0 lifetime parameters
--> /home/system7/upgrade/vendor/clap/src/parse/matches/arg_matches.rs:71:12
|
71 | pub struct ArgMatches {
| ^^^^^^^^^^
error[E0581]: return type references lifetime `'_`, which is not constrained by the fn input types
--> src/main.rs:193:45
|
193 | async fn main_(matches: &ArgMatches<'_>) -> anyhow::Result<()> {
| ^^^^^^^^^^^^^^^^^^
Some errors have detailed explanations: E0107, E0581.
For more information about an error, try `rustc --explain E0107`.
warning: `pop-upgrade` (bin "pop-upgrade") generated 20 warnings
error: could not compile `pop-upgrade` due to 2 previous errors; 20 warnings emitted
make[2]: *** [Makefile:75: target/release/pop-upgrade] Error 101
make[2]: Leaving directory '/home/system7/upgrade'
make[1]: *** [debian/rules:18: override_dh_auto_build] Error 2
make[1]: Leaving directory '/home/system7/upgrade'
make: *** [debian/rules:7: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
Should be fixed now.
If I start the upgrade from a terminal, cancel it with Ctrl-C, then open the GUI and try to download it there, I get a process has been cancelled
error:
It seems to work the second time I try after having cancelled, but not the first. (Also noticed while testing that the CLI progress incorrectly goes to 100%/fully fetched after cancelling partway through, as is visible in the screenshot.)
Resuming does seem to work, so far.
On 5b0240f, both of the described issues are resolved. Resuming from cancelled downloads is still working.
While testing behavior when switching network connections (e.g. plugging in/unplugging Ethernet while also connected to WiFi), I saw that sometimes (after the ~second connection change and onwards) the reported progress is doubled when the download starts progressing again. In the GUI, this results in impossible progress of greater than 100% being reported:
In the CLI, the download seemed to stop when the incorrect progress number reached 100%, followed by errors in journalctl about the recovery files failing to extract:
Powering off and then attempting to access Recovery with the spacebar shows that it was rendered unbootable by that error.
After the errors (before powering off), I also had three of what seem to be the bad recovery image showing up in Nautilus:
Automatic retries aren't resetting progress. Will fix.
The progress should now get properly reset when a retry is being performed. The the result of the recovery fetch task wasn't being returned, which caused it to believe that the task was Ok.
On a103309 with the CLI, the number fetched is now going back down to 0 when I cancel with Ctrl-C. That's less of a misrepresentation than jumping to 100%, but still inaccurate and could be annoying if someone's trying to troubleshoot.
Progress reporting appears to be working correctly, or at least much better, now. The CLI keeps its current number when I cancel with Ctrl-C.
After canceling/resuming and switching network connections a few times, I eventually got a failed to download ISO
error message after unplugging Ethernet:
I clicked Update again, and the download resumed from where it was (near the end), but once it finished, the checksum was invalid:
After that, when I click Update again, it seems to just try and verify what's already downloaded again (which fails again):
We might need to delete the iso if the checksum verification fails? (As for why it was corrupted in the first place, it seemed like the progress might have been jumping slightly after some network change retries, but the amount was so small that it could have also just been e.g. the last bit of progress from before the disconnection not being reported until it's reconnected, so I can't say for certain.)
The first time I tried to update recovery on 0ce6eab, I got the "failed to verify" error again, but the second time, it started downloading from scratch, so it looks like it removes after failure to verify now. I guess it makes sense that we don't want to automatically retry indefinitely, so this behavior looks good.
When resuming after cancelling, I am sometimes seeing a large delay/hang when I try to resume the download:
When I tried to Ctrl-C from that the first time it happened, I continued seeing MessageCall log messages for a minute, although it did eventually exit:
On trying again, the upgrade resumed as I would expect; the next time I resumed, it hung again and didn't exit even after waiting many minutes after pressing Ctrl-C. Every time I pressed Ctrl-C, I saw another cancelling a process which is in progress
log message, but it never actually exited.
The hang seems to always happen the first time I resume after switching network connection, e.g. to recreate:
The one where it never exited was on WiFi, so I'm wondering if network connection/speed has something to do with it. The last one that I recreated on Ethernet exited immediately when I pressed Ctrl-C, with a different error message in the logs:
I had never seen a hang at this point until 0ce6eab, so it could be related to the version of async-fetcher that was changed with that commit.
The change in that commit is that it flushes the file when canceled. Perhaps I should have it flush more often, or do the flush in the background.
Still able to create a hang using the steps listed before on d225885.
Testing out a fix but I think it's working. Will update if it does.
The hang described in https://github.com/pop-os/upgrade/pull/261#issuecomment-1032933185 is no longer occurring! :tada:
On a0c6cd2, I am now unable to cancel with Ctrl-C in the terminal when downloading over Ethernet (seeing the progress go up by about 100 MiB/s). After attempting to cancel, the progress continues going up until I unplug the Ethernet cable, at which point it finally fails/cancels with recovery upgrade aborted: failed to download iso
. This does not happen on WiFi; when I press Ctrl-C on WiFi, the download stops with that message after just a few seconds, without having to disconnect.
On WiFi, the progress still goes up at least one more time after I start the cancel operation. I assume the issue on Ethernet has something to do with the much higher speed (whatever mechanism is supposed to cancel not having time to run/complete.)
I think you're right about the higher speed not giving time to yield to the next future. I'll add some yield calls and see if that helps.
I'm on a decent 16 MiB/s Ethernet connection right now and can cancel as it's downloading, and while the network is disconnected.
I am able to cancel on Ethernet again on 7437eec, but the freeze from https://github.com/pop-os/upgrade/pull/261#issuecomment-1032933185 is now happening again.
I'll ping you when I see all of these instances resolved.
Okay, I'm not seeing any hangs when canceling during network loss, when canceling after networking has been restored, or when resuming after either of those conditions.
Okay, I'm not seeing any hangs when canceling during network loss, when canceling after networking has been restored, or when resuming after either of those conditions.
I know that wasn't a ping, but I'm still seeing the hang on fbdad21... after upgrading pop-upgrade, I rebooted, confirmed the pop-upgrade version, started download on WiFi, cancelled, resumed on WiFi, plugged in Ethernet, cancelled, and attempted to resume.
In case it's useful, when it gets into this "hung" state, if I try to cancel with Ctrl-C, it still waits a while, but it does seem to exit eventually-- and much quicker on Ethernet than WiFi. I get the failed to find build
messages as shown in the last screenshot of https://github.com/pop-os/upgrade/pull/261#issuecomment-1032933185 when it finally does exit after cancelling from a hang.
That is a good clue
@jacobgkau Added a 1 second timeout for a stalled connection for the code that checks for new releases. And changed the way that cancels are performed, with a method that's much quicker about canceling. The pop-upgrade client was also only checking for status updates every 3 seconds, so I've changed that to 0.5 seconds so it appears to hang for much less time.
On a14f6a7, I get an inaccurate "refueled and ready to go" message on cancelling (with a missing line break):
@jacobgkau Cancelations will now return errors and show them correctly in the client.
Looks better now:
CLI:
GUI:
I now had two problems near the end of testing:
On the last network reconnection, a timeout was reached for some reason (I'm not sure why, as I just unplugged Ethernet like I'd already done a number of times before without getting this error): That one's not a blocking problem (it makes sense the download might stop/fail if the network cable is unplugged), just thought I'd mention it.
After cancel/resuming and reconnecting a number of times, I again had a checksum mismatch: This one seems like it could be indicative of an issue with the resume capability. Is there anywhere resuming might not be picking back up/writing correctly?
It is peculiar since I can't trigger it with async-fetcher's own utility with an interrupt. I suspect it's an issue somewhere in how it cleans up after a cancelation. I can continue looking at that area to see if there's a way to improve the shutdown of the future that is concatenating to that file.
This change may help. Canceling the fetcher shouldn't abruptly cancel writes to the ISO that are mid-write. The task will complete the write that it is currently performing, and flush the file before exiting.
Still got a corrupted iso with the same test procedure. Interestingly, (in addition to the earlier timeout after unplugging Ethernet after the window had been closed/opened), I got a timeout this time just before the download finished, even though I hadn't changed network connection immediately before this timeout:
When I clicked Update again, it almost immediately started verifying the image, but the checksum didn't match:
Maybe there's a bug in the HTTP client I'm using with its own internal timeouts. I'll try disabling the low_speed_timeout
configuration I've set.
It seems like 1eb9567 has prevented the download from resuming after a network connection change-- starting on WiFi and plugging in Ethernet, the progress stops and doesn't start going up again.
Network changes do seem to give libcurl grief. I've enabled async-fetcher's own timeout configuration with a 5s timeout on a read from the HTTP server that doesn't progress.
Still able to recreate the timeout after closing/opening Settings during download and then plugging in Ethernet and unplugging it again, and the checksum was invalid again-- back to how things were in https://github.com/pop-os/upgrade/pull/261#issuecomment-1035442187.
When trying to recreate it a second time on the same boot (planning to skip the last set of network switches to avoid the timeout), I ended up having the progress in the GUI stall on WiFi after unplugging Ethernet; after closing and re-opening Settings, I had the green Upgrade button again, and clicking it made the button go away but didn't show the progress again:
To try and narrow down the corruption issue, I've tried downloads in the CLI with only manual cancel/resumes and only automatic network reconnections (the latter gave me one failed to download ISO
error immediately after unplugging Ethernet, which I had to manually resume from); both of those downloads completed and the checksum was valid.
Next, I rebooted, opened the GUI, and tried one with just network disconnects/reconnects. I was also able to recreate an immediate timeout after one of the network switches, and had another immediate timeout when the download was about to finish, but no checksum failure after clicking the Upgrade button again after both of those.
Rebooted again, started the upgrade in the GUI on Ethernet, unplugged & waited for resume, closed Settings, opened it again, plugged Ethernet back in & let it finish-- got an immediate timeout at the end of the download, but no checksum mismatch.
@mmstick Are you able to recreate the timeout at the end of the download with the below steps?
And/or are you able to recreate the checksum mismatch with the below steps?
Seems like there are at least two problems: "timeouts" sometimes occurring in far less than 5 seconds (sometimes right at the end of the download), and the longer sequence resulting in a corrupted download for whatever reason (plus whatever GUI quirk that was which cased a non-working Update button.)
I've made a few changes that have resolved the issues.
So I'm able to dynamically enable and disable Ethernet and WiFi and cancel the download while still getting a successful checksum validation on completion.
Recovery after network connection change is now usually much faster than before.
CLI:
GUI:
When following the above testing procedure, I re-opened the GUI window, plugged in Ethernet, waited for the download to mostly finish, and then unplugged Ethernet; the download stalled at that point.
After closing and re-opening Settings, I had the same issue as in https://github.com/pop-os/upgrade/pull/261#issuecomment-1035643199 where re-opening Settings showed the Upgrade button again, and clicking it didn't start/resume the download.
After manually restarting the pop-upgrade
service, the download resumed in that GUI window without having to click anything else; however, the checksum was mismatched when it finished:
Maybe whatever caused that last stall-out/hang meant that the file didn't get a chance to be flushed.
I think the GUI may be hanging on the release check since it's also using libcurl and there's no network connection detection code in pop-upgrade. I can import the functionality here to catch that.
24f2329 hung when plugging Ethernet back in after re-opening the GUI (one step before where it hung in https://github.com/pop-os/upgrade/pull/261#issuecomment-1036429756, but similar in that it was a network reconnect after re-opening); still had the issue with the Update button disappearing without doing anything, still had a checksum mismatch.
I have something that works to resolve the GTK client issues. Which required gracefully handling service termination from systemd and altering the order that DBus events and DBus status are set to give time for the client to receive the last results before the service is terminated. As well as switching to a different tool for async cancelation that can both signal a cancelation and wait for things to finish canceling that have requested to delay the shutdown.
Found issue with package updates. Calling pop-upgrade release update
, canceling, then redoing it blocks the daemon.
Issue resolved
Cancelling on the CLI is back to saying failed to download ISO
instead of process has been cancelled
, is that expected? It was more clear when it said "cancelled."
Error message is now
Recovery upgrade status: recovery upgrade aborted: fetching from https://pop-iso.sfo2.cdn.digitaloceanspaces.com/21.10/amd64/nvidia/6/pop-os_21.10_amd64_nvidia_6.iso failed: task was canceled
CLI:
GUI:
After the above testing procedure (same one as the latter described in https://github.com/pop-os/upgrade/pull/261#issuecomment-1035643199), I still get a bad checksum:
In the CLI, it also looks like the final download progress doesn't get reported before the output moves on to calculating the checksum (I was confused why it was checksumming when the download wasn't finished yet):
(Ignore the ^[[15~^[[17~
in the output, I just hit the wrong key while trying to printscreen.)
CLI:
GUI:
fcec0f2 hung when attempting to switch network connection for the second time, from Ethernet to WiFi. Once it hung, Ctrl-C did not stop the process either.
After killing pop-upgrade and restarting it, clearing cache, and trying again, I was able to get farther, but hung the first time I tried switching from Ethernet to WiFi in the GUI:
CLI:
GUI:
Closing and re-opening the GUI at this point gave me the green Update button again, and upon clicking it, I get no progress again:
That's a big regression for such a small change. Guess I'll revert it.
Maybe this'll help. Added a shutdown to the concatenator task
CLI:
GUI:
269c98c fixed the hangs, but still getting the checksum mismatch.
I have only one idea for the mismatch. I'm thinking it's possible that a redownload begins to happen while the last fetch is still in the process of handling a shutdown. So I'll figure out a way to prohibit a resume until all spawned tasks are finished.
Utilizes our async-fetcher crate for fetching the recovery ISO. You can test it by running
Then
Ctrl + C
somewhere in-between the download to cancel the process.Running the same command will resume where it left off.
Refactors some of the daemon service to utilize and support tokio. Long term dbus and dbus-crossroads should be replaced with zbus, and the service rearranged to be more async-friendly.