
Automatically exported from code.google.com/p/munki

managedsoftwareupdate not resuming partial downloads; produces persistent error until partial file removed #57

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. launch managedsoftwareupdate --checkonly
2. interrupt a managedsoftwareupdate when it begins to download a new dmg by 
either pulling the network cable or sending a SIGINT to the process
3. launch managedsoftwareupdate a second time and note that the attempt to 
re-download the file produces an error

What is the expected output? What do you see instead?

The curl process managedsoftwareupdate uses to fetch a dmg should be able to 
resume an interrupted download or remove any partial files it cannot resume.

Instead, managedsoftwareupdate will continue to fail downloading the dmg until 
the partial has been removed from the cache.

Original issue reported on code.google.com by daygloje...@gmail.com on 3 Feb 2011 at 11:13

GoogleCodeExporter commented 8 years ago
I forgot to say which build this was discovered in...

Version: 0.7
Build: r1004

Original comment by daygloje...@gmail.com on 4 Feb 2011 at 7:49

GoogleCodeExporter commented 8 years ago
What web server are you running? Are any standard modules turned off?

If your web server refuses the byte-range request you will see this behaviour. 
Munki doesn't currently catch curl's exit code 33 - it is the special case that 
prevents the existing partial download from being used.

"curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume."

I'm not seeing this fault against my webserver (Apache2 on Linux) or against 
Akamai when testing with the underlying curl commands. I also don't see this 
behaviour when doing functional testing with munki (kill curl part way through 
a managedsoftwareupdate run). 0.7.0 Build 1005.

In looking at this I note a different bug which I will think about ... resuming 
partials does not currently use the etag to ensure it is resuming the same 
version of the file. The etag is currently only used to check that previously 
fully-downloaded files are still fresh.
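
(For reference, a hedged sketch in modern Python of how an ETag typically 
validates a fully-downloaded file; munki itself drives curl rather than 
urllib, and the function name is made up.)

    import urllib.error
    import urllib.request

    def cached_copy_is_fresh(url, stored_etag):
        """Send the stored ETag back via If-None-Match; a 304 reply
        means the server's copy has not changed."""
        request = urllib.request.Request(
            url, headers={'If-None-Match': stored_etag})
        try:
            urllib.request.urlopen(request)
        except urllib.error.HTTPError as err:
            if err.code == 304:  # Not Modified
                return True
            raise
        return False  # 200 with a body: the file has changed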

Original comment by rrmiddleton@gmail.com on 5 Feb 2011 at 11:46

GoogleCodeExporter commented 8 years ago
It's a stock Mac OS X Server (Apache/2.2.15), but we are using munki server 
(https://github.com/jraine/munkiserver) to host the repo. This is a Rails app 
using Passenger/Apache, but byte-range requests should be handled by Apache, no?

I am fairly certain we have not disabled any standard modules.

Still, this may be specific to our environment.

Original comment by daygloje...@gmail.com on 6 Feb 2011 at 2:23

GoogleCodeExporter commented 8 years ago
After a bit of reading, I think this may be a defect in munki server or a 
limitation of Passenger.

I am going to change the status of this issue to Low while we review this.

Thanks.

Original comment by daygloje...@gmail.com on 6 Feb 2011 at 2:33

GoogleCodeExporter commented 8 years ago
It will be a defect in munki-server. Apache and Passenger are just acting as a 
proxy to a stream in this case, and it is up to the dynamic code to handle all 
headers when serving a stream. While this is a bug in munki too, munki really 
does need the server to support byte-range requests when downloading large 
payloads - particularly when serving laptops that change wifi access points 
(dropping TCP connections as they move) or remote clients on slower links.

Apache can only do byte-range if it is the program accessing the file from disk 
(and thus knows everything about the file, can start reading at any point, and 
can generate an eTag that it trusts).

Note - munki does not expect byte-range support for manifests or catalogs 
(resume is not set to True when they are fetched in updatecheck). Thus these 
are easy to generate with dynamic code, with no need to worry about 
implementing partial requests. I expect in most cases pkgs can (and should) 
happily be served from an appropriately protected Apache directory.
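
(A quick way to test those two capabilities against your own repo server; a 
sketch in modern Python with a made-up URL. A 206 reply means byte ranges 
work; a missing ETag header means freshness checks will be weaker.)

    import urllib.request

    def probe_repo_server(url):
        """Request the first byte of a pkg and report whether the server
        honours Range requests and returns an ETag."""
        request = urllib.request.Request(url, headers={'Range': 'bytes=0-0'})
        response = urllib.request.urlopen(request)
        supports_ranges = response.status == 206  # 206 Partial Content
        etag = response.headers.get('ETag')       # None if not supplied
        return supports_ranges, etag

    # e.g. probe_repo_server('https://munki.example.com/repo/pkgs/Firefox.dmg')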

Opinion on static content / dynamic web-apps / auth:
I currently do auth for pkgs with client certificates (thus at the Apache 
auth-wrapper level, not the application level). In other applications I write 
auth at the mod_perl or mod_python level (PerlAuthenHandler, PerlAuthzHandler); 
this allows custom auth while Apache still does all the other fancy HTTP 
handling for static content. It also substantially improves security for 
dynamic apps: unauthenticated clients cannot reach the underlying application 
code, only the authentication wrapper (which is tiny and auditable).

Original comment by rrmiddleton@gmail.com on 6 Feb 2011 at 3:31

GoogleCodeExporter commented 8 years ago
Great summary, Rob. Thanks.

Do you think it would be advisable to add some contingency code that would 
remove the partial and attempt a re-download in a case where it could not 
resume?

Original comment by daygloje...@gmail.com on 6 Feb 2011 at 4:38

GoogleCodeExporter commented 8 years ago
Brian: see if r1026 addresses this issue for your server. Expected behavior is 
that it attempts to resume, catches the 33 error code from curl and removes the 
partial download. The next download attempt should then succeed.
So you'll see one failed attempt to resume the download. The next attempt 
should start over from scratch.

Original comment by gregnea...@mac.com on 8 Feb 2011 at 6:33

GoogleCodeExporter commented 8 years ago
I think that ought to do it. Thanks, Greg.

Behavior is now: attempt a resume; on error, fail. On the subsequent run, 
attempt a fresh download.

One problem: I noticed that r1026 was not returning its build number...

[11:02:33]van-daphne# managedsoftwareupdate -V
0.7.0 Build svn

I wanted to begin deploying the 0.7r1004 build today. Any reason I should not 
sub the r1026 build in its place?

Original comment by daygloje...@gmail.com on 8 Feb 2011 at 7:06

GoogleCodeExporter commented 8 years ago
Greg, I'd be tempted to log a specific warning if a byte-range request fails 
(error 33) before throwing the exception.

Perhaps munki should expect these minimum capabilities from a web server:
- resume / byte-range requests (for pkgs, but not necessarily for manifests or 
catalogs)
- an ETag returned in all cases (full or partial download)

If the web server fails on one of these, munki should log a warning. These 
failures cause less-than-ideal behaviour in munki, so it could be good to have 
a 'don't blame munki - your web server is broken' warning in the munki logs.

Additionally - if there's no objection, I might implement saving the ETag of a 
partial download, and then only resuming a partial if the ETag matches. 
Currently munki will resume a partial without checking whether the file on the 
server has changed. The SHA checksum will of course stop any major issue, but 
it would be good not to resume a download when we know it will be corrupt.
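
(A sketch of what that could look like using the xattr package available on 
Mac OS X; the attribute name and helpers are illustrative, not the actual 
patch code.)

    import xattr  # extended-attribute access, available on Mac OS X

    ETAG_XATTR = 'com.example.munki.etag'  # illustrative attribute name

    def record_partial_etag(partialpath, etag):
        """Stamp the server's ETag onto the partial download."""
        xattr.setxattr(partialpath, ETAG_XATTR, etag.encode('utf-8'))

    def partial_is_resumable(partialpath, current_etag):
        """Only resume when the recorded ETag matches the server's."""
        try:
            stored = xattr.getxattr(partialpath, ETAG_XATTR).decode('utf-8')
        except (IOError, OSError):
            return False  # no recorded ETag; safer to start over
        return stored == current_etag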

Original comment by rrmiddleton@gmail.com on 8 Feb 2011 at 7:25

GoogleCodeExporter commented 8 years ago
These sound like reasonable changes.  I'm currently in panic mode, getting 
ready for a trip to Oslo, Norway. Would you like to take a crack at 
implementing these?

I would not want munki to log either of these warnings more than once a 
session; it would be annoying to see dozens of "WARNING: HTTP server foo does 
not support Etags" in the logs.
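
(One simple way to get that once-per-session behaviour, as a sketch; since 
managedsoftwareupdate is a fresh process each run, a module-level flag resets 
naturally. Names are illustrative.)

    import logging

    _warned_byte_ranges = False  # reset each run, i.e. each session

    def warn_byte_ranges_unsupported(servername):
        """Log the 'your web server is broken' warning at most once."""
        global _warned_byte_ranges
        if not _warned_byte_ranges:
            logging.warning(
                'HTTP server %s does not support byte ranges; '
                'partial downloads from it cannot be resumed', servername)
            _warned_byte_ranges = True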

Original comment by gregnea...@mac.com on 8 Feb 2011 at 7:46

GoogleCodeExporter commented 8 years ago
Attached patch enhances a few things:
- if an ETag is present on a partial, only resume the download if it matches
- if resume is not possible, immediately retry a full download (see the sketch 
after this list)
- log if the ETag is missing for downloads we expect to be resumable (i.e. pkgs)
- log if a range / resume request fails
- delete the partial download in a few other error cases
- actually log the curl exit code / error string on all download failures
- don't crash munki if the curl error string is not in the expected form
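
(A rough sketch of the resume-then-immediate-retry flow the first two items 
describe, extending the curl sketch from earlier in the thread; helper names 
and flags are assumptions, not the patch's actual code.)

    import os
    import subprocess

    CURLE_RANGE_ERROR = 33  # server refused the byte-range request

    def fetch_pkg(url, destpath):
        """Resume a partial if present; on a range failure, delete the
        partial and immediately retry as a full download."""
        base_cmd = ['curl', '--silent', '--fail', '--location',
                    '--output', destpath, url]
        if os.path.exists(destpath):
            result = subprocess.call(base_cmd + ['--continue-at', '-'])
            if result == 0:
                return True
            if result != CURLE_RANGE_ERROR:
                return False  # other failure; leave the partial for next run
            os.remove(destpath)  # cannot resume; retry from scratch right away
        return subprocess.call(base_cmd) == 0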

Code is tested against Apache, with various forms of failure causing partials 
and with ETags on/off and range capability on/off. All working without crashes 
or loops.

TODO comments are left in the code where I'm not certain of the best behaviour. 
I would appreciate comments before committing (without quite so many code 
comments). No hurry.

Slight defect - the ETag xattr could probably be written as soon as we finish 
with the headers (rather than waiting for curl to exit). However, I'm not 
certain of the exact moment curl creates the output file.

The code was already pretty solid for everything but the various possibilities 
that might cause curl to fail repeatedly with partials & servers that don't 
deal with the HTTP/1.1 features we want.

Original comment by rrmiddleton@gmail.com on 12 Feb 2011 at 3:58

Attachments:

GoogleCodeExporter commented 8 years ago
Hey, Rob -- just now got a chance to look at this. I agree in concept with the 
changes, but can't get the patch to apply to the current version of 
updatecheck.py -- parts of the patch apply, and others are rejected. Can you 
re-do this against the current version of updatecheck.py?

More specific comments:

I'm all for providing more info when a download fails. (Lines 119-124 of the 
patch)

I like the conversion of a resume to a full download if the server doesn't 
appear to support byte range requests. (Lines 80-98 of patch)

...and you've neatly solved the issue of repeat logging of Etag/Byte-range 
request errors.

-Greg

Original comment by gregnea...@mac.com on 25 Feb 2011 at 6:36

GoogleCodeExporter commented 8 years ago
Hi Greg, hope Europe treated you well.

Patch against r1047 attached. Looks like some end-of-line whitespace changed 
recently, which produced some spurious patch lines.

Note I am now uncertain about adding support for ETag checking on resume. The 
patch provides for deleting the partial, rather than resuming, if the ETag 
does not match.

The changes discussed in issue 60 mean the ETag may not be so useful. If a 
partial download has happened from one HTTP server (or a file:// source), 
should we be able to resume from another source (future code)? If so, the only 
things we expect to be the same between copies are the modified time and size 
(and the SHA checksum, which we don't know until we complete the download).

Perhaps the original behaviour is simplest here -- if there is a partial 
download, check nothing and just keep downloading. Fancy tricks won't hold up 
well in the face of ad-hoc clustered servers for pkgs (whether multiple 
http(s) sources or file:// ones).
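
(If the modified-time-and-size route were taken instead, the check might look 
like this sketch; the helper is hypothetical and assumes the size and 
Last-Modified string were recorded when the partial began.)

    import urllib.request

    def same_file_on_source(url, expected_size, expected_last_modified):
        """HEAD the candidate source and verify it appears to serve the
        same file the partial came from."""
        request = urllib.request.Request(url, method='HEAD')
        response = urllib.request.urlopen(request)
        size = int(response.headers.get('Content-Length', '0'))
        modified = response.headers.get('Last-Modified')
        return size == expected_size and modified == expected_last_modified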

Original comment by rrmiddleton@gmail.com on 25 Feb 2011 at 10:15

Attachments:

GoogleCodeExporter commented 8 years ago
Patch committed in r1051. We can revisit the resume vs etag as we integrate the 
file:// stuff.

Original comment by gregnea...@mac.com on 2 Mar 2011 at 7:21

GoogleCodeExporter commented 8 years ago
Brian - is this issue fixed?

Original comment by gregnea...@mac.com on 22 Apr 2011 at 3:33

GoogleCodeExporter commented 8 years ago
No reply from Brian, but I believe this is fixed.

Original comment by gregnea...@mac.com on 12 May 2011 at 5:12