jgstew / jgstew-recipes

For JGStew's AutoPkg Recipes and Processors
MIT License

URLDownloader redownloads files even when they haven't changed #3

Closed jgstew closed 3 years ago

jgstew commented 3 years ago

URLDownloader redownloads the file if either the ETag header or the Last-Modified header has changed.

I would like the option to check any combination of ["etag", "last-modified", "content-length"] and skip downloads if any one or any 2 of them have not changed.
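A minimal sketch of what that option could look like, assuming the saved and current header values are available as dictionaries with lowercase keys. The names `count_unchanged`, `should_skip_download`, and `min_unchanged` are hypothetical and are not part of URLDownloader:

```python
from typing import Iterable, Mapping

DEFAULT_CHECKS = ("etag", "last-modified", "content-length")


def count_unchanged(
    previous: Mapping[str, str],
    current: Mapping[str, str],
    checks: Iterable[str] = DEFAULT_CHECKS,
) -> int:
    """Count the checked headers that exist in both mappings and are equal."""
    unchanged = 0
    for name in checks:
        old = previous.get(name)
        new = current.get(name)
        # A header missing on either side is never treated as "unchanged".
        if old is not None and old == new:
            unchanged += 1
    return unchanged


def should_skip_download(
    previous: Mapping[str, str],
    current: Mapping[str, str],
    checks: Iterable[str] = DEFAULT_CHECKS,
    min_unchanged: int = 1,
) -> bool:
    """Skip the download when at least `min_unchanged` checked headers match."""
    return count_unchanged(previous, current, checks) >= min_unchanged
```

With `min_unchanged=1`, a rotating ETag alone no longer forces a redownload as long as Last-Modified or Content-Length still matches; `min_unchanged=2` requires two of the checked headers to agree.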

From what I can tell, URLDownloader redownloads if the ETag alone has changed. Even with the check-filesize-only option set, it still seems to redownload the file in order to perform that check.

Using the CHECK_FILESIZE_ONLY option stops the process from continuing as if there were a new download when there isn't one, but it still seems to download the full file in order to check.

Sending the ETag / Last-Modified request headers is handled in URLGetter:

https://github.com/autopkg/autopkg/blob/5792fcc09d2c3862dd5e430329d4a9ac88b5d019/Code/autopkglib/URLGetter.py#L79

From here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match

"When used in combination with If-Modified-Since, If-None-Match has precedence (if the server supports it)."
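Because If-None-Match takes precedence, a stale or unreliable ETag can defeat an otherwise valid If-Modified-Since check. The sketch below builds a conditional HEAD request with Python's standard `urllib` and sends only the validators that are passed in. It is an illustration, not AutoPkg's implementation (URLGetter drives curl), and the helper names `build_conditional_head` and `is_not_modified` are hypothetical:

```python
import urllib.error
import urllib.request
from typing import Optional


def build_conditional_head(
    url: str,
    etag: Optional[str] = None,
    last_modified: Optional[str] = None,
) -> urllib.request.Request:
    """Build a HEAD request carrying only the validators that were supplied."""
    req = urllib.request.Request(url, method="HEAD")
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def is_not_modified(req: urllib.request.Request, timeout: float = 30.0) -> bool:
    """Return True if the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return False  # 2xx response: the server considers the file changed
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 304 because no default handler accepts it.
        if err.code == 304:
            return True
        raise
```

Leaving `etag` unset sends If-Modified-Since on its own, so the server evaluates the date rather than the ETag.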

Related:

jgstew commented 3 years ago

The following does work to NOT get the download again:

curl --head -H "If-Modified-Since: Wed, 20 Jan 2021 21:30:18 GMT" https://software.bigfix.com/download/bes/100/util/QNA10.0.2.52.zip

jgstew commented 3 years ago

Seems like there are 2 options:

jgstew commented 3 years ago

I resolved this for the following: https://github.com/jgstew/jgstew-recipes/blob/main/BigFix/FixletDebugger.download.recipe.yaml

By clearing the saved ETag, since it is not valid, so that it is not sent in subsequent curl commands:

https://github.com/jgstew/jgstew-recipes/blob/main/SharedProcessors/ClearFileXattr.py

This does NOT resolve the issue for GoToMeeting, since their CDN does not support ETags or Last-Modified correctly. The only way to resolve that case is to fetch the headers only, check in Python whether the download has changed, and then download the actual file if and only if it changed.
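A rough sketch of that headers-first flow, using only the Python standard library. It assumes the previously seen headers are stored somewhere by the caller and passed in as `saved`; the function names and the choice of headers in `CHECKED` are hypothetical, and this is not an existing AutoPkg processor:

```python
import os
import shutil
import urllib.request
from typing import Callable, Dict, Iterable, Tuple

# Assumption: compare these headers locally instead of trusting the server's 304 logic.
CHECKED = ("last-modified", "content-length")


def fetch_headers(url: str) -> Dict[str, str]:
    """Issue a HEAD request and return the response headers with lowercase keys."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return {k.lower(): v for k, v in resp.headers.items()}


def download_file(url: str, dest: str) -> None:
    """Stream the full response body to `dest`."""
    with urllib.request.urlopen(url, timeout=30) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


def download_if_changed(
    url: str,
    dest: str,
    saved: Dict[str, str],
    checked: Iterable[str] = CHECKED,
    fetch: Callable[[str], Dict[str, str]] = fetch_headers,
    download: Callable[[str, str], None] = download_file,
) -> Tuple[bool, Dict[str, str]]:
    """Fetch headers only, compare locally, and download only on a change."""
    current = fetch(url)
    changed = any(saved.get(name) != current.get(name) for name in checked)
    # Also download when nothing is on disk yet, whatever the headers say.
    should_download = changed or not os.path.exists(dest)
    if should_download:
        download(url, dest)
    return should_download, current
```

The caller would persist the returned headers (for example next to the downloaded file) and pass them back as `saved` on the next run. The `fetch` and `download` parameters exist so the comparison logic can be exercised without network access.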