jgstew / jgstew-recipes

For JGStew's AutoPkg Recipes and Processors
MIT License

TODO: improve URLDownloader to do etag check on prefetched headers #5

Closed: jgstew closed this issue 3 years ago

jgstew commented 3 years ago
  1. Check if download is new using prefetched headers
    1. Add a prefetch-headers flag, which prefetch_filename would imply.
    2. If headers are prefetched for any reason, prefer comparing the prefetched values in Python over sending the stored values to the server through a curl command.
    3. Have options to check for a new download based on any combination of etag, last modified, or content size.
  2. Store etag/lastmod using something other than xattr (better cross-platform support)
    1. Store it where?
    2. A ".info.json" file? downloaded_file.ext.info.json
  3. Implement a pure python alternative to URLDownloader with these features
    1. Python standard library only or requests or another library? (probably requests)
    2. Python2 compatibility or Python3 only? (probably Python3 only)
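
A minimal sketch of the comparison described in item 1, assuming the prefetched and stored header values are already available as dicts keyed by lowercase header name. The function name `is_download_new` and the `checks` parameter are hypothetical and are not part of AutoPkg.

```python
# Hypothetical helper, not AutoPkg API: decide locally whether a download is new.
SUPPORTED_CHECKS = {"etag", "last-modified", "content-length"}


def is_download_new(prefetched, stored, checks=("etag", "last-modified")):
    """Return True when the remote file appears to differ from the cached copy.

    prefetched: headers just fetched from the server (lowercase keys).
    stored: header values saved after the previous download (lowercase keys).
    checks: which headers to compare; any combination of SUPPORTED_CHECKS.
    """
    unknown = set(checks) - SUPPORTED_CHECKS
    if unknown:
        raise ValueError(f"unsupported checks: {sorted(unknown)}")

    compared = False
    for name in checks:
        old, new = stored.get(name), prefetched.get(name)
        if old is None or new is None:
            continue  # this header is missing on one side, so skip it
        compared = True
        if str(old).strip() != str(new).strip():
            return True  # any mismatch means the download is new

    # If nothing could be compared, assume the download is new (safe default).
    return not compared
```

Treating "nothing to compare" as new is a design choice in this sketch: it costs an extra download but never skips a changed file.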

Some remote servers don't properly support etag / last modified even if they provide them.

It would be ideal if you could prefetch the headers for the URL, then do the comparison in Python to determine whether the download is new, for the cases where the remote server does NOT handle this properly. With the check-file-size-only flag, you could also compare the Content-Length header instead of downloading the file first.
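
As a rough illustration of prefetching, the headers could be fetched with an HTTP HEAD request from the standard library. This is only a sketch: `prefetch_headers`, `normalize_headers`, and `size_changed` are hypothetical names, and it assumes the server answers HEAD requests, which not every server does.

```python
import urllib.request


def normalize_headers(pairs):
    """Lowercase header names so later lookups are case-insensitive."""
    return {name.lower(): value.strip() for name, value in pairs}


def prefetch_headers(url, timeout=30):
    """Fetch only the response headers for url using a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return normalize_headers(response.headers.items())


def size_changed(headers, cached_size):
    """Size-only check: compare Content-Length with the cached file size."""
    remote = headers.get("content-length")
    # A missing Content-Length cannot prove the file is unchanged.
    return remote is None or int(remote) != cached_size
```

With this shape, a size-only check would be `size_changed(prefetch_headers(url), os.path.getsize(path))`, with no download needed.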

It would also be ideal if URLGetter / URLDownloader had the option to use something other than xattr to store the etag/last modified header values, so that this part works correctly across platforms. One possible option is to read the values recorded in the last run receipt and compare them to the current values.
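
One way the xattr storage might be replaced is a JSON sidecar file next to the download, following the `downloaded_file.ext.info.json` naming floated in the list above. The helper names below are hypothetical, and the set of headers kept is an assumption.

```python
import json
from pathlib import Path

# Assumption: only these headers are worth remembering between runs.
KEPT_HEADERS = ("etag", "last-modified", "content-length")


def info_path(download_path):
    """Return the sidecar path, e.g. downloaded_file.ext.info.json."""
    path = Path(download_path)
    return path.with_name(path.name + ".info.json")


def save_info(download_path, headers):
    """Write the selected header values (lowercase keys) to the sidecar."""
    data = {key: headers[key] for key in KEPT_HEADERS if key in headers}
    info_path(download_path).write_text(json.dumps(data, indent=2), encoding="utf-8")


def load_info(download_path):
    """Read stored header values; return {} if the sidecar is missing or invalid."""
    try:
        return json.loads(info_path(download_path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
```

Unlike xattr, a plain file works the same on macOS, Linux, and Windows, at the cost of one extra file per download.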

Another option altogether is to implement a replacement for URLDownloader that works in pure Python instead of running cURL in a subprocess.
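
A minimal sketch of what a cURL-free download might look like. The list above leans toward requests; this version uses only the standard library to stay self-contained. The `download` function and its parameters are hypothetical, not the eventual processor.

```python
import shutil
import urllib.error
import urllib.request


def download(url, dest, etag=None, last_modified=None, timeout=60):
    """Download url to dest unless the server reports it unchanged.

    Returns the response headers as a dict with lowercase keys,
    or None when the server answers 304 Not Modified.
    """
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            with open(dest, "wb") as out:
                shutil.copyfileobj(response, out)  # stream to disk in chunks
            return {key.lower(): value for key, value in response.headers.items()}
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib raises HTTPError for 304 responses
            return None
        raise
```

The conditional headers only help when the server honors them. For servers that do not, the returned headers can still be compared locally against the values saved from the previous run.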

Related:

jgstew commented 3 years ago

After spending time with the URLDownloader and URLGetter source code, it might be easier to just reimplement this in pure python instead of trying to address these issues there, especially since the issues are across both processors.

jgstew commented 3 years ago

I ended up creating a mostly pure python downloader that doesn't use cURL:

I am also not sure exactly how to implement the prefetch_filename logic. Doing so would mean changing the point in the process at which the headers are prefetched, and I still need to figure out how to get the headers it requires at that point. Effectively, the headers are always prefetched, just later in the process.
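
For illustration only, here is one way a filename might be derived once the headers are available: use Content-Disposition if present, otherwise fall back to the last segment of the URL path. This is a sketch with a hypothetical function name and does not claim to match the lookup order of the existing curl-based prefetch_filename.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit


def filename_from_headers(headers, url):
    """Pick a filename from prefetched headers (lowercase keys) or the URL."""
    disposition = headers.get("content-disposition")
    if disposition:
        # Let the stdlib parse the header parameters, including quoting.
        message = Message()
        message["Content-Disposition"] = disposition
        name = message.get_filename()
        if name:
            return posixpath.basename(name)  # drop any directory components
    # Fallback: last segment of the URL path, percent-decoded.
    return posixpath.basename(unquote(urlsplit(url).path))
```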

jgstew commented 3 years ago

Considering this issue complete even though prefetch_filename still uses the old curl method.

Working to file a pull request to get this included in AutoPkgLib https://github.com/autopkg/autopkg/issues/734