Cache `ETag` and `Last-Modified` headers

laurelmay commented 3 years ago

Caching these headers has two significant benefits. First, it improves performance a bit, because we cache these fields along with the file hash. Second, it reduces how often the upstream servers need to send the full file in response to the GET request. The cache is preserved using the actions/cache@v2 Action. Reusing cached results is safe here because the goal of this check is only to find files whose hash has changed or that have disappeared.

Cache Preservation

The cache is preserved between executions using actions/cache@v2. The cache is kept as long as it has been accessed within the last 7 days, so we can rely on it as long as we run the lint at least about that often. We also need to specify a unique cache key for each execution, because when the Action gets an exact cache hit it doesn't write the cache back. Using a unique key each time, with a common restore key prefix, lets us restore the most recent cache and still write it back every time. If we do lose the cache, it's not a big deal: we just run with an empty cache and write it back at the end.
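The key scheme above can be illustrated with a small toy model. This is only a sketch of the behavior described in this PR, not how actions/cache is implemented; `ToyCacheStore`, `run_lint`, the `hashlint-` prefix, and `run_id` are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyCacheStore:
    """Toy stand-in for the Actions cache service (not the real implementation)."""

    # Insertion order doubles as creation order (oldest first).
    entries: dict[str, dict] = field(default_factory=dict)

    def restore(self, key: str, restore_prefix: str) -> tuple[Optional[dict], bool]:
        """Return (data, exact_hit), falling back to the newest prefix match."""
        if key in self.entries:
            return self.entries[key], True
        for saved_key in reversed(list(self.entries)):
            if saved_key.startswith(restore_prefix):
                return self.entries[saved_key], False
        return None, False

    def save(self, key: str, data: dict, exact_hit: bool) -> bool:
        """Write the cache back unless the restore was an exact hit."""
        if exact_hit:
            return False
        self.entries[key] = data
        return True


def run_lint(store: ToyCacheStore, run_id: int) -> dict:
    """Simulate one workflow run that uses a unique key per execution."""
    key = f"hashlint-{run_id}"  # hypothetical unique key
    data, exact_hit = store.restore(key, restore_prefix="hashlint-")
    data = dict(data or {})  # a lost cache just means starting empty
    data[f"url-{run_id}"] = "hash"  # pretend this run learned something new
    store.save(key, data, exact_hit)
    return data
```

With a unique key per run, every run restores through the prefix and saves a fresh entry. With a fixed key, the second run would get an exact hit and its updates would never be written back.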

Cache Contents

The cache stores the following attributes for each URL:

the ETag header returned in the response
the Last-Modified header returned in the response
the SHA1 hash of the response body

We preserve both the ETag and the Last-Modified headers because some servers (like Finch's) don't respond with an ETag. Keeping Last-Modified gives us a fallback validator, so we can still try to use the cached hash. The hash itself is preserved because a 304 response has no body, so we need to cache either the full body (which takes far more storage) or just the hash (which is far easier). We always preserve the hash of the actual response; we don't try to preserve the expected hash. This means that if you receive an invalid hash, the check should keep failing run after run (so long as the ETag doesn't change), because the hash you stored is the invalid one.
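A minimal sketch of how these three attributes could drive a conditional request is below. It is an assumption about the general shape, not the code in this PR: `CacheEntry`, `conditional_headers`, and `update_entry` are hypothetical names, and the response headers are assumed to arrive as a plain dict with canonical header casing.

```python
import hashlib
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CacheEntry:
    """Per-URL cache record; the field names are illustrative."""

    etag: Optional[str]
    last_modified: Optional[str]
    sha1: str


def conditional_headers(entry: Optional[CacheEntry]) -> dict[str, str]:
    """Build request headers from whichever validators were cached."""
    headers: dict[str, str] = {}
    if entry is None:
        return headers
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        # Fallback validator for servers that never send an ETag.
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def update_entry(
    entry: Optional[CacheEntry],
    status: int,
    response_headers: Mapping[str, str],
    body: bytes,
) -> CacheEntry:
    """Return the entry to store after a response."""
    if status == 304:
        # No body comes back with a 304, so the cached hash is all we have.
        if entry is None:
            raise ValueError("304 received without a cached entry")
        return entry
    # Always store the hash of what was actually received, valid or not.
    return CacheEntry(
        etag=response_headers.get("ETag"),
        last_modified=response_headers.get("Last-Modified"),
        sha1=hashlib.sha1(body).hexdigest(),
    )
```

The hash returned from `update_entry` would then be compared against the expected hash on every run, which is why a bad download keeps failing until the server's validators change.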

Command Output

This adds an additional line at the beginning and the end of the script execution that gives information on the data that was read from and written to the cache. This data should be fairly static between executions unless there's a cache miss. It should be helpful to always have the output for debugging in case we run into a cache issue.

Cache Location

The cache is stored at ~/.cache/hashlint/cache.json. This keeps it in a directory that should still be accessible, or that can be created, when executing the script locally. For simplicity, we don't try to pull in XDG Dirs configuration.
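Reading and writing a cache at that location could look roughly like the sketch below. The function names and the choice to treat a missing or unreadable file as an empty cache are assumptions for illustration, not necessarily what the script does.

```python
import json
from pathlib import Path

# Location named in this PR; XDG settings are deliberately ignored.
CACHE_PATH = Path.home() / ".cache" / "hashlint" / "cache.json"


def load_cache(path: Path = CACHE_PATH) -> dict:
    """Load the cache, treating a missing or corrupt file as an empty cache."""
    try:
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_cache(cache: dict, path: Path = CACHE_PATH) -> None:
    """Write the cache, creating the directory if it does not exist yet."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        json.dump(cache, handle, indent=2, sort_keys=True)
```

Passing `path` explicitly keeps the functions easy to exercise against a temporary directory without touching the real cache.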

laurelmay commented 3 years ago

Did a quick update to fix a type hint and to update the commit message which dismissed review. Also realizing I mistakenly made a branch with the same name on this repo when working on this yesterday. That probably needs to be deleted.

laurelmay commented 3 years ago

I am good with this, but I want to wait to merge it until we'll be able to successfully hit all 6 items on the first go. Eclipse seems to be having an outage today, so let's plan to merge after https://www.eclipsestatus.io goes green.

jmunixusers / cs-vm-build