Oh, that's a great idea! We don't cache responses yet. I didn't even think about using a file for storing these responses, but I'm all for it. We could use sled for that, which looks like the perfect fit as it's a file-based, high-performance key-value store written in Rust, but it doesn't support key expiry out of the box. We'd either have to implement that ourselves or look for a different crate. In any case, a pull request would be amazing. :sunglasses:
You could store the expiry in seconds (it probably just needs to be an arg, actually) within the file store, along with a unix timestamp, to know whether the file should be used as cache or discarded.
I haven't written Rust in a while, but `std::time::SystemTime::duration_since()` looks like it might be sufficient. Otherwise I guess there's chrono, which AFAIK isn't too heavy to include for UTC ISO 8601 string storage and convenience methods?
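In case it helps, here's a minimal sketch of that check with just the standard library, assuming the cache stores a unix timestamp in seconds next to each response and the expiry is a plain seconds value from an arg (`is_fresh` and the parameter names are hypothetical):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Returns true if a response cached at `cached_at` (unix seconds)
/// is still within the `expiry_secs` window.
fn is_fresh(cached_at: u64, expiry_secs: u64) -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the unix epoch")
        .as_secs();
    now < cached_at + expiry_secs
}

fn main() {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs();
    // Cached an hour ago with a 2h expiry: still usable.
    assert!(is_fresh(now - 3600, 7200));
    // Cached two hours ago with a 1h expiry: discard and re-check.
    assert!(!is_fresh(now - 7200, 3600));
}
```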
The file can be cached easily enough in CI, e.g. with GitHub Actions a prior step would be:
```yaml
- name: Cache Responses
  uses: actions/cache@v2
  with:
    path: <path to cache file from lychee>
    key: lychee.cache
```
The lychee-action could also leverage that internally (`actions/cache` is available as an NPM package) and pass some input args (or a hash of them) into the cache key if desirable, which could help with cache invalidation I guess? (As the file is cached/restored by CI instead of committed to the repo, where you'd normally get a hash of the file contents.)
EDIT:

> but it doesn't support key expiry out of the box.

I just realized you mentioned key expiry specifically as the concern! :sweat_smile:
I had a more naive approach in mind: just caching the status of URLs to disk to persist between CI runs, and invalidating the cache when `now > (before + expiry)`. The expiry is tied to the cache file itself rather than to individual requests/URLs.
That is, the `before` value is from the original run that created the cache file, and `now` is the time of a subsequent run. You could update the file with new URLs to cache and that would still be fine; they might be more short-lived, but the global `expiry` value ensures they're still cached effectively without complicating things with per-URL expiry offsets (and a larger file).
An `expiry` of `3600` would just mean any URLs newly cached at the `3599` mark expire and need to be re-checked, and that only matters if there is another run within the following duration (an hour); otherwise every URL needs to be re-checked anyway. Therefore you have a guarantee that successful URLs won't be checked more than twice within the `expiry` duration window.
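A sketch of that file-level expiry model, with hypothetical types (not lychee's actual API): the timestamp lives on the cache file, so a stale file invalidates every entry at once:

```rust
use std::collections::HashMap;
use std::time::{SystemTime, UNIX_EPOCH};

struct CacheFile {
    /// Unix seconds from the run that created the file (the `before` value).
    created_at: u64,
    /// URL -> whether it was reachable on the last check.
    entries: HashMap<String, bool>,
}

impl CacheFile {
    fn is_expired(&self, expiry_secs: u64) -> bool {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is set before the unix epoch")
            .as_secs();
        now > self.created_at + expiry_secs
    }

    /// `None` means the caller must check the URL again. New URLs may be
    /// appended after `created_at`, so in the worst case a URL cached just
    /// before expiry is checked twice within one window.
    fn lookup(&self, url: &str, expiry_secs: u64) -> Option<bool> {
        if self.is_expired(expiry_secs) {
            None // whole file is stale; discard rather than track per-URL ages
        } else {
            self.entries.get(url).copied()
        }
    }
}

fn main() {
    let mut cache = CacheFile { created_at: 0, entries: HashMap::new() };
    cache.entries.insert("https://example.com".into(), true);
    // created_at = 0 is far in the past, so any realistic expiry is stale:
    assert_eq!(cache.lookup("https://example.com", 3600), None);
}
```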
This caching behavior may cause some false positives: depending on activity, internal or external links may become unreachable without being caught. For a PR I think that's uncommon, and CI can do a full scan without the cache on a related event, such as a reviewer approving for merge.
> In any case, a pull request would be amazing. :sunglasses:
I'd give it a shot, but I'm pretty swamped as-is right now :disappointed:
Just wanted to detail it here, as `lychee` looks quite nice and it might interest someone else.
> I'd give it a shot, but I'm pretty swamped as-is right now :disappointed:
No worries. Thanks for the detailed write-up; I'm sure it will help with implementing the functionality in the future.
I'm all in favor of keeping this as simple as possible.
I don't even think we'd need sled for that. Perhaps just a serialized version of the `Status` struct + a timestamp would do.
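Something like this, perhaps — a sketch assuming the serde and serde_json crates as dependencies; the `Status` shape below is made up for illustration and is not lychee's actual type:

```rust
use serde::{Deserialize, Serialize};

// Illustrative only; lychee's real `Status` enum looks different.
#[derive(Serialize, Deserialize)]
enum Status {
    Ok(u16),     // e.g. a successful HTTP status code
    Failed(u16),
    Timeout,
}

#[derive(Serialize, Deserialize)]
struct CacheEntry {
    status: Status,
    /// Unix seconds when the URL was last checked.
    timestamp: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let entry = CacheEntry {
        status: Status::Ok(200),
        timestamp: 1_600_000_000,
    };
    // One serialized entry per URL would be enough for a flat cache file.
    std::fs::write("lychee.cache", serde_json::to_string(&entry)?)?;
    Ok(())
}
```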
An impl of the things discussed plus some additional comments is here: #443.
I'll close this issue because the other one contains my TODO list and the PR code. Let's move the discussion over there as well. Thanks for the initial idea!
I've seen another link checking tool mention response caching, paired with a cache age option.
I don't see it documented here whether responses are cached implicitly by default. If not, is this something that could help?
I did see this discussion about an extractor pool, with the benefit of de-duping URL checks. That's a similar benefit I guess, although if there were a related cache data file (with an expiry timestamp), it could be persisted across runs with CI cache tooling.
That would be useful for PRs triggering tests each time commits are pushed, reducing concerns about rate limits being hit, since a URL that was successful would probably still be successful within a given duration between runs (at least ignoring internal ones, if local filesystem URIs were later supported).