lycheeverse / lychee

⚑ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0

Response caching? #163

Closed: polarathene closed this issue 2 years ago

polarathene commented 3 years ago

I've seen another link checking tool mention response caching:

> URL request results will be cached when true. This will ensure that each unique URL will only be checked once.

Which is also paired with a cache age option:

> Default Value: `3_600_000` (1 hour). The number of milliseconds in which a cached response should be considered valid. This is only relevant if the cacheResponses option is enabled.

I don't see it documented here whether responses are implicitly cached by default. If not, is this something that could help?

I did see this discussion about an extractor pool, one benefit of which would be de-duplicating URL checks. That's a similar benefit, I guess, although if there were a related cache data file (with an expiry timestamp), it could be persisted across runs with CI cache tooling.

That would be useful for PRs that trigger tests each time commits are pushed to update the PR, reducing concerns about rate limits being hit, since a URL that was reachable on one run (ignoring internal ones, if local filesystem URIs are later supported) would probably still be reachable within a given duration between runs.

mre commented 3 years ago

Oh, that's a great idea! We don't cache responses yet. I didn't even think about using a file for storing these responses, but I'm all for it. We could use sled for that, which looks like a perfect fit as it's a file-based, high-performance key-value store written in Rust, but it doesn't support key expiry out of the box. We'd either have to implement that ourselves or look for a different crate. In any case, a pull request would be amazing. 😎

polarathene commented 3 years ago

You could store the expiry in seconds (it probably just needs to be an arg, actually) in the file store, along with a Unix timestamp, to know whether the file should be used as a cache or discarded.

I haven't written Rust in a while, but `std::time::SystemTime::duration_since()` looks like it might be sufficient. Otherwise I guess there's `chrono`, which isn't too heavy (AFAIK) to include for UTC ISO 8601 string storage and convenience methods?
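
For example, a minimal sketch of that check using only `std` (the function name and signature are just illustrative, not anything that exists in lychee):

```rust
use std::time::{Duration, SystemTime};

/// Returns true if a cache file created at `created` is still fresh
/// within the given expiry window (e.g. 3600 seconds).
fn cache_is_fresh(created: SystemTime, expiry_secs: u64) -> bool {
    match SystemTime::now().duration_since(created) {
        Ok(age) => age < Duration::from_secs(expiry_secs),
        // The system clock went backwards; safest to discard the cache.
        Err(_) => false,
    }
}
```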

The file can be cached easily enough in CI; e.g. with GitHub Actions, a prior step would be:

```yaml
- name: Cache Responses
  uses: actions/cache@v2
  with:
    path: <path to cache file from lychee>
    key: lychee.cache
```

The lychee-action could also leverage that internally (actions/cache is available as an NPM package) and pass in some input args (or a hash of them) if desirable, which could help with cache invalidation, I guess (since the file is cached/restored by CI instead of being committed to the repo, where you'd normally get a hash of the file contents).


EDIT:

> but it doesn't support key expiry out of the box.

I just realized you mentioned key expiry specifically as the concern! 😅

I had a more naive approach in mind: just caching the status of URLs to disk to persist between CI runs, and invalidating the cache when `now > (before + expiry)`. The expiry is tied to the cache file itself rather than to individual requests/URLs.

That is, the `before` value is from the original run that created the cache file, and `now` is the time of a subsequent run. You could update the file with new URLs to cache and that would still be fine; they might be more short-lived, but the global expiry value ensures that they're still cached effectively without complicating things with per-URL expiry offsets (and a larger file).

An expiry of 3600 would just mean that any URL newly cached at the 3599-second mark expires and needs to be checked again, and that only matters if there is another run within the following duration (an hour); otherwise every URL needs to be re-checked anyway. So you still have a guarantee that successful URLs won't be checked more than twice within the expiry duration window.


This caching behavior may cause some false positives: depending on activity, links (internal or external) may actually have become unreachable without being caught. For a PR I think that is uncommon, though, and CI can do a full scan without the cache on a related event, such as a reviewer approving the PR for merge.


> In any case, a pull request would be amazing. 😎

I'd give it a shot, but I'm pretty swamped as-is right now 😞

Just wanted to detail it here, as lychee looks quite nice and the idea might interest someone else.

mre commented 3 years ago

> I'd give it a shot, but I'm pretty swamped as-is right now 😞

No worries. Thanks for the detailed write-up; I'm sure it will help implement the functionality in the future.

I'm all in favor of keeping this as simple as possible.

I don't even think we'd need sled for that. Perhaps just a serialized version of the `Status` struct + a timestamp would do. 😊
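
Something like the following could be enough. This is just a sketch, with an illustrative simplified status enum rather than lychee's actual `Status` type, assuming serde (plus e.g. serde_json) for the on-disk format:

```rust
use std::collections::HashMap;
use std::time::SystemTime;

use serde::{Deserialize, Serialize};

/// Simplified stand-in for the response status (illustrative only).
#[derive(Serialize, Deserialize)]
enum CachedStatus {
    Ok(u16),
    Failed(u16),
    Timeout,
}

/// On-disk cache: one creation timestamp for the whole file (global expiry),
/// plus the last known status per URL.
#[derive(Serialize, Deserialize)]
struct ResponseCache {
    created: SystemTime,
    entries: HashMap<String, CachedStatus>,
}
```

Loading that at startup, discarding it wholesale when `created` is older than the configured expiry, and writing it back at the end of a run would keep the invalidation logic file-global, as discussed above.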

mre commented 2 years ago

An implementation of the things discussed, plus some additional comments, is here: #443.

mre commented 2 years ago

I'll close this issue because the other one contains my TODO list and the PR code. Let's move the discussion over there as well. Thanks for the initial idea! 😊