sindresorhus / awesome-lint

Linter for Awesome lists
MIT License
611 stars 59 forks source link

Detect 404s and 301s #19

Closed fregante closed 6 years ago

fregante commented 7 years ago

Can this actually poll the links? I'm not sure how it would work on PRs (since problems can appear on existing content) but it can be a great way to keep lists up to date.

sindresorhus commented 7 years ago

See: https://github.com/sindresorhus/awesome-lint/blob/d583502e632eb940c6fd617763268ec0ff6e3534/config.js#L72

fregante commented 7 years ago

Nice! I suppose it used this one: https://github.com/wemake-services/remark-lint-are-links-valid

What should be the resolution? Won't add?

sindresorhus commented 7 years ago

Yes. It had too many false positives last time I tried, but they might have fixed the problems since then. Or maybe there's a better plugin we could use.

transitive-bullshit commented 6 years ago

I'll take a look at existing remark plugins, but I'm leaning towards creating a new plugin that focuses on robustness utilizing p-map and p-retry.

transitive-bullshit commented 6 years ago

@sobolevn @davidtheclark I did some quick experiments against the master awesome list which contains 432 links.

remark-lint-are-links-valid - Takes around 140 seconds to resolve all the links. remark-lint-no-dead-urls - Takes around 3 seconds to resolve all the links.

The implementation of remark-lint-no-dead-urls is significantly simpler as well, though it's missing two features that we would want in a robust link checker:

  1. Retry logic for individual link failures. I've worked with a ton of crawlers, and retry logic at the leaf level is very important for reducing false negatives.
  2. Concurrency control. Currently, it just tries to resolve all the links without limiting the maximum number of HTTP requests that can be going on at once. This will lead to random, difficult to debug errors that we don't want to deal with.

At this point, I'm inclined to either fork remark-lint-no-dead-urls if @davidtheclark agrees these are reasonable changes or just create a similar package inspired by remark-lint-no-dead-urls that is more robust.

davidtheclark commented 6 years ago

@transitive-bullshit I'd be happy to add you as a collaborator and integrate these changes into remark-lint-no-dead-urls. I didn't end up using it as much as I thought I might, so it's been a little stale, but I'm happy to revive it, and might end up using it more soon!

I'm actually interested right now in performing the same link checks in HTML, as well (and I'm sure others would be). So a more broadly useful library might be one that accepts a rehype/HAST tree and checks the links; the remark-lint-* plugin could use it behind the scenes (transform its MDAST tree into HAST then run the nodes through the HAST-checking function). Does that sound right, @wooorm?

cc @colleenmcginnis and @danswick who are interested in link-checking for HTML.

Anyway, @transitive-bullshit, if you want to talk implementation details maybe we should do it in an issue in remark-lint-no-dead-urls. Or if you want to fork and run your own way, I understand :)

transitive-bullshit commented 6 years ago

Thanks for the thorough response, @davidtheclark.

I'm thinking it would make sense to create a focused module that accepts an array of URLs and returns a Promise for a result map of Map<String, LinkCheck. LinkCheckResult> (LinkCheckResult). So the result would be a map from URL to liveness info. Since this module would encapsulate all the functionality for robust liveness checking, any html / markdown / etc consumers would become really simple.

We could call this module check-links or p-check-urls or something else?

cc @tcort

transitive-bullshit commented 6 years ago

I just published check-links that implements my previous suggestion.

@davidtheclark I'm going to prepare a PR to remark-lint-no-dead-urls in the next few days that uses check-links under the hood. Please let me know if you have any feedback 💯

sindresorhus commented 6 years ago

@transitive-bullshit You might want to use got directly instead of link-check. Then you get promise API and retry functionality for free, and there's not much code in link-check anyway, and we don't even need all of its functionality, like email checking.

wooorm commented 6 years ago

This looks really good, y’all!

Some checking module that would benefit both the remark and rehype ecosystem is also super valuable. But we can discuss that somewhere else, @davidtheclark!

transitive-bullshit commented 6 years ago

You might want to use got directly instead of link-check...

Good call; I just refactored to use got directly and created a PR for remark-lint-no-dead-urls.

Feedback welcome 💯 I'm hoping this approach with check-links will be generally useful for remark, rehype, and awesome-lint.