iterative / dvc.org

📖 DVC website and documentation
https://dvc.org
Apache License 2.0
336 stars 393 forks source link

improve link checker (possible rewrite) #1838

Open shcheklein opened 4 years ago

shcheklein commented 4 years ago

Link checker seems to be broken in certain scenarios, has false positives, noisy reporting:

Test on fork, test of force push.

shcheklein commented 4 years ago

Another failure today: https://github.com/iterative/dvc.org/actions/runs/289778559

And I got an email from my name.

iesahin commented 3 years ago

There are hardcoded links in the form of https://dvc.org/doc/... in blogs. When an item changes in doc and not yet deployed, these are checked from the deployed version in dvc.org. This may be related to some failures.

I can update all links to https://dvc.org/doc to internal links in the form of /doc/.

shcheklein commented 3 years ago

https://dev.to/bnb/markdown-link-checking-in-github-with-actions-5dp0 - can be useful potentially

rogermparent commented 3 years ago

https://dev.to/bnb/markdown-link-checking-in-github-with-actions-5dp0 - can be useful potentially

Great resource! Making our own link check is shockingly tough, so leaning on something like Linkinator would be better for everyone if we can do it.

rogermparent commented 3 years ago

I think it's worth noting that Check links with linkcheck is the most starred link check action on the marketplace by far, and it claims feats such as "It is 40 times faster than the only tool that goes to the same depth (linkchecker).".

Once we start work on this, I think it's worth following the stars and giving linkcheck the first attempt, but the retry option in linkinator looks appealing and doesn't seem to have a counterpart in linkcheck, so we should give that a trial to see if it offers advantages we want that are worth possible comparative drawbacks.

jorgeorpinel commented 3 years ago

This is the highest priority issue (somewhat) related to the website. I put it at the top in the project for now. Are there plans to work on it? Should we change the priority otherwise?

Or are there any other website/engine-related issues you have under your radar, e.g. maybe iterative/dvc.org#1063, iterative/gatsby-theme-iterative#156 (would be my personal pick as I keep bumping with that), iterative/dvc.org#2394, iterative/dvc.org#1115, or anything else from here ? Cc @rogermparent @julieg18 thanks

rogermparent commented 3 years ago

I don't think this has even been considered for a while, so I suppose it's not quite a p0.

Unless we decide to pick this up, it may stand to be demoted to p1- the link checker is considered important, so hopefully it doesn't fall by the wayside.

If someone were to pick the link checker back up and fix its core issues described in the OP, it may warrant a rewrite that leans more into the CLI interface and lets programs use that- using GHA Checks directly caused 90% of pain on the original run of that project.

iesahin commented 3 years ago

I can take a look at this after the dataset/scenario updates, in two weeks.

iesahin commented 3 years ago

It looks like the Rust library behind

https://github.com/lycheeverse/lychee-action

is modifiable even if the features don't cover our use cases.

iesahin commented 3 years ago

I can also use the tool @rogermparent mentioned. It's a Dart library and though may not cover everything we need, I can use it in a custom library.

If you would like a link-checker from scratch, I can write it too.

rogermparent commented 3 years ago

@iesahin My plan of attack would probably be:

  1. Attempt a few already-made link checker actions (lychee-action and filiph/linkcheck). If one seems to perform well and reasonably cover our use case, just use it.
  2. If all viable solutions from step 1 still leave us wanting some extra feature, attempt to wrap and/or add the features we want to an existing library.
  3. As a last resort, write our custom logic around an existing link check library (like Lychee)
  4. No matter what, drop the responsibility of maintaining the actual link checking logic because the internet is a nightmare world of inconsistency.

While filiph/linkcheck has more stars than Lychee, I can see Lychee possibly having unique features from the old checker that we want because, as far as I understand, linkcheck parses built HTML and Lychee parses source files like Markdown.

iesahin commented 3 years ago

These steps are similar to what I planned. @rogermparent

I'll test these 2 first and see if they are good enough. I think a thin wrapper around them may be needed at most.

julieg18 commented 2 years ago

I started looking into other link checkers and it looks like lychee, by far, works the best. It has the most customization options and can scan several different types of files along with actual websites. As for how we should use it, I'll test out lychee-action on scanning the entire site and scanning pr differences. We may be able to use it or we'll need to write custom code that manipulates the inputs/outputs of the action.

🤔 Though @rogermparent, are we wanting to replace our custom link-check entirely or have it use something like lychee?

rogermparent commented 2 years ago

are we wanting to replace our custom link-check entirely or have it use something like lychee?

My preference would be to use lychee as much as possible, but some more advanced features like running on git diffs and the whole "allow a link to fail once before reporting it" feature will probably require us to write some sort of wrapper extending lychee's features.

That said, we can stand to change/omit some aspects of the current link check that are harder to replicate- for example, instead of diff mode grabbing links directly from the diff we can just check all links in modified files, since IIRC lychee's language parsing only works on whole files and it's fast enough that the difference between solely added links and the rest of the page shouldn't be a problem.

julieg18 commented 2 years ago

My preference would be to use lychee as much as possible, but some more advanced features like running on git diffs and the whole "allow a link to fail once before reporting it" feature will probably require us to write some sort of wrapper extending lychee's features.

That said, we can stand to change/omit some aspects of the current link check that are harder to replicate- for example, instead of diff mode grabbing links directly from the diff we can just check all links in modified files, since IIRC lychee's language parsing only works on whole files and it's fast enough that the difference between solely added links and the rest of the page shouldn't be a problem.

Makes sense that a wrapper would be needed! Though I'm inexperienced in this and I'm not sure what steps to take to create this "wrapper"...

First thing that comes to mind is running lychee inside of our link-check code. But is there a way to have our nodejs-based link-check run a command-line utility like lychee? I know we can run terminal commands like with exec but how would we install lychee in the first place 🤔

The other way I can think of is to use lychee-action and write steps before and after the action runs, creating the desired inputs and outputs within one github action file. But that sounds more complex than having link-check just use lychee...

Does anyone know of the best way to create this "wrapper"? cc @iterative/websites

rogermparent commented 2 years ago

Does anyone know of the best way to create this "wrapper"?

The best wrapper would be no wrapper, followed by as minimal a wrapper as possible. Hopefully we'll be able to use lychee-action since it should be able to handle downloading lychee into the CI with any required setup. What we do around that depends on the feature in question.

A few examples:

If we want to run only on files added in the diff, I would think the ideal implementation would be a simple GitHub Actions workflow that gets a list of those added files via git diff --name-status and passes those to lychee-action.

For something like keeping track of consecutive fails, we could run lychee-action and just operate on its output file, using our code to simply keep a persistent count of consecutive fails and handles reporting when the amount gets unacceptable.

If we wanted to limit checked links to only those in the diff and ignore other links in modified files, we may be able to use lychee --dump to gather links without checking and pre-process them before passing the list back to lychee via lychee-action. I don't think this particular one would be necessary, but it's an option.

There's also the option of adding lychee to our build process instead of having it in a parallel Actions CI, the primary advantages being having the public folder accessible should we decide to check against build output rather than source files as well as not needing to deal with the deployment_status GHA event nor whatever is required for other build systems.

If we must drop down into a node script that executes lychee, you're correct that installing it could be an issue. From what I can tell, lychee-action actually installs lychee with a Docker image, so that's one option and probably the best way to go since lychee doesn't appear to be in the Ubuntu repos.

julieg18 commented 2 years ago

The best wrapper would be no wrapper, followed by as minimal a wrapper as possible. Hopefully we'll be able to use lychee-action since it should be able to handle downloading lychee into the CI with any required setup. What we do around that depends on the feature in question.

Thanks for the info and examples! I'll start working on a github action that works with lychee-action.

julieg18 commented 2 years ago

Experimented with lychee-action! Looks like we'll be able to use it for checking prs (with the help of a github action like get-changed-files) and the entire repo!

But it has a couple bugs:

The main issue is the 429, and I've got a couple of ideas on how we could fix this:

rogermparent commented 2 years ago
  • some github links fail due to 429 Too many requests errors. It also appears that we get some meetup and notion link failures due to this as well.

It's unfortunate that Lychee seemingly doesn't have per-URL configuration; we got around this with the current link-check by limiting concurrency to 1 and rate-limiting to one GH check per second, but doing so with Lychee would require we either wrap it or limit the rate/concurrency of all links.

Looks like there's another workaround for GitHub specificially where you can use an access token to authenticate and ease the rate limiting greatly. This works for GitHub specifically, but thinking about this happening with other sites is a little worrying.

  • fake links (ones found in code examples for example) are checked, though I think our link-checker has the same issue

We only handle this with the current link checker by using a lot of exclusions, which should be able to translate into Lychee well enough

julieg18 commented 2 years ago

Looks like there's another workaround for GitHub specificially where you can use an access token to authenticate and ease the rate limiting greatly. This works for GitHub specifically, but thinking about this happening with other sites is a little worrying.

We are using the GITHUB_TOKEN that is provided when it comes to gh actions, but would a different one change the outcome?

rogermparent commented 2 years ago

We are using the GITHUB_TOKEN that is provided when it comes to gh actions, but would a different one change the outcome?

I didn't notice that, I see the lychee docs suggest using the GITHUB_TOKEN from Actions so I'd imagine it's supposed to work. Maybe a more proper token would give us more leeway but I doubt it. Might be worth a try if we get a bot token for the automatic publish Issue.

rogermparent commented 2 years ago

I think a big improvement we can get is focusing on CLI, as we can go from there to any other reporting mechanism like GitHub Actions and simplify away the issue of having multiple build types for CLI and GHA as well as possible usage on other platforms.