Open shcheklein opened 4 years ago
Another failure today: https://github.com/iterative/dvc.org/actions/runs/289778559
And I got an email from my name.
There are hardcoded links in the form of https://dvc.org/doc/...
in blogs. When an item changes in doc
and not yet deployed, these are checked from the deployed version in dvc.org
. This may be related to some failures.
I can update all links to https://dvc.org/doc
to internal links in the form of /doc/
.
https://dev.to/bnb/markdown-link-checking-in-github-with-actions-5dp0 - can be useful potentially
https://dev.to/bnb/markdown-link-checking-in-github-with-actions-5dp0 - can be useful potentially
Great resource! Making our own link check is shockingly tough, so leaning on something like Linkinator would be better for everyone if we can do it.
I think it's worth noting that Check links with linkcheck is the most starred link check action on the marketplace by far, and it claims feats such as "It is 40 times faster than the only tool that goes to the same depth (linkchecker).".
Once we start work on this, I think it's worth following the stars and giving linkcheck
the first attempt, but the retry
option in linkinator
looks appealing and doesn't seem to have a counterpart in linkcheck
, so we should give that a trial to see if it offers advantages we want that are worth possible comparative drawbacks.
This is the highest priority issue (somewhat) related to the website. I put it at the top in the project for now. Are there plans to work on it? Should we change the priority otherwise?
Or are there any other website/engine-related issues you have under your radar, e.g. maybe iterative/dvc.org#1063, iterative/gatsby-theme-iterative#156 (would be my personal pick as I keep bumping with that), iterative/dvc.org#2394, iterative/dvc.org#1115, or anything else from here ? Cc @rogermparent @julieg18 thanks
I don't think this has even been considered for a while, so I suppose it's not quite a p0.
Unless we decide to pick this up, it may stand to be demoted to p1- the link checker is considered important, so hopefully it doesn't fall by the wayside.
If someone were to pick the link checker back up and fix its core issues described in the OP, it may warrant a rewrite that leans more into the CLI interface and lets programs use that- using GHA Checks directly caused 90% of pain on the original run of that project.
I can take a look at this after the dataset/scenario updates, in two weeks.
It looks like the Rust library behind
https://github.com/lycheeverse/lychee-action
is modifiable even if the features don't cover our use cases.
I can also use the tool @rogermparent mentioned. It's a Dart library and though may not cover everything we need, I can use it in a custom library.
If you would like a link-checker from scratch, I can write it too.
@iesahin My plan of attack would probably be:
lychee-action
and filiph/linkcheck
). If one seems to perform well and reasonably cover our use case, just use it.While filiph/linkcheck
has more stars than Lychee, I can see Lychee possibly having unique features from the old checker that we want because, as far as I understand, linkcheck
parses built HTML and Lychee parses source files like Markdown.
These steps are similar to what I planned. @rogermparent
I'll test these 2 first and see if they are good enough. I think a thin wrapper around them may be needed at most.
I started looking into other link checkers and it looks like lychee
, by far, works the best. It has the most customization options and can scan several different types of files along with actual websites. As for how we should use it, I'll test out lychee-action
on scanning the entire site and scanning pr differences. We may be able to use it or we'll need to write custom code that manipulates the inputs/outputs of the action.
🤔 Though @rogermparent, are we wanting to replace our custom link-check entirely or have it use something like lychee
?
are we wanting to replace our custom link-check entirely or have it use something like
lychee
?
My preference would be to use lychee
as much as possible, but some more advanced features like running on git diffs and the whole "allow a link to fail once before reporting it" feature will probably require us to write some sort of wrapper extending lychee's features.
That said, we can stand to change/omit some aspects of the current link check that are harder to replicate- for example, instead of diff mode grabbing links directly from the diff we can just check all links in modified files, since IIRC lychee
's language parsing only works on whole files and it's fast enough that the difference between solely added links and the rest of the page shouldn't be a problem.
My preference would be to use lychee as much as possible, but some more advanced features like running on git diffs and the whole "allow a link to fail once before reporting it" feature will probably require us to write some sort of wrapper extending lychee's features.
That said, we can stand to change/omit some aspects of the current link check that are harder to replicate- for example, instead of diff mode grabbing links directly from the diff we can just check all links in modified files, since IIRC lychee's language parsing only works on whole files and it's fast enough that the difference between solely added links and the rest of the page shouldn't be a problem.
Makes sense that a wrapper would be needed! Though I'm inexperienced in this and I'm not sure what steps to take to create this "wrapper"...
First thing that comes to mind is running lychee
inside of our link-check
code. But is there a way to have our nodejs-based link-check
run a command-line utility like lychee
? I know we can run terminal commands like with exec
but how would we install lychee in the first place 🤔
The other way I can think of is to use lychee-action
and write steps before and after the action runs, creating the desired inputs and outputs within one github action file. But that sounds more complex than having link-check
just use lychee
...
Does anyone know of the best way to create this "wrapper"? cc @iterative/websites
Does anyone know of the best way to create this "wrapper"?
The best wrapper would be no wrapper, followed by as minimal a wrapper as possible. Hopefully we'll be able to use lychee-action
since it should be able to handle downloading lychee
into the CI with any required setup. What we do around that depends on the feature in question.
A few examples:
If we want to run only on files added in the diff, I would think the ideal implementation would be a simple GitHub Actions workflow that gets a list of those added files via git diff --name-status
and passes those to lychee-action
.
For something like keeping track of consecutive fails, we could run lychee-action
and just operate on its output file, using our code to simply keep a persistent count of consecutive fails and handles reporting when the amount gets unacceptable.
If we wanted to limit checked links to only those in the diff and ignore other links in modified files, we may be able to use lychee --dump
to gather links without checking and pre-process them before passing the list back to lychee
via lychee-action
. I don't think this particular one would be necessary, but it's an option.
There's also the option of adding lychee
to our build process instead of having it in a parallel Actions CI, the primary advantages being having the public
folder accessible should we decide to check against build output rather than source files as well as not needing to deal with the deployment_status
GHA event nor whatever is required for other build systems.
If we must drop down into a node script that executes lychee
, you're correct that installing it could be an issue. From what I can tell, lychee-action
actually installs lychee
with a Docker image, so that's one option and probably the best way to go since lychee
doesn't appear to be in the Ubuntu repos.
The best wrapper would be no wrapper, followed by as minimal a wrapper as possible. Hopefully we'll be able to use lychee-action since it should be able to handle downloading lychee into the CI with any required setup. What we do around that depends on the feature in question.
Thanks for the info and examples! I'll start working on a github action that works with lychee-action
.
Experimented with lychee-action
! Looks like we'll be able to use it for checking prs (with the help of a github action like get-changed-files) and the entire repo!
But it has a couple bugs:
429 Too many requests
errors. It also appears that we get some meetup and notion link failures due to this as well. The main issue is the 429, and I've got a couple of ideas on how we could fix this:
lychee
retries a link 3 times and waits 1 second before retrying. We could increase the retries and wait time to stop the 429
s but this would increase the wait time of the action since it would take longer to check actually broken links. But lychee
is pretty fast, the overall wait time still might not be too long 🤔 lychee-action
and retry the links ourselves with link-checker
or some other package that checks links. This could save us some time, but may overcomplicate things since we'd have to be running two actions for checking links 🤔
- some github links fail due to
429 Too many requests
errors. It also appears that we get some meetup and notion link failures due to this as well.
It's unfortunate that Lychee seemingly doesn't have per-URL configuration; we got around this with the current link-check by limiting concurrency to 1 and rate-limiting to one GH check per second, but doing so with Lychee would require we either wrap it or limit the rate/concurrency of all links.
Looks like there's another workaround for GitHub specificially where you can use an access token to authenticate and ease the rate limiting greatly. This works for GitHub specifically, but thinking about this happening with other sites is a little worrying.
- fake links (ones found in code examples for example) are checked, though I think our link-checker has the same issue
We only handle this with the current link checker by using a lot of exclusions, which should be able to translate into Lychee well enough
Looks like there's another workaround for GitHub specificially where you can use an access token to authenticate and ease the rate limiting greatly. This works for GitHub specifically, but thinking about this happening with other sites is a little worrying.
We are using the GITHUB_TOKEN
that is provided when it comes to gh actions, but would a different one change the outcome?
We are using the
GITHUB_TOKEN
that is provided when it comes to gh actions, but would a different one change the outcome?
I didn't notice that, I see the lychee docs suggest using the GITHUB_TOKEN
from Actions so I'd imagine it's supposed to work. Maybe a more proper token would give us more leeway but I doubt it. Might be worth a try if we get a bot token for the automatic publish Issue.
I think a big improvement we can get is focusing on CLI, as we can go from there to any other reporting mechanism like GitHub Actions and simplify away the issue of having multiple build types for CLI and GHA as well as possible usage on other platforms.
Link checker seems to be broken in certain scenarios, has false positives, noisy reporting:
Ivan Shcheklein
Test on fork, test of force push.