lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0
2.18k stars, 132 forks

Enable excluding files / directories #470

Closed: norswap closed this issue 3 weeks ago

norswap commented 2 years ago

This might be a case of me being particularly dense, but there doesn't seem to be a way to exclude directories and files from being parsed by lychee (other than by not including them in the inputs). The existing exclude flags are all about patterns of links not to consider.

Why this matters: on my machine, lychee is quite slow at trudging through e.g. .git and node_modules to look for files that either aren't there or I don't want checked. (There are external reasons why it's slow, not least of which is using WSL.)

Still, as is, I'm forced to write:

lychee --exclude-mail README.md "./specs/**/*.md" "./meta/**/*.md" "./opnode/**/*.md"

What I'd like to write:

lychee --exclude-mail --exclude-dir .git node_modules -- **/*.md

Ideally the "file patterns" should work just like .gitignore.

mre commented 2 years ago

We also touch on that in https://github.com/lycheeverse/lychee/issues/418. I agree that there needs to be a solution. ripgrep excludes .git and node_modules by default, which sounds sensible to me. Then in your case it would be:

lychee --exclude-mail .

(Not exactly, because it would check html files as well, but that could be configurable as well.)

--exclude-dir might be a bit too narrow, because one might also want to exclude files. Then --exclude-path makes sense, because it covers both, but we also have --exclude-file, which is currently interpreted as a file with regex patterns for excluding URLs. That was a misnomer and will be deprecated in favour of --use-ignore-file. With that, --exclude-path could work and support both directories and files.

lebensterben commented 2 years ago

@mre I fully agree that lychee should be similar to ripgrep when dealing with hidden directories.

It may even by default ignore anything in ".gitignore".

san-slysz commented 2 years ago

I'm in the same situation: I wanted to ignore (at least) node_modules. Ignoring everything listed in .gitignore would make sense to me.

aerfio commented 2 years ago

This feature would be really helpful. Right now I do something like

git ls-files '*.md' | xargs -n 1 lychee --

but I'd prefer some kind of way to ignore whole directories
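For files that are not tracked by git, a similar workaround can be sketched with find and -prune. This is only an illustration of the idea: run_on_md is a hypothetical helper name, not part of lychee, and the skipped directory names are the ones mentioned in this thread.

```shell
# run_on_md DIR COMMAND [ARGS...]
# Runs COMMAND on every *.md file below DIR, without descending into
# .git or node_modules. Hypothetical helper, not a lychee feature.
run_on_md() {
  dir=$1
  shift
  # -prune stops find from entering the listed directories;
  # -exec ... {} + passes file names safely, even with spaces.
  find "$dir" \( -name .git -o -name node_modules \) -prune \
    -o -type f -name '*.md' -exec "$@" {} +
}

# Usage (assumes lychee is installed):
#   run_on_md . lychee --
```

Because the pruning happens in find, the slow directories are never traversed, which is the part of the original request that matters on WSL.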

mre commented 1 year ago

Update

--exclude-path exists now. It allows excluding files and directories from being checked.

Usage example based on the original request above:

lychee --exclude-path node_modules .git -- .

Regex patterns are supported. Some more info on the lychee website

@norswap, as a side note, did you know that there's a Windows executable build? It could help with performance issues, because you could avoid the WSL virtualization layer.

TODO

norswap commented 1 year ago

Great to hear!

I was weak and purchased a mac :D But I think generally execution isn't really the problem with WSL performance, it's file system accesses.

aj-stein-nist commented 1 year ago

> Regex patterns are supported. Some more info on the lychee website

I may have to file a potential bug, but first: we are big fans of lychee. I spent the last two days, in between other tasks, unable to get regex working at all with --exclude-path, so I need to check with all of you whether I am using it correctly.

mre commented 1 year ago

Sweet. I've added some more examples to the docs to help you get started. Feel free to add a comment here if you run into an issue.

aj-stein-nist commented 1 year ago

> Sweet. I've added some more examples to the docs to help you get started. Feel free to add a comment here if you run into an issue.

OK, I will not finish the draft of the separate bug report I was writing; I will move it here and edit it. We want to dynamically add or remove a collection of web pages per directory, and the directory name is based upon git branch or tag names. I only bring this up because I cannot use .lycheeignore or the config method, as the directories will be generated dynamically. We want to filter them out, and --exclude-path=site/public/models would be best. This is the directory structure; see usnistgov/OSCAL-Reference#23 for fuller details and the current development branch.

https://github.com/usnistgov/OSCAL-Reference/tree/fb13809fea9baf44b9d0f341f694ee1dae66e864/site

The Makefile, executed one directory above site (./site/..), renders the source content into site/public. We scan links from there.

When cloning our code, I attempted to configure different variations of * before and after the relative path.

lychee --exclude-file ./support/lychee_ignore.txt \
  --config ./support/lychee.toml \
  --output lychee_report.md \
  --verbose --format markdown \
  --exclude-path="site/public/models*" \
  site/public/**/*.html

lychee --exclude-file ./support/lychee_ignore.txt \
  --config ./support/lychee.toml \
  --output lychee_report.md \
  --verbose --format markdown \
  --exclude-path="*site/public/models*" \
  site/public/**/*.html

lychee --exclude-file ./support/lychee_ignore.txt \
  --config ./support/lychee.toml \
  --output lychee_report.md \
  --verbose --format markdown \
  --exclude-path="*/models*" \
  site/public/**/*.html

I still see lychee linkcheck failures that are under site/public/models/develop and other subdirectories that should be excluded. Once I switch to not using (regex?) wildcards and for example exclude site/public/models/develop without wildcards like so, it works fine.

lychee --exclude-file ./support/lychee_ignore.txt  \
  --config ./support/lychee.toml \
  --output lychee_report.md \
  --verbose --format markdown \
  --exclude-path="site/public/models/develop" \
  site/public/**/*.html

I am using lychee v0.13.0. Do I have to use the WIP version of develop to test this feature working properly?

mre commented 1 year ago

Are you mixing up regex with glob, maybe? Instead of models*, can you try models.*?
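To make the difference concrete, here is a small sketch that uses grep -E as a stand-in regex engine (it is not lychee's matcher, and matches_whole is a hypothetical helper). It forces a whole-string match so the difference is visible; whether lychee anchors its patterns is not stated in this thread, and an unanchored search would also find models* somewhere inside the path.

```shell
# matches_whole PATTERN PATH
# Succeeds if the extended regex PATTERN matches the whole PATH.
# Hypothetical helper for illustration; uses grep, not lychee.
matches_whole() {
  printf '%s\n' "$2" | grep -Eqx -e "$1"
}

path='site/public/models/develop/index.html'

# As a regex, "models*" means "model" followed by zero or more "s".
matches_whole 'site/public/models*' "$path" || echo "models*  : no full match"

# "models.*" means "models" followed by any characters.
matches_whole 'site/public/models.*' "$path" && echo "models.* : full match"
```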

aj-stein-nist commented 1 year ago

> Are you mixing up regex with glob, maybe? Instead of models*, can you try models.*?

I see what I did there, apologies. Ah, well in that case, I'll go back to test and report back later. 😬

askalski85 commented 11 months ago

Hi folks, I am playing around with the --exclude-path on a simple scenario

.
└── a
    ├── a.html
    └── b
        ├── b.html
        └── c
            ├── c.html
            └── d
                └── d.html

I am able to exclude a path using a/b or a file using a/b/b.html, but neither an asterisk * nor .* works for me as a wildcard for file or folder names.

lychee -vv --exclude-path 'a/b/.*' .
lychee -vv --exclude-path 'a/b/*' .

# same output for both
[./a/b/b.html]:
✗ [ERR] https://badlink.b.com/ | Failed: Network error: dns error: no record found for Query { name: Name("badlink.b.com.fritz.box."), query_type: AAAA, query_class: IN }

Same for 0.13.0 and for the nightly 11d8d448953d85300f2a5b22815cc0d310bd7ff2

mre commented 11 months ago

Try

lychee --dump --exclude-path 'a/b' .

askalski85 commented 11 months ago

lychee --dump --exclude-path 'a/b' .

tmp % lychee -vv --dump --exclude-path 'a/b' .
https://badlink.a.com/ (./a/a.html)
https://github.com/#a (./a/a.html)

Yes, this works. But the online documentation states I can do things like */dev/*, which apparently does not work:

tmp % lychee -vv --dump --exclude-path '*/b/*' .
https://github.com/#b (./a/b/b.html)
https://badlink.b.com/ (./a/b/b.html)
https://github.com/#c (./a/b/c/c.html)
https://badlink.c.com/ (./a/b/c/c.html)
https://github.com/#d (./a/b/c/d/d.html)
https://badlink.d.com/ (./a/b/c/d/d.html)
https://badlink.a.com/ (./a/a.html)
https://github.com/#a (./a/a.html)

mre commented 11 months ago

Well, the documentation mentions */dev/*, but I realized that it doesn't describe what it does. I think */dev/* is actually incorrect. It's not a regular expression to begin with. When I put it into a regex tester, I get errors like "Error: invalid target for quantifier". The reason is that the first * is a quantifier, which doesn't quantify anything. It's a glob pattern, but --exclude-path uses regex matching. So, we should remove the example.
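As a rough illustration of the two pattern languages, the sketch below matches the same path with a shell glob (via case) and with an extended regex (via grep -E). Both helper names are hypothetical, and grep only stands in for a regex engine here; it is not what lychee uses. The glob */dev/* is deliberately not passed to grep, since some grep implementations may treat a leading * literally instead of rejecting it.

```shell
# glob_match PATTERN PATH: shell glob semantics, where * matches any characters.
glob_match() {
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

# regex_match PATTERN PATH: unanchored extended-regex search via grep.
regex_match() {
  printf '%s\n' "$2" | grep -Eq -e "$1"
}

path='./site/dev/index.html'

glob_match  '*/dev/*'   "$path" && echo "glob  */dev/*   matches"
regex_match '.*/dev/.*' "$path" && echo "regex .*/dev/.* matches"
# With an unanchored search, the surrounding .* is not even needed.
regex_match '/dev/'     "$path" && echo "regex /dev/     matches"
```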

mre commented 11 months ago

Replaced the pattern with .*/dev/.*.

I haven't looked into why your original regex didn't work. I think it should (?). For future reference, here is the module that handles path exclusions: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/types/input.rs I can see that there are some missing cases in our unit tests; e.g. excluding files like foo.html.

If someone finds the time, I'd appreciate a pull request for adding more cases. Maybe there is a bug in the path exclusion handling (or we need to document it better).

askalski85 commented 10 months ago

FYI: the suggested .*/dev/.* also does not seem to work.

tmp % lychee -vv --dump --exclude-path '.*/b/.*' .
https://badlink.c.com/ (./a/b/c/c.html)
https://github.com/#c (./a/b/c/c.html)
https://badlink.b.com/ (./a/b/b.html)
https://github.com/#b (./a/b/b.html)
https://badlink.a.com/ (./a/a.html)
https://github.com/#a (./a/a.html)
https://badlink.d.com/ (./a/b/c/d/d.html)
https://github.com/#d (./a/b/c/d/d.html)

mre commented 10 months ago

Looks like it doesn't match the full path, but just the last part? Can you play around with various regexes that match the filename and extension? Like *.html or (a|b).html and c\.htm.? (It could also be related to the backslash escaping)
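The hypothesis above can be illustrated without lychee. The sketch below checks a pattern against both the full path and only its last component, again with grep -E as a stand-in regex engine; explain is a hypothetical helper. It only shows what each interpretation would predict and does not confirm how lychee actually matches.

```shell
# explain PATTERN PATH
# Reports whether PATTERN matches the full PATH and whether it matches
# only the last path component. Hypothetical helper, not part of lychee.
explain() {
  pattern=$1
  path=$2
  full=no
  last=no
  printf '%s\n' "$path" | grep -Eq -e "$pattern" && full=yes
  # ${path##*/} strips everything up to the last slash.
  printf '%s\n' "${path##*/}" | grep -Eq -e "$pattern" && last=yes
  echo "$path full=$full last=$last"
}

for p in ./a/a.html ./a/b/b.html ./a/b/c/c.html; do
  explain '.*/b/.*' "$p"
done
explain 'c\.htm.' ./a/b/c/c.html
```

If only the last component were matched, a pattern containing a slash such as .*/b/.* could never exclude anything, while a filename pattern such as c\.htm. still could.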

Aariq commented 3 months ago

Exclusion seems inconsistent. Here's a small section of output from lychee --glob-ignore-case --dump -vv --exclude-path renv './**/*.rmd' for this repo: https://github.com/Aariq/website-quarto

https://twitter.com/lifedispersing/status/650088228016508928/photo/1 (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd) [excluded]
https://cran.r-project.org/package=viridis (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
https://github.com/njsmith (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
https://github.com/stefanv (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
https://seaborn.pydata.org/tutorial/color_palettes.html (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
https://colorbrewer2.org/ (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
https://twitter.com/jalapic (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd) [excluded]
https://gist.github.com/jalapic/9a1c069aa8cee4089c1e (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
https://twitter.com/lifedispersing (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd) [excluded]
https://cran.r-project.org/package=dichromat (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
http://pbs.twimg.com/media/CQWw9EgWsAAoUi0.png (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
http://ftp.cpc.ncep.noaa.gov/GIS/GRADS_GIS/GeoTIFF/TEMP/us_tmax/us.tmax_nohads_ll_20150219_float.tif (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0199239 (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd)
https://twitter.com/jalapic/status/650120284901634048/photo/1 (renv/library/R-4.2/x86_64-apple-darwin17.0/viridis/doc/intro-to-viridis.Rmd) [excluded]

oponomarov-tu commented 2 months ago

Looks like --exclude-path does not work at all for both globs and regular expressions.

thomas-zahner commented 3 weeks ago

With release 0.16.0 we have released #1500. With this change, lychee by default ignores files that are ignored by git. This seemed like the most sensible default to us and aligns with ripgrep. The behaviour can be disabled with the --no-ignore flag. Hidden files are now also ignored by default (disable with --hidden). The behaviour should be identical to ripgrep, because we use the same crate as ripgrep, called ignore, for traversing files.

Thanks to ignore you can now also ignore files from scanning with .ignore files as with ripgrep. So if you want to ignore files with lychee (plus ripgrep) that aren't ignored by git, just add them to a .ignore file.
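A minimal sketch of creating such a file, assuming .ignore uses the same gitignore-style patterns as it does for ripgrep. The write_ignore helper is hypothetical, and the two entries are just directory names that came up earlier in this thread.

```shell
# write_ignore [DIR]
# Writes an example .ignore file into DIR (default: current directory).
# Hypothetical helper; the entries are examples, adjust them to your project.
write_ignore() {
  cat > "${1:-.}/.ignore" <<'EOF'
# gitignore-style patterns, one per line
renv/
site/public/models/
EOF
}

# Usage (assumes lychee 0.16.0 or later is installed):
#   write_ignore . && lychee .
```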

For full documentation of what files are ignored see: standard_filters. All standard filters - with the exception of hidden files (--hidden) - are disabled with --no-ignore.