lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0

Check for anchors in destination page? #1363

Closed: gbmhunter closed this issue 9 months ago

gbmhunter commented 9 months ago

Hi, first off, thanks for the work you have put into this tool. It appears to be one of the best link checkers out there!

I have a use case where I want to check not only that the URL is valid, but also that the anchor exists at the destination when one is provided in the URL. I set include_fragments = true in my lychee.toml, thinking this would check for anchors. I then inserted a link with an incorrect anchor into my static site to see if lychee would pick it up (http://localhost:1313/mathematics/geometry/triangles/#law-of-sinesDEBUG).

It didn't seem to work. The .lycheecache file says this URL returned a 200, even though there is no #law-of-sinesDEBUG anchor on the page (it's served by a dev server).

http://localhost:1313/mathematics/geometry/triangles/#law-of-sinesDEBUG,200,1706480444

Am I misunderstanding what include_fragments does, and if so, is there any way of checking the anchor exists?

mre commented 9 months ago

Hey @gbmhunter,

only anchor tags inside Markdown documents are supported right now. Checking anchors in URLs is harder. Here is a technical discussion of the problem space: https://github.com/lycheeverse/lychee/issues/185#issuecomment-1694670649

Essentially, this would require a bigger rewrite of some of the inner components, which is planned but has not started yet. A broader architecture discussion that ties into this is here: https://github.com/lycheeverse/lychee/issues/1252
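
To make the problem concrete, here is a rough sketch of what an anchor check on a fetched page involves, written in Python with the standard library only. It is not lychee code and not how lychee would implement it. The names AnchorCollector, fragment_exists and check_url are made up for illustration, and it assumes an anchor is any id attribute or any name attribute on an a tag in the static HTML.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collects every id attribute and every <a name=...> value in a page."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def fragment_exists(html, fragment):
    """Return True if `fragment` names an anchor in `html`.

    An empty fragment counts as present, since there is nothing to look up.
    """
    if not fragment:
        return True
    collector = AnchorCollector()
    collector.feed(html)
    # Fragments may be percent-encoded in the URL; ids in the HTML are not.
    return unquote(fragment) in collector.anchors


def check_url(url):
    """Hypothetical helper: fetch `url` and check its fragment, if any."""
    page_url, fragment = urldefrag(url)
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return fragment_exists(html, fragment)
```

Note that this only sees anchors present in the HTML as served. Anchors created by JavaScript after load would be missed, which is one reason the general case is harder than the Markdown one.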

gbmhunter commented 9 months ago

Thanks for the response! I also discovered (I think) that it won't check a website recursively? That is another part of my use case: to point it at the homepage of the site and have it "crawl" all pages within the domain based on the links it finds.

mre commented 9 months ago

Yes, correct. That will require architectural changes.

mre commented 9 months ago

The current workaround is to point it at the sitemap.
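
If the sitemap needs to be filtered or split up first, a small script can list the page URLs it contains. This is only a sketch in Python under the assumption that the site publishes a standard sitemaps.org XML sitemap. The function name sitemap_urls is made up, and how the resulting URLs are handed to lychee is left open.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

For a sitemap index (a sitemap that lists other sitemaps), the same function returns the child sitemap URLs, which would then need to be fetched and parsed in turn.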

gbmhunter commented 9 months ago

@mre thanks for the info!