lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0

bug: anchor/fragment detection doesn't appear to work #1457

Open sxlijin opened 5 months ago

sxlijin commented 5 months ago

When checking https://docs.boundaryml.com, muffet detects when an anchor doesn't exist:

https://docs.boundaryml.com/docs/syntax/client/client
        400     https://mintlify.com?utm_campaign=poweredBy&utm_medium=docs&utm_source=docs.boundaryml.com
        403     https://platform.openai.com/docs/models
        431     https://twitter.com/boundaryml
        id #L20 not found       https://github.com/anthropics/anthropic-sdk-python/blob/fc90c357176b67cfe3a8152bbbf07df0f12ce27c/src/anthropic/types/completion_create_params.py#L20
        id #L28 not found       https://github.com/openai/openai-python/blob/9e6e1a284eeb2c20c05a03831e5566a4e9eaba50/src/openai/types/chat/completion_create_params.py#L28
        id #generate-a-chat-completion not found        https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-chat-completion

but lychee does not, even with fragment detection explicitly specified (presumably it's on by default):


boundary-website on  sam/new-language-pt1 via  v20.14.0 | [2] took 1s at 12:09:40
❯ lychee --include-fragments https://docs.boundaryml.com/docs/syntax/client/client
  64/64 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
Issues found in 1 input. Find details below.

[https://docs.boundaryml.com/docs/syntax/client/client]:
✗ [403] https://platform.openai.com/docs/models | Failed: Network error: Forbidden
✗ [403] https://mintlify.s3-us-west-1.amazonaws.com/gloo/_generated/favicon/browserconfig.xml?v=3 | Failed: Network error: Forbidden

🔍 64 Total (in 0s) ✅ 60 OK 🚫 2 Errors 💤 2 Excluded

boundary-website on  sam/new-language-pt1 via  v20.14.0 | [2] took 833ms at 12:09:50
❯ lychee --include-fragments https://docs.boundaryml.com/docs/
  68/68 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
Issues found in 1 input. Find details below.

[https://docs.boundaryml.com/docs/]:
✗ [403] https://mintlify.s3-us-west-1.amazonaws.com/gloo/_generated/favicon/browserconfig.xml?v=3 | Failed: Network error: Forbidden
✗ [404] https://github.com/BoundaryML/baml-examples/tree/main/fastapi-starter | Failed: Network error: Not Found

🔍 68 Total (in 1s) ✅ 64 OK 🚫 2 Errors 💤 2 Excluded
💡 There were issues with GitHub URLs. You could try setting a GitHub token and running lychee again.

cceckman commented 5 months ago

I was seeing a similar issue when checking local files. I learned from other issues that -v includes excluded links in the output. With this command line:

lychee --include-fragments --config lychee.toml -v public/

I got e.g.:

? [EXCLUDED] file:///<working directory>/public/writing/notes/index.html#procedural | Excluded

Workaround?

I modified my lychee.toml to include scheme = ["http", "https", "file"]; with that, the same command line checked the fragments.

This is quite counterintuitive, because I never specify file: URLs in my source. The file public/writing/notes/index.html contains <a href="#procedural"> -- it's Lychee that is creating the file: scheme.

My expectation would be that, if I specify a local path, all relative links would be checked; scheme would only apply to links that specify a scheme.
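
That expectation can be sketched as a small filter. This is only an illustration in Python, not lychee's actual code (lychee is written in Rust); the function name `should_check` and its parameters are hypothetical, and the rule it encodes is the behavior I would expect, not the current behavior.

```python
from urllib.parse import urlsplit


def should_check(raw_href: str, allowed_schemes: set[str]) -> bool:
    """Decide whether a link found in a local file should be checked.

    Hypothetical rule: the scheme allowlist applies only to links that
    spell out a scheme themselves. Relative links and bare fragments
    such as "#procedural" carry no scheme and are always checked.
    """
    scheme = urlsplit(raw_href).scheme
    if not scheme:
        # No scheme in the source: a relative path or a bare fragment.
        return True
    return scheme in allowed_schemes
```

With an allowlist of `{"http", "https"}`, this would check `#procedural` and `notes/index.html#procedural`, and would skip only links that are explicitly written as `file:` or `mailto:` URLs in the source.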

mre commented 2 months ago

This is a summary for anyone interested in submitting a pull request to fix this issue.

Bug Summary

lychee fails to detect missing anchors/fragments in remote URLs, even when fragment detection is explicitly enabled.

Reproduction Steps

Run the following command:

echo 'https://github.com/lycheeverse/lychee#non-existent-anchor' | lychee - -vvv --include-fragments

Expected: lychee reports the missing anchor.

Actual: lychee reports the link as OK:

✔ [200] https://github.com/lycheeverse/lychee#non-existent-anchor
🔍 1 Total (in 0s) ✅ 1 OK 🚫 0 Errors

Proposed Fix

Update lychee so that, when --include-fragments is set, it checks the fragment of each remote URL against the fetched page and reports an error if the anchor is missing.
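
For anyone picking this up, here is a minimal sketch of the check itself, written in Python for brevity and not taken from lychee's code base. It assumes the response body is HTML and treats a fragment as present if any element has a matching `id` or an `<a>` element has a matching `name`. The names `AnchorCollector` and `fragment_exists` are hypothetical. Anchors that a page creates client-side with JavaScript would not be found this way.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class AnchorCollector(HTMLParser):
    """Collects every value that can serve as a fragment target."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if value is None:
                continue
            # Any element's id, or the legacy <a name="..."> form.
            if key == "id" or (tag == "a" and key == "name"):
                self.anchors.add(value)


def fragment_exists(url: str, html: str) -> bool:
    """Return True if the URL has no fragment or the fragment is in html."""
    fragment = unquote(urlsplit(url).fragment)
    if not fragment:
        return True
    collector = AnchorCollector()
    collector.feed(html)
    return fragment in collector.anchors
```

In this sketch a URL such as `https://example.com/page#non-existent-anchor` would be reported as broken whenever the fetched body contains no element with that `id` or `name`, even though the page itself returns 200.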