lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0

Fails on escape characters in markdown link #1529

Closed. LitoMore closed this 2 weeks ago

LitoMore commented 1 month ago

How to reproduce

Use the Markdown below:

- <img height="14" src="https://cdn.simpleicons.org/simpleicons/_/eee"/> - https://cdn.simpleicons.org/simpleicons/_/eee
- <img height="14" src="https://cdn.simpleicons.org/simpleicons/eee/_"/> - [https://cdn.simpleicons.org/simpleicons/eee/\_](https://cdn.simpleicons.org/simpleicons/eee/_)]

It returns:

[README.md]:
     [404] https://cdn.simpleicons.org/simpleicons/
     [404] https://cdn.simpleicons.org/simpleicons/eee/

mre commented 1 month ago

An easier example test-case:

[Example Link](https://example.com/page\_with\_underscores)

Fixing it might be as simple as

Event::Start(Tag::Link { link_type, dest_url, .. }) => {
    match link_type {
        LinkType::Inline => {
            Some(vec![RawUri {
                text: unescape_url(&dest_url),
                element: Some("a".to_string()),
                attribute: Some("href".to_string()),
            }])
        }
        // ... handle other link types similarly
    }
}

// Helper function which removes escape characters from Markdown links
fn unescape_url(url: &str) -> String {
    url.replace("\\_", "_")
}

in the code here and in the other place where we set text: dest_url.to_string().

If anyone would like to create a pull request, I'd be thankful. Bonus points for adding the example above as a test case.
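For anyone picking this up, here is a rough standalone sketch of the helper, not lychee's actual code. It assumes a slightly broader rule than the `\_` replacement above: CommonMark allows any ASCII punctuation character to be backslash-escaped, so the sketch drops a backslash whenever it precedes one. The name `unescape_url` is taken from the snippet above; everything else is illustrative.

```rust
/// Hypothetical helper: removes Markdown backslash escapes from a URL.
/// Assumption: a backslash followed by ASCII punctuation is an escape
/// (as in CommonMark); any other backslash is kept as-is.
fn unescape_url(url: &str) -> String {
    let mut out = String::with_capacity(url.len());
    let mut chars = url.chars().peekable();

    while let Some(c) = chars.next() {
        if c == '\\' {
            if let Some(&next) = chars.peek() {
                if next.is_ascii_punctuation() {
                    // Drop the backslash and keep the escaped character.
                    out.push(next);
                    chars.next();
                    continue;
                }
            }
        }
        // Ordinary character, or a backslash that is not an escape.
        out.push(c);
    }

    out
}
```

With the test case above, `unescape_url(r"https://example.com/page\_with\_underscores")` should give `https://example.com/page_with_underscores`, and a URL without escapes comes back unchanged. One caveat: a URL that legitimately contains a backslash followed by punctuation would be altered too, so this is only safe on text known to come from Markdown source.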

LitoMore commented 1 month ago

Note that there are two errors, which means the first line also failed:

- <img height="14" src="https://cdn.simpleicons.org/simpleicons/_/eee"/> - https://cdn.simpleicons.org/simpleicons/_/eee

mre commented 2 weeks ago

Turns out it was an issue with the Markdown parser. More details in #1555.