lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0
2.12k stars 128 forks source link

v0.9.0 regression: Unknown schemes throw errors #562

Closed MichaIng closed 2 years ago

MichaIng commented 2 years ago

We have some code blocks in our docs which contain schemes unknown to lychee but valid for other applications. Those were gracefully ignored before and now throw errors, e.g.:

Error: R] librespot:///usr/bin/librespot?name=mySpotify&devicename=SnapcastSpotify&disable_audio_cache=true | The given URI is invalid: librespot:///usr/bin/librespot?name=mySpotify&devicename=SnapcastSpotify&disable_audio_cache=true
Error: R] pipe:///tmp/mpd.fifo?name=myMPD&mode=read | The given URI is invalid: pipe:///tmp/mpd.fifo?name=myMPD&mode=read
Error: R] airplay:///usr/local/bin/shairport-sync?name=myAirport&devicename=SnapcastAirport&params=--configfile=/usr/local/etc/shairport-sync.conf | The given URI is invalid: airplay:///usr/local/bin/shairport-sync?name=myAirport&devicename=SnapcastAirport&params=--configfile=/usr/local/etc/shairport-sync.conf
Error: R] pipe:///tmp/mopidy.fifo?name=myMopidy&mode=read | The given URI is invalid: pipe:///tmp/mopidy.fifo?name=myMopidy&mode=read

Since lychee anyway catches those URLs by itself, while they are no explicit links, it doesn't make much sense to fail on them: "This looks like an URL, let me check it... but doesn't look like a valid URL, let me throw an error" 😃.

lebensterben commented 2 years ago

I agree. We should gracefully record those URLs as ignored for unsupported schemes.

MichaIng commented 2 years ago

I guess loose URI detection is generally done via word regex like ://. Probably it makes sense to detect via explicit scheme, not adding URIs with unknown/invalid schemes to the query in the first place, like (https?|ftp|...)://? In the past they were still handled, just added to excluded URLs internally, but not handling them at all seems more reasonable.

mre commented 2 years ago

I broke it when I added a check for valid URLs before passing them to reqwest in 45de5c763e962ffe4b96af5b3f8809b0b4989bc5. Before that, we relied on reqwest to return an error if a scheme was not supported. I've created a fix in #571.

mre commented 2 years ago

Change got merged to master now. We can discuss hiding unsupported links in a separate issue. However I'm inclined to keep the current behavior as described in the discussion in the PR, because I think it's helpful to see that links got detected even if they cannot be checked (at the moment).