lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0

Bug: lychee can not detect error relative url #1480

Closed awang-01 closed 1 month ago

awang-01 commented 3 months ago

For this site, https://awang-01.github.io/testing/, there is an image with src="testing/images/lychee.png" that I was expecting to fail, but it passes:

lychee -v https://awang-01.github.io/testing/
✔ [200] https://awang-01.github.io/testing/images/lychee.png

🔍 1 Total (in 0s) ✅ 1 OK 🚫 0 Errors

https://awang-01.github.iotesting/images/lychee.png should fail, but lychee automatically adds a / before the src.

mre commented 3 months ago

Just checked out your sample page, and it seems to work as expected.

When clicking on the link, it brings me to https://awang-01.github.io/testing/images/lychee.png, which seems to be the correct URL.

The page responds with a 404, though.

However, when I open it on the command-line with curl, it returns a 200:

 curl -vvv https://awang-01.github.io/testing/images/lychee.png

It gives me:

< HTTP/2 200
< server: GitHub.com
< content-type: image/png
< permissions-policy: interest-cohort=()
< last-modified: Wed, 07 Aug 2024 00:08:14 GMT
< access-control-allow-origin: *
< strict-transport-security: max-age=31556952
< etag: "66b2baee-176be"
< expires: Wed, 07 Aug 2024 00:33:29 GMT
< cache-control: max-age=600
< x-proxy-cache: MISS
< x-github-request-id: C8F3:2D8599:2BDA2D6:2D0C67A:66B2BE81
< accept-ranges: bytes
< age: 0
< date: Wed, 07 Aug 2024 00:23:29 GMT
< via: 1.1 varnish
< x-served-by: cache-fra-etou8220027-FRA
< x-cache: MISS
< x-cache-hits: 0
< x-timer: S1722990209.287670,VS0,VE101
< vary: Accept-Encoding
< x-fastly-request-id: 68cf52bf04fa7335da5ee09db44282dbdfce6794
< content-length: 95934
<
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
* Failure writing output to destination
* Connection #0 to host awang-01.github.io left intact

That's very similar to what lychee sees, I guess. Note that lychee doesn't differentiate between images and other content when resolving links. For lychee, all that matters is the response code, and in this case the page returns a 200. I don't know why it returns a 200 on the CLI. Maybe some GitHub Pages bot detection? I'm not aware of such a mechanism.

nobkd commented 3 months ago

No, mre. You got that wrong.

The image on https://awang-01.github.io/testing/ has a relative src of testing/images/lychee.png and should result in an absolute URL of https://awang-01.github.io/testing/testing/images/lychee.png (see duplicated testing) which would fail, because it does not exist (which is expected here).

But lychee uses https://awang-01.github.io/testing/images/lychee.png which is the correct path to the image, but wrong in this context, because the relative image URL was resolved as relative to the hostname instead of the current location, I think.
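
To illustrate the difference, here is a minimal sketch using Python's standard `urllib.parse.urljoin`. It is not lychee's code (lychee is written in Rust); it only shows what standard relative-reference resolution gives for this page, and what you get if the same `src` is treated as root-relative instead. The variable names are made up for the example.

```python
from urllib.parse import urljoin

# Page that contains the <img> tag and the src value from the report.
PAGE_URL = "https://awang-01.github.io/testing/"
IMG_SRC = "testing/images/lychee.png"

# Standard resolution: the src is relative to the current location.
resolved_relative = urljoin(PAGE_URL, IMG_SRC)

# Treating the src as if it started with "/" resolves it against the host.
resolved_root = urljoin(PAGE_URL, "/" + IMG_SRC)

print(resolved_relative)
print(resolved_root)
```

The first value has the duplicated `testing` segment (the URL that should 404). The second is the URL lychee reported as 200.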

I think this could be a duplicate of #1296


Edit: Simple reproduction: (For the other way around. Pages noted as missing, when they're there.)

File tree:

```sh
root
└── test
    ├── index.html
    └── next.html
```

`root/test/index.html`:

```html
next
```

`root/test/next.html`:

```html
just needs to exist.
```

Serve a site from `root` (e.g. `python3 -m http.server -d . 3000`)

```sh
lychee http://localhost:3000/test/
# or
lychee http://localhost:3000/test/index.html
```

Results in:

```sh
> lychee http://localhost:3000/test/index.html
1/1 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
Issues found in 1 input. Find details below.

[http://localhost:3000/test/index.html]:
✗ [404] http://localhost:3000/next.html | Failed: Network error: Not Found

🔍 1 Total (in 0s) ✅ 0 OK 🚫 1 Error
```

As you can see, the relative link is not resolved correctly by lychee. You can open the entry page in a browser of your choice and see that you can access the `next` page.

---

Just as a note:

```sh
> lychee --version
lychee 0.15.1
```

> [!NOTE]
> Also, running the above example as `lychee .` (where `.` is `root`), i.e. checking on the file system instead of over http(s), works correctly.
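
For anyone who wants to recreate the file tree quickly, here is a rough Python sketch that writes the two files into a temporary directory. The exact markup of `index.html` is not preserved in this thread, so the `<a href="next.html">` link below is an assumption about what the page contained. `build_repro` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def build_repro(root: Path) -> Path:
    """Create root/test/index.html and root/test/next.html; return the test dir."""
    test_dir = root / "test"
    test_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: index.html links to next.html with a plain relative href.
    (test_dir / "index.html").write_text(
        '<a href="next.html">next</a>\n', encoding="utf-8"
    )
    (test_dir / "next.html").write_text("just needs to exist.\n", encoding="utf-8")
    return test_dir


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp(prefix="lychee-repro-"))
    build_repro(root)
    # Serve the tree with the same command as in the reproduction steps.
    print(f"python3 -m http.server -d {root} 3000")
```

After starting the printed server command, `lychee http://localhost:3000/test/index.html` can be run against it as described above.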

mre commented 1 month ago

It's fixed now. 🎉

lychee -v https://awang-01.github.io/testing/
     [404] https://awang-01.github.io/testing/testing/images/lychee.png
     [200] https://awang-01.github.io/testing/assets/styles.css

Issues found in 1 input. Find details below.

[https://awang-01.github.io/testing/]:
     [404] https://awang-01.github.io/testing/testing/images/lychee.png

🔍 2 Total (in 0s) ✅ 1 OK 🚫 1 Error

mre commented 1 month ago

Forgot to mention that it's fixed in master only for now and will be shipped in our next release, 0.17.0.