stevenvachon / broken-link-checker

Find broken links, missing images, etc within your HTML.
MIT License

Is there an option to say a URL starting with '//' and then having a proper hostname is not broken? #173

Closed vvdwivedi closed 4 years ago

vvdwivedi commented 4 years ago

I am getting a lot of URLs like these reported as broken:

//www.google.com
//fonts.googleapis.com

But these seem to be valid URLs when the page is rendered, and nothing breaks because of them. I am not sure of the exact technical term, but I think these are scheme-relative URL strings: https://url.spec.whatwg.org/#scheme-relative-url-string

I noticed that you have a package, isurl, which offers a lenient way of checking for valid URLs. I am not sure whether these URLs are reported as broken because of the isurl test, but if so, can we add an option there to allow such URLs?

Environment:

stevenvachon commented 4 years ago

Is it solved with the master branch (unreleased v0.8)?

vvdwivedi commented 4 years ago

I haven't tried the master branch, only the released version. I will test with master and confirm today.

vvdwivedi commented 4 years ago

Here is what I tested. I built from master branch and ran bin/blc https://pg.vvdwivedi.com/broken-links.html

Here is the relevant page source for quick reference:

`

`

This is the result for the run:

Getting links from: https://pg.vvdwivedi.com/broken-links.html
├───OK─── https://pg.vvdwivedi.com/index.html
├───OK─── https://pg.vvdwivedi.com/img/small-img.png
├───OK─── https://pg.vvdwivedi.com/files/a.txt
├───OK─── https://www.google.com/
├─BROKEN─ https://pg.vvdwivedi.com/index-broken.html (HTTP_404)
├─BROKEN─ https://pg.vvdwivedi.com/img/small-img2.png (HTTP_404)
├─BROKEN─ https://pg.vvdwivedi.com/files/ab.txt (HTTP_404)
├─BROKEN─ https://pg.vvdwivedi.com/www.google.com (HTTP_404)

=======================
Links found: 16
Links skipped: 5
Links OK: 7
Links broken: 4
Time elapsed: 0 seconds
=======================

I can see that it's considering the URL starting with `//` as relative and appending the host, which results in a 404.

stevenvachon commented 4 years ago

`//www.google.com` → https://www.google.com/ (OK)
`www.google.com` → https://pg.vvdwivedi.com/www.google.com (404)

Looks fine to me.
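
For readers following along, the two resolutions above can be reproduced with the WHATWG `URL` class that ships with Node.js. This is only a minimal sketch of standard URL resolution, not broken-link-checker's own code; the helper name `resolve` is hypothetical, and the base URL is the test page from this thread.

```javascript
// Base page taken from the thread.
const base = "https://pg.vvdwivedi.com/broken-links.html";

// Hypothetical helper: resolve an href against a base URL using the
// WHATWG URL parser (a global in Node.js and in browsers).
function resolve(href, baseUrl) {
  return new URL(href, baseUrl).href;
}

// Scheme-relative: keeps the base's scheme, replaces the host.
const schemeRelative = resolve("//www.google.com", base);

// No leading slashes: treated as a path relative to the base page.
const pathRelative = resolve("www.google.com", base);

console.log(schemeRelative); // https://www.google.com/
console.log(pathRelative);   // https://pg.vvdwivedi.com/www.google.com
```

Only the second form points at the test host, which is why it is the one reported as a 404.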

vvdwivedi commented 4 years ago

Yes, you are right. Got a little confused there. The new version is fine.

After a quick run on 0.7.8 and 0.8.0, I got the following results:

From v 0.8.0

Getting links from: https://pg.vvdwivedi.com/broken-links.html
├───OK─── https://pg.vvdwivedi.com/index.html
├───OK─── https://pg.vvdwivedi.com/img/small-img.png
├───OK─── https://pg.vvdwivedi.com/files/a.txt
├───OK─── https://www.google.com/
├─BROKEN─ https://pg.vvdwivedi.com/index-broken.html (HTTP_404)
├─BROKEN─ https://pg.vvdwivedi.com/img/small-img2.png (HTTP_404)
├─BROKEN─ https://pg.vvdwivedi.com/files/ab.txt (HTTP_404)
├─BROKEN─ https://pg.vvdwivedi.com/www.google.com (HTTP_404)

=======================
Links found: 16
Links skipped: 5
Links OK: 7
Links broken: 4
Time elapsed: 0 seconds
=======================

From v 0.7.8

Getting links from: https://pg.vvdwivedi.com/broken-links.html
├───OK─── https://pg.vvdwivedi.com/index.html
├───OK─── https://pg.vvdwivedi.com/img/small-img.png
├───OK─── https://pg.vvdwivedi.com/files/a.txt
├───OK─── https://www.google.com/
├─BROKEN─ https://pg.vvdwivedi.com/index-broken.html (HTTP_404)
├─BROKEN─ https://pg.vvdwivedi.com/img/small-img2.png (HTTP_404)
├─BROKEN─ https://pg.vvdwivedi.com/files/ab.txt (HTTP_404)