stevenvachon / broken-link-checker

Find broken links, missing images, etc within your HTML.
MIT License

UTF8 characters cause valid links to be detected as broken #234

Open matkoniecz opened 3 years ago

matkoniecz commented 3 years ago

I prepared a test case at https://github.com/matkoniecz/broken-link-checker-local-utf8

blc https://matkoniecz.github.io/broken-link-checker-local-utf8 -r

See https://matkoniecz.github.io/broken-link-checker-local-utf8/ - both links work, but the one with UTF-8 characters gets BLC_UNKNOWN/HTTP_undefined errors:

mateusz@grima:~$ blc https://matkoniecz.github.io/broken-link-checker-local-utf8 -r
Getting links from: https://matkoniecz.github.io/broken-link-checker-local-utf8
├───OK─── https://matkoniecz.github.io/broken-link-checker-local-utf8/test%20space.html
└─BROKEN─ https://matkoniecz.github.io/broken-link-checker-local-utf8/test_zażółć.html (BLC_UNKNOWN)
Finished! 2 links found. 1 broken.

Getting links from: https://matkoniecz.github.io/broken-link-checker-local-utf8/test%20space.html
└─BROKEN─ https://matkoniecz.github.io/broken-link-checker-local-utf8/test_zażółć.html (HTTP_undefined)
Finished! 2 links found. 1 excluded. 1 broken.

Finished! 4 links found. 1 excluded. 2 broken.
Elapsed time: 1 second

Sorry if I have misunderstood something, but as far as I can tell UTF-8 characters work in links in practice.

UTF-8 characters may be represented differently internally, but browsers seem completely fine with links that include letters such as those described at https://en.wikipedia.org/wiki/Ogonek

Sanity check: https://stackoverflow.com/questions/22357509/can-urls-have-utf-8-characters

Even DNS supports UTF-8 characters (with some workarounds and restrictions): https://en.wikipedia.org/wiki/Internationalized_domain_name
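
To illustrate what "internally different" means for the path part of a URL, here is a minimal Python sketch of the percent-encoding that is normally applied to non-ASCII characters before a request is sent. It is only an illustration: `percent_encode_url` is a hypothetical helper, not part of blc, and whether a missing encoding step is actually what causes the BLC_UNKNOWN result has not been confirmed.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Hypothetical helper for illustration only; it does not touch the host name.
    """
    parts = urlsplit(url)
    # Keep "/" and already-present "%" escapes; other unsafe characters
    # are encoded as their UTF-8 bytes (%XX per byte).
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = "https://matkoniecz.github.io/broken-link-checker-local-utf8/test_zażółć.html"
encoded = percent_encode_url(url)
print(encoded)
# https://matkoniecz.github.io/broken-link-checker-local-utf8/test_za%C5%BC%C3%B3%C5%82%C4%87.html
```

The encoded form and the form with literal UTF-8 characters refer to the same page, and a URL that is already escaped (such as the `test%20space.html` link in the test case) passes through this helper unchanged.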

Replaces https://github.com/LukasHechenberger/broken-link-checker-local/issues/50

rezaalavi commented 2 years ago

I have the same problem with websites in Chinese and Thai. Although the links exist, the program reports a BLC_UNKNOWN error.

mayrsascha commented 1 year ago

I have the same problem with grave and acute accents, which are very common in Latin languages and present in other languages too. For example: https://www.iswatersafetodrink.in/Italy/Cantù

matkoniecz commented 4 months ago

Is there anything I can do so that the first step, "needs confirmation", can be dropped?