Open WGH- opened 3 years ago
@WGH- your points are valid, I totally agree. There is no need to reinvent the wheel. I have to look at whatwg-url in more detail, but it looks promising at first glance.
I am using version 2.1.0 and ran into a similar problem: if the link I am fetching data from contains special characters, the request is sent with those characters percent-encoded. Not all sites handle this correctly, and some return 404.
For example, the page address is https://example.com/some's-page-path, but the request is sent to https://example.com/some%27s-page-path.
And since the server does not handle such cases, it will return 404.
I agree that this is a bug on the site's side, but I have no way to influence it, and I need to parse the data 😇
Is there any solution to this problem now?
As of now, Colly parses URLs with the Go stdlib's `net/url` parser. This parser is fairly simple and doesn't implement some of the quirks that browsers do. Since Colly is a web crawling framework, it should follow browser behavior here in order to handle all the weird stuff found on the WWW.

Fortunately, there's a web standard that codifies the quirks: https://url.spec.whatwg.org/#url-parsing
I'll give a few examples that the `net/url` parser handles incorrectly:

`<a href="/?тест">foo</a>` and `<a href="/?%D1%82%D0%B5%D1%81%D1%82">bar</a>` both lead to the same location, and the HTTP request on the wire would be `GET /?%D1%82%D0%B5%D1%81%D1%82 HTTP/1.1` in both cases (this assumes UTF-8 encoding). Note that simply percent-encoding the input is wrong, as it would produce a double-encoded string when the input is already encoded.

I have found this Go library: https://github.com/nlnwa/whatwg-url. It doesn't appear to be popular, but it has a rather large test suite borrowed from https://github.com/web-platform-tests/wpt/tree/master/url, which it seems to pass.
Rather than implementing these URL parsing quirks one by one in Colly and duplicating effort, I think I'll check out that library, see whether it's a good fit for us, and report back with results.
What do you think?