gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Better URL parsing according to whatwg URL standard #596

Open WGH- opened 3 years ago

WGH- commented 3 years ago

As of now, Colly parses URLs with the Go standard library's net/url parser. This parser is fairly simple and does not implement some of the quirks that browsers handle. Since Colly is a web crawling framework and has to cope with all the odd URLs that appear on the web, it should follow browser behavior here.

Fortunately, there's a web standard that codifies the quirks: https://url.spec.whatwg.org/#url-parsing

I'll give a few examples that net/url parser handles incorrectly:

I have found this Go library: https://github.com/nlnwa/whatwg-url. It doesn't appear to be popular, but it has a rather large test suite borrowed from https://github.com/web-platform-tests/wpt/tree/master/url, which it seems to pass.

Rather than implementing these URL parsing quirks one by one in Colly and duplicating effort, I think I'll try out that library to see whether it is a good fit for us, and report back with the results.

What do you think?

asciimoo commented 3 years ago

@WGH- your points are valid, I totally agree. There is no need to reinvent the wheel. I have to look at whatwg-url in more detail, but it looks promising at first glance.

fussraider commented 9 months ago

I am using version 2.1.0 and ran into a similar problem: if the link I am fetching contains special characters, the request is sent with those characters percent-encoded. Not all sites handle this correctly, and some return 404.

For example, given the page address https://example.com/some's-page-path, the request is actually sent to https://example.com/some%27s-page-path

And since the server does not handle such cases, it will return 404.

I agree that this is a bug on that particular site's side, but I have no way to influence it, and I still need to parse the data 😇

Is there any solution to this problem now?