gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

URL normalization #361

Open owenson opened 5 years ago

owenson commented 5 years ago

Colly does not currently appear to do any URL normalization. For example, query string parameters should be sorted alphabetically, the host lowercased, etc.

See https://github.com/PuerkitoBio/purell

asciimoo commented 4 years ago

Why do parameters need to be reordered?

owenson commented 4 years ago

/test?a=1&b=2 and /test?b=1&a=2 point to the same thing

asciimoo commented 4 years ago

@owenson This depends on the web app: it can distinguish these cases and return different results for the two URLs. Also, the first URL contains a=1 and b=2 but the second contains a=2 and b=1; I assume this is just a typo.

owenson commented 4 years ago

Yes, it was a typo. I've yet to see a web app that cares about parameter order, or indeed any web framework that lets you get that information. Most crawlers normalise query parameters in this way. Nutch does it, Scrapy does it, etc.
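
For illustration, here is a minimal sketch of the kind of normalization being discussed, using only Go's standard `net/url` package. It is not colly's or purell's implementation. The helper name `normalizeURL` and the example URLs are assumptions made up for this sketch. It lowercases the scheme and host and relies on `url.Values.Encode`, which emits parameters sorted by key.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL is a hypothetical helper (not part of colly or purell).
// It lowercases the scheme and host and sorts query parameters by key.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	// url.Values.Encode writes parameters sorted by key.
	// Note: Query() silently drops malformed pairs and Encode()
	// re-escapes values, so the query string may change in other ways.
	u.RawQuery = u.Query().Encode()
	return u.String(), nil
}

func main() {
	// Both inputs are expected to normalize to the same string.
	for _, raw := range []string{
		"http://Example.com/test?b=2&a=1",
		"http://example.com/test?a=1&b=2",
	} {
		n, err := normalizeURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(n)
	}
}
```

Sorting is applied unconditionally here; as noted earlier in the thread, a web app could in principle treat parameter order as significant, so in a crawler this would more safely be an opt-in setting.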

asciimoo commented 4 years ago

I'm open to adding an option to normalize URLs. Would you like to work on this?

owenson commented 4 years ago

I've used this in the past.

https://github.com/PuerkitoBio/purell

asciimoo commented 4 years ago

Great, thanks!