gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0

Cookiejar does not respect several domains #254

Open jespern opened 6 years ago

jespern commented 6 years ago

I'm attempting to use colly to write a program that first logs in to a site and then obtains some information. The site uses two different domains to handle authentication, say www.site.com and auth.site.com. The flow begins with a request to www (this is necessary to get an initial cookie), which redirects to auth. From there, I POST my data to log in, and another cookie is set. I am then redirected back to www.

The problem is that both of these sites use the same cookie name. When I am redirected back to www after logging in, colly supplies both cookies: the initial one set by www and the "logged in" one set by auth. I've debugged this extensively using a proxy, and I can see that colly sends "Cookie: _auth=false; _auth=true", so the web server I'm talking to trips up and sends me into an infinite redirect loop. After doing some research on this, it seems to happen because the cookies set during the flow are not distinguished by the domain they belong to. Colly simply sends all of them.

Unfortunately I am not nearly qualified to address this, but nonetheless here's a bug report, and I hope someone more skilled will address it eventually.

asciimoo commented 5 years ago

Could you provide a test case which reproduces this situation?
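
A minimal sketch of what such a test case could start from, using only Go's standard net/http/cookiejar rather than colly itself. It rests on two assumptions that the report does not confirm: that colly hands cookie storage to the standard jar, and that the auth host sets its cookie with Domain=site.com. The helper name cookieHeader and its sharedDomain flag are hypothetical, introduced only for illustration. Under those assumptions the jar returns both cookies for www, which matches the header seen through the proxy; with host-only cookies it returns just the one set by www.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
	"strings"
)

// cookieHeader (hypothetical helper) replays the cookie-setting steps of the
// login flow against a fresh standard-library jar and returns the Cookie
// header value that would be sent to www.site.com afterwards.
// When sharedDomain is true, the auth host sets its cookie with
// Domain=site.com. That attribute is an assumption, not taken from the report.
func cookieHeader(sharedDomain bool) string {
	jar, err := cookiejar.New(nil)
	if err != nil {
		panic(err)
	}
	www, _ := url.Parse("https://www.site.com/")
	auth, _ := url.Parse("https://auth.site.com/")

	// Step 1: www sets the initial, host-only cookie.
	jar.SetCookies(www, []*http.Cookie{{Name: "_auth", Value: "false"}})

	// Step 2: auth sets the "logged in" cookie under the same name.
	loggedIn := &http.Cookie{Name: "_auth", Value: "true"}
	if sharedDomain {
		loggedIn.Domain = "site.com"
	}
	jar.SetCookies(auth, []*http.Cookie{loggedIn})

	// Step 3: collect what the jar would attach to a request for www.
	var parts []string
	for _, c := range jar.Cookies(www) {
		parts = append(parts, c.Name+"="+c.Value)
	}
	return strings.Join(parts, "; ")
}

func main() {
	fmt.Println("host-only:      ", cookieHeader(false))
	fmt.Println("Domain=site.com:", cookieHeader(true))
}
```

If the real site sets host-only cookies and colly still sends both, the sketch does not explain the report, and a reproduction against colly with two test servers would be the next step.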