Address deleting cookies when domain-less defaults are used.

jvanasco commented 6 years ago

I found a few variations of this "implementation detail" in other reports, but I believe this particular use case is worth addressing.

If a URL is retrieved with either a dict or unqualified cookie jar, the deletions are ignored.

# use a dict
resp = requests.get(url, cookies={'cookie1': 'value'},)

or

# use a cookiejar
session1 = requests.Session()
session1.cookies = requests.cookies.RequestsCookieJar()
session1.cookies.set('foo', 'bar')
resp1 = requests.get(url)

This can create endless redirect loops if the server's response to a resource/url combination is to set a "delete cookie", which requests can not handle. Aside from failing tests and puzzled developers, this can trigger abuse flags with 3rd party api services.

Expected Result

If the server sends a cookie deletion value, it should be respected and not sent to the server.

Actual Result

passing a 'cookies=" dict to requests.get() creates a wrapped session with the default cookiejar. the current implementation just deletes any matching cookie for the domain, which fails because the cookie was set as a default (not for that domain).

Potential Way To Fix

A potential fix is to extend the RequestsCookieJar to use a blacklist when domain-less defaults are provided.

When a SetCookie deletion is encountered and the cookiejar contains domain-less cookies:

delete the domain cookie (if it exists)
and populate the blacklist.

When a request is made, filter the defaults against active blacklist values.

When a SetCookie update is made, remove it from the blacklist (or update the blacklist's to note it as invalid until the cookie's expiry date)

Reproduction Steps

import requests

# cookies not unset
session1 = requests.Session()
session1.cookies = requests.cookies.RequestsCookieJar()
session1.cookies.set('foo', 'bar')

r_set = session1.get("http://httpbin.org/cookies/set?bar=foo")
print(r_set.json())
r_del = session1.get("http://httpbin.org/cookies/delete?foo=bar")
print(r_del.json())
r_check = session1.get("http://httpbin.org/cookies")
print(r_check.json())

# cookies unset because domain matches
session2 = requests.Session()
session2.cookies = requests.cookies.RequestsCookieJar()
session2.cookies.set('foo', 'bar', domain='httpbin.org', path='/')

rr_set = session2.get("http://httpbin.org/cookies/set?bar=foo")
print(rr_set.json())
rr_del = session2.get("http://httpbin.org/cookies/delete?foo=bar")
print(rr_del.json())
rr_check = session2.get("http://httpbin.org/cookies")
print(rr_check.json())

System Information

"version": "2.18.4"

sigmavirus24 commented 6 years ago

Thanks for this @jvanasco

I had started hacking on a branch that would make the domain and path required with our cookie jar and probably provide a better interface to all of this in general. Sadly, I think that got lost in the suffle and I'd forgotten it. The API design change needs to be made for v3.0 to make this more tenable.

jvanasco commented 6 years ago

Making the domain and path required would be much easier than the blacklist idea I was stubbing out (nowhere near a PR, just some tests on the general logic working before I posted here).

If that is required, IMHO, it would make sense to deprecate the 'cookies' argument to requests.get() (lose the docs, print a warning) and change the docs around cookiejars and Session()to note the impending change.

Our unit tests just tossed a dict of cookies into requests.get(), because that's what the docs did. It took an embarrassing amount of time to realize I had successfully fixed an edge case we added a test for, and the problem was in the test/requests. Then it took looking at the code and a history of tickets to realize that was a bad idea and requests didn't do all the wonderfully brilliant things I expected it to (because it does all those wonderfully brilliant things everywhere else).

sigmavirus24 commented 6 years ago

@jvanasco I completely agree our behaviour is far from wonderfully brilliant like the rest of the project. That said, I think there's a difference between the following:

requests.request(METHOD, url, cookies=key_value_pair_cookies)

And

session.cookies.set(key, value)

The first is a convenience that I think we can reasonably keep around. Underneath the functional API is a Session. If we have the session treat that dictionary as calling .set(key, value, domain=parsed_domain, path=parsed_path) we can probably then simplify our handling of cookies and the cookie-jar will respect deletions. I think that's a potentially far better solution than deprecating it altogether.

The second is just wrong. ;)

jvanasco commented 6 years ago

You are correct - parsing the cookie domain/path out of the submitted url would be a much better idea than trashing it, but you'll still end up with issues over the potential of a leading dot in the domain (and ugh, the path, but let's not go there yet). This can be infuriating to debug if a problem pops up from it.

For example, consider making a request to example.com with a foo cookie:

requests.get('http://example.com/path/to/url', cookies={'foo': 'bar'})

let's also assume the path is just set for the root / (because it gets even more complicated otherwise)

The server might be configured to send the SetCookie header for foo with any of three options for the domain:

domain = None
domain = 'example.com'
domain = '.example.com'

The RFCs and browser behavior on all this were switching around for a few years too on what no domain should imply, or the behavior of the leading dot. While everything is settled right now, there are still a lot of legacy clients/servers out there, and the current requests codebase doesn't seem to standardize on anything.

So it could be represented by two ways of setting cookies in the jar.

jar.set('foo', 'bar', domain='example.com')
jar.set('foo', 'bar', domain='.example.com')

If the server sends an update/clearout for the cookie value, it's possible that requests will have guessed wrong. While requests sends some data to the underlying cookielib functionality from the standard library over the initial dot or domain being set, these are treated as unique cookies and must be deleted/updated independently.

We just updated/unified all our python and js cookie libraries. I had the "pleasure" of reconciling a bunch of issues around how differently libraries & browsers were handling the leading domain dot differently. (sidenote: My lazy solution on the server side was to migrate everything to specifying the domain with a leading dot, but also killing cookies with both leading dot and not. i hate cookies now. )

psf / requests