psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0
51.97k stars 9.29k forks source link

Make sessions safe[r] in multi-process environment #4323

Open rabbbit opened 6 years ago

rabbbit commented 6 years ago

tldr; in multi-process environment (Celery) sessions might lead to request/responses being mixed up.

It is unsafe to use Session in a multi-process environment - if the fork happens after Session initialisation the underlying connection pool will be shared across both processes, leading to potentially dangerous and hard to debug issues.

I'm not sure what should happen - whather a code change is necessary, or a documentation change is enough. Please advise :)

Use case

It is likely to happen if Session is created at module load time, like:

class MyClient(object):
    session = requests.Session()

    def do_things(self, **params):
         self.session.get(**params)

or if a 3rd party client is imported at the module level, where it becomes totally invisible:


my_database_client = database.DatabaseClient(**params)

class MyDatabaseWrapper(object):

      def do_things(self, **params):
            my_database_client.update(**params)

This is particularly tricky in a Celery, where a previously working function might start causing troubles if invoked from Celery. Celery seems like a common Python use case.

I've seen this pattern in 3 different repos written by 3 different developers - it feels common enough for it to be a problem.

Reason

This is related to https://github.com/shazow/urllib3/issues/850 in urllib3, where it was stated that it's the callers' responsibility to worry about forking - in this case, it's Requests.

Expected Result

Ideally, a new Session with the same parameters would be started by Requests.

If that's too complicated, I'd expect an exception to be thrown if PID change was detected.

At the very least, docs should state the expected behaviour.

Actual Result

The responses are mixed up - one process might receive a response made for a call it didn't make.

Reproduction Steps

import os
import sys
import requests

MAX = 20
s = requests.Session()

for n in range(MAX):
    pid = os.fork()
    if pid == 0:
        try:
            # s.mount("http://", requests.adapters.HTTPAdapter())  # uncomment to fix
            r = s.get('http://httpstat.us/300?sleep=100')
        except Exception:
            # ignoring intermittent http errors
            pass
        sys.exit(0)
    else:
        try:
            r = s.get('http://httpstat.us/200')
        except Exception as exc:
            # ignoring intermittent http errors
            pass

        if r.status_code != 200:
            print 'pid: {} Call {}/{}. Wrong status code detected {}'.format(
                os.getpid(),
                n,
                MAX,
                r.status_code
            )
pid: 30996 Call 7/20. Wrong status code detected 300   
pid: 30996 Call 9/20. Wrong status code detected 300   
pid: 30996 Call 14/20. Wrong status code detected 300  
pid: 30996 Call 17/20. Wrong status code detected 300  
pid: 30996 Call 19/20. Wrong status code detected 300 

System Information

$ python -m requests.help

pawel@pawel-C02V306VHTDH ~/Uber/tokenizer-python $ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  }, 
  "cryptography": {
    "version": ""
  }, 
  "idna": {
    "version": "2.6"
  }, 
  "implementation": {
    "name": "CPython", 
    "version": "2.7.14"
  }, 
  "platform": {
    "release": "16.7.0", 
    "system": "Darwin"
  }, 
  "pyOpenSSL": {
    "openssl_version": "", 
    "version": null
  }, 
  "requests": {
    "version": "2.18.4"
  }, 
  "system_ssl": {
    "version": "100020cf"
  }, 
  "urllib3": {
    "version": "1.22"
  }, 
  "using_pyopenssl": false
}```

This command is only available on Requests v2.16.4 and greater. Otherwise,
please provide some basic information about your system (Python version,
operating system, &c).
Lukasa commented 6 years ago

So my suggestion here is that we simply document this as a risk. It can’t happen merely by creating a Session before forking, it has to actually be used before the fork. The difficulty is that forking combined with multi threading and streaming means that we need to check PIDs constantly: additionally, there is no appropriate recovery in many cases.

As a result I suggest this is best handled at application scope: either create sessions after forking or wrap them in a data structure that understands your forking paradigm. There is just no general solution to this problem.

thepegasos commented 6 years ago

I second to this suggestion. I was using same sessions object across multiple independent processes, and boy, I had to literally spend 3 hours scratching my head before I could finally pinpoint to this as a root cause.

Oddly though, my code was working perfectly on a windows platform. On Linux, I was having a issue with responses mixups. I think may be because windows and Linux handle multiprocessing differently? Not sure here.

sigmavirus24 commented 6 years ago

I think may be because windows and Linux handle multiprocessing differently? Not sure here.

They do. multiprocessing on Windows can't fork in the same was as linux so you were not actually sharing the same session across processes

sivang commented 6 years ago

I am also having issues with celery and requests, this results in unexplained hangups while .get'ing , bad content, and auth tokens being randomly invalidated.

sigmavirus24 commented 6 years ago

Quoting:

As a result I suggest this is best handled at application scope: either create sessions after forking or wrap them in a data structure that understands your forking paradigm. There is just no general solution to this problem.

sivang commented 6 years ago

Would using two session objects, one for authentication pre-fork, extracting the cookiejar , passing the cookiejar to the to-be-forked task (-celery) and then using it to do whatever needs to be done against the server work?

sivang commented 6 years ago

So I think I've got it, and have a code recipe for a best practice, should I go ahead and write something and then submit a PR for the docs?

JackieMa000 commented 5 years ago

@sivang Can you please share your code recipe for best practice?