psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

problem with downloading when http_proxy set #1686

Closed smoser closed 10 years ago

smoser commented 10 years ago

We originally hit this as Ubuntu bug http://pad.lv/1240652 . Running 'go-requests.py' (see below) like this:

$ url="http://cloud-images.ubuntu.com/releases/precise/release/ubuntu-12.04-server-cloudimg-amd64-disk1.img"
$ http_proxy=http://localhost:3128/ python go-requests.py $url out.img

results in a hang and eventually a timeout. The code does an open, then a close, then another open of the same URL, which seems to trigger the bug. Clearly that is a silly thing to do, but it should not result in a hang.

I'm not certain whether this is really a bug in urllib3 or in requests' use of it.

#!/usr/bin/python
# Reproducer: read a little from one streamed response, close it, then make a
# second streamed request for the same URL and copy its body to a file.
import requests
import sys

BUFLEN = 1024 * 10
if len(sys.argv) > 3:
    BUFLEN = int(sys.argv[3])

# First request: read 1024 bytes from the raw stream, then close the response.
cq = requests.get(sys.argv[1], stream=True)
cq.raw.read(1024)
cq.close()

# Second request for the same URL: copy the raw stream to the output file.
req = requests.get(sys.argv[1], stream=True)
with open(sys.argv[2], "wb") as wfp:
    while True:
        buf = req.raw.read(BUFLEN)
        wfp.write(buf)
        if len(buf) != BUFLEN:
            break
req.close()

Lukasa commented 10 years ago

Hmm, good spot. The answer is that I don't know, I'll try to dig into this over the weekend.

sigmavirus24 commented 10 years ago

Might I ask a couple questions?

  1. Why make two requests?
  2. Why read from the first and then close it?
  3. Why not use Response#iter_content?
  4. Does the code ever reach the second request?
  5. If it does reach the second request, where does the code hang?
smoser commented 10 years ago

  1. No good reason, but it's not explicitly wrong.
  2. See 1.
  3. The code I was using this from expects a 'read' interface. No really good reason, but again, that's not explicitly wrong; surely Response#read is supported. I suspect iter_content wouldn't be any different.
  4. It does, but you can just try it and see. On an Ubuntu instance, KVM guest, or LXC container somewhere, just 'apt-get install squid3' and then use 'http_proxy=http://localhost:3128/'.
  5. I think it ends up getting one read in the second.

So overall, yeah, I could change the code to do things a bunch of different ways, and as I said when I opened the original bug, the two opens are silly. But it doesn't seem explicitly wrong.

Lukasa commented 10 years ago

Ok, so I can't reproduce this using mitmproxy on my box. I wonder if squid doesn't like something we're doing.

smoser commented 10 years ago

Lukasa, if you're interested in debugging, I set up a saucy instance in Azure. You should be able to get in with your github keys to 'smoser@smoser1029s.cloudapp.net'.

There, I've done 'apt-get install python-requests squid3'. I put the example testcase above in 'issue-1686.py' and I can reproduce it there.

I'll leave that machine up till Thursday or Friday. If I don't hear back from you by then, I'll kill it.

Thanks for looking into it.

rbtcollins commented 10 years ago

How big is the object? You might be triggering read-ahead or something, such that the second request blocks until the first is complete.

The open/close/open idiom is unsafe in general on HTTP as there is no guarantee the second request will go to the same server with the same content.

Also it's super slow on high latency links, so I highly recommend one open, stream, close.
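
For reference, a minimal sketch of that single open/stream/close pattern, using Response#iter_content as suggested above; the 10 KiB chunk size just mirrors BUFLEN from the reproducer and is otherwise an arbitrary choice:

import sys
import requests

url, dest = sys.argv[1], sys.argv[2]

# One request, streamed to disk in chunks, then closed.
resp = requests.get(url, stream=True)
try:
    with open(dest, "wb") as wfp:
        for chunk in resp.iter_content(chunk_size=1024 * 10):
            if chunk:  # filter out keep-alive chunks
                wfp.write(chunk)
finally:
    resp.close()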

rbtcollins commented 10 years ago

Oh, also the output code is not safe: you should write to a temporary file and rename.
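
A rough sketch of that temporary-file-then-rename idea; the '.part' suffix and the helper name here are just illustrative:

import os

def save_atomically(resp, dest, buflen=1024 * 10):
    # Stream into a temp file alongside the destination, then rename it into
    # place, so readers never see a partially written output file.
    tmp = dest + ".part"
    with open(tmp, "wb") as wfp:
        for chunk in resp.iter_content(chunk_size=buflen):
            if chunk:
                wfp.write(chunk)
    os.rename(tmp, dest)  # atomic when tmp and dest are on the same filesystem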

rbtcollins commented 10 years ago

I've added some print statements:

$ cat t.py
#!/usr/bin/python
BUFLEN = 1024 * 10

import requests, sys
if len(sys.argv) > 3:
    BUFLEN = int(sys.argv[3])

print "probing"
cq = requests.get(sys.argv[1], stream=True)
cq.raw.read(1024)
cq.close()

print "getting"
req = requests.get(sys.argv[1], stream=True)
with open(sys.argv[2], "wb") as wfp:
    while True:
        print "read"
        buf = req.raw.read(BUFLEN)
        print "write"
        wfp.write(buf)
        if len(buf) != BUFLEN:
            break
req.close()

rbtcollins commented 10 years ago

Output:

$ python ./t.py $url out.img
probing
getting
read
write
read
write
read
write
read
write
read
write
read
write
read
write
read
write
read
write
read
write
read
write
read
write
read
^CTraceback (most recent call last):
  File "./t.py", line 18, in <module>
    buf = req.raw.read(BUFLEN)
  File "/tmp/foo/local/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 174, in read
    data = self._fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
KeyboardInterrupt

Lukasa commented 10 years ago

@rbtcollins Just to confirm, are you testing with a proxy?

smoser commented 10 years ago

So it seems like requests and/or urllib3 isn't actually closing the connection on close(), which may be by design. The same problem occurs with urllib2 if you do not close the urllib2.urlopen handle. I opened a squid bug at http://bugs.squid-cache.org/show_bug.cgi?id=3949 .
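
For comparison, a sketch of the urllib2 version of the same mistake; it assumes squid is listening on localhost:3128 and http_proxy is set, as in the original reproducer:

import sys
import urllib2

url = sys.argv[1]

# First request: read a little and deliberately do not close the handle.
first = urllib2.urlopen(url)
first.read(1024)
# first.close()  # omitting this reproduces the hang; adding it avoids it

# Second request for the same URL through the proxy: this read stalls,
# because squid still sees the first, unfinished request for the object.
second = urllib2.urlopen(url)
data = second.read(1024 * 10)
second.close()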

rbtcollins commented 10 years ago

@Lukasa I misinterpreted, and yes, it was hanging.

The cause is, as smoser says, a squid bug: it appears when there is a concurrent request for one URL while the first reader is stalled, neither closing nor reading.

This was caused because

cq = requests.get(sys.argv[1], stream=True)
cq.raw.read(1024)
cq.close()

doesn't actually close the socket for cq until you do 'del cq'. That's a bug IMO.

Lukasa commented 10 years ago

@smoser Good spot. We very deliberately do not close the connection. We provide connection-pooling via urllib3, because opening a new connection each time is nuts. =D
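
Given that pooling, a sketch of ways a caller can still force the connection shut when abandoning a partially read body; the first comes from rbtcollins above, the second is an assumption on my part using a standard 'Connection: close' request header rather than anything requests-specific:

import sys
import requests

url = sys.argv[1]

# Workaround 1 (from rbtcollins above): drop the last reference so the
# connection is torn down instead of being parked with unread data.
cq = requests.get(url, stream=True)
cq.raw.read(1024)
del cq

# Workaround 2 (assumption): send 'Connection: close' so the far end ends
# the connection after this response rather than keeping it alive for reuse.
cq = requests.get(url, stream=True, headers={"Connection": "close"})
cq.raw.read(1024)
cq.close()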

sigmavirus24 commented 10 years ago

This is absolutely not a bug in requests. As @Lukasa rightly states this is a very intentional design decision.

Lukasa commented 10 years ago

My reading of this issue is that we've resolved it as an awkward behaviour in Squid. I'm closing this issue unless anyone disagrees.