kevin1024 / vcrpy

Automatically mock your HTTP interactions to simplify and speed up testing
MIT License
2.69k stars 387 forks source link

fails when working through proxy (as instructed by http_proxy env var): invalid literal for int() with base 10: '3128http' #214

Open yarikoptic opened 9 years ago

yarikoptic commented 9 years ago

as of 1.6.1 (as in Debian)

$> echo $http_proxy
http://localhost:3128

$> venv-tests/bin/nosetests -s -v datalad/crawler/nodes/tests/test_misc.py
datalad.crawler.nodes.tests.test_misc.test_get_deposition_filename ... ok
datalad.crawler.nodes.tests.test_misc.test_assign ... SKIP: TODO
datalad.crawler.nodes.tests.test_misc.test_rename ... SKIP: TODO

----------------------------------------------------------------------
Ran 3 tests in 14.059s

OK (SKIP=2)

$> venv-tests/bin/nosetests -s -v datalad/crawler/nodes/tests/test_misc.py
datalad.crawler.nodes.tests.test_misc.test_get_deposition_filename ... ERROR
datalad.crawler.nodes.tests.test_misc.test_assign ... SKIP: TODO
datalad.crawler.nodes.tests.test_misc.test_rename ... SKIP: TODO

======================================================================
ERROR: datalad.crawler.nodes.tests.test_misc.test_get_deposition_filename
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/yoh/proj/datalad/datalad/venv-tests/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/dist-packages/vcr/cassette.py", line 95, in __call__
    return function(*args, **kwargs)
  File "/home/yoh/proj/datalad/datalad/datalad/crawler/nodes/tests/test_misc.py", line 22, in test_get_deposition_filename
    output = list(get_deposition_filename(**input))
  File "/home/yoh/proj/datalad/datalad/datalad/crawler/nodes/misc.py", line 85, in get_deposition_filename
    yield updated(data, {'filename': get_url_deposition_filename(data['url'])})
  File "/home/yoh/proj/datalad/datalad/datalad/support/network.py", line 46, in get_url_deposition_filename
    r = retry_urlopen(request)
  File "/home/yoh/proj/datalad/datalad/datalad/support/network.py", line 113, in retry_urlopen
    return urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/dist-packages/vcr/stubs/__init__.py", line 220, in getresponse
    if self.cassette.can_play_response_for(self._vcr_request):
  File "/usr/lib/python2.7/dist-packages/vcr/cassette.py", line 199, in can_play_response_for
    return request and request in self and \
  File "/usr/lib/python2.7/dist-packages/vcr/cassette.py", line 270, in __contains__
    for index, response in self._responses(request):
  File "/usr/lib/python2.7/dist-packages/vcr/cassette.py", line 194, in _responses
    if requests_match(request, stored_request, self._match_on):
  File "/usr/lib/python2.7/dist-packages/vcr/matchers.py", line 89, in requests_match
    matches = [(m(r1, r2), m) for m in matchers]
  File "/usr/lib/python2.7/dist-packages/vcr/matchers.py", line 25, in port
    return r1.port == r2.port
  File "/usr/lib/python2.7/dist-packages/vcr/request.py", line 67, in port
    port = parse_uri.port
  File "/usr/lib/python2.7/urlparse.py", line 113, in port
    port = int(port, 10)
ValueError: invalid literal for int() with base 10: '3128http'
(Pdb) up
> /usr/lib/python2.7/dist-packages/vcr/request.py(67)port()
-> port = parse_uri.port
(Pdb) up
> /usr/lib/python2.7/dist-packages/vcr/matchers.py(25)port()
-> return r1.port == r2.port
(Pdb) up
> /usr/lib/python2.7/dist-packages/vcr/matchers.py(89)requests_match()
-> matches = [(m(r1, r2), m) for m in matchers]
(Pdb) p r2
<Request (GET) http://localhost:3128http://human.brain-map.org/api/v2/well_known_file_download/157722290>
melinath commented 8 years ago

I just ran into this as well. Proxy is broken on vcr.

melinath commented 8 years ago

Side note: even if it were working, the stored version seems to be stored under a url that doesn't include the proxy information, which will probably make them not match.

-> return r1.port == r2.port
(Pdb) r1
<Request (GET) https://www.linkedin.com/>
(Pdb) r2
<Request (GET) http://proxy.crawlera.com:8010http://linkedin.com/>
agriffis commented 8 years ago

@yarikoptic @melinath Did either of you look into how to fix this? Want to make a PR?

yarikoptic commented 8 years ago

If I did, I would have sent a PR I think... Since there is none, I guess I have just decided to whine instead ;-)

Toilal commented 8 years ago

Same problem here.

Toilal commented 8 years ago

I've run the issue in the debuger, and problem occurs in stubs module at method VCRConnection._uri().

This method should return the absolute URI of an URL, right ... But when a proxy is defined, self.real_connection.host doesn't holds the destination host but the proxy host.

In units tests, for http://httpbin.org/ as input, it returns http://http-proxy:8080http://httpbin.org/ when the system proxy is configured as http://http-proxy:8080. It's obviously wrong.

delijati commented 8 years ago

It's probably not the right place to fix it but it works for me vcr/stubs/__init__.py:

    def request(self, method, url, body=None, headers=None):
        '''Persist the request metadata in self._vcr_request'''
        import os
        http_proxy = os.environ.get("http_proxy")
        https_proxy = os.environ.get("https_proxy")

        if http_proxy:
            uri = self._uri(url)
            uri = uri.replace(http_proxy, "")
        if https_proxy:
            uri = self._uri(url)
            uri = uri.replace(https_proxy, "")

        self._vcr_request = Request(
            method=method,
            uri=uri,
            body=body,
            headers=headers or {}
        )
        log.debug('Got {0}'.format(self._vcr_request))

        # Note: The request may not actually be finished at this point, so
        # I'm not sending the actual request until getresponse().  This
        # allows me to compare the entire length of the response to see if it
        # exists in the cassette.
delijati commented 8 years ago

Any other suggestions?

yarikoptic commented 8 years ago

Since noone followedup, I will just dump the distilled version of my time somewhat wasted in trying to get acquainted with code and logic here.

  1. I guess somewhere vcr should explicitly check/acknowledge use of proxy. urllib2.Request has has_proxy() which in these cases reports True, and req.host points to the proxy, while header seems to contain original address and req.get_full_url() also provides original url:
1193    
(Pdb) 
1194            try:
1195                h.request(req.get_method(), req.get_selector(), req.data, headers)
1196            except socket.error, err: # XXX what error?
1197                h.close()
1198                raise URLError(err)
1199            else:
1200                try:
1201                    r = h.getresponse(buffering=True)
1202                except TypeError: # buffering kw not supported
1203 ->                 r = h.getresponse()
1204    
(Pdb) p h
<vcr.patch.VCRHTTPConnection/tmp/pytest-of-yoh/pytest-10/test_patched_content0/synopsis.yaml object at 0x7f2edfcb0cd0>
(Pdb) p req
<urllib2.Request instance at 0x7f2edfc65e60>
*(Pdb) p req.host    
'localhost:3128'
*(Pdb) p req.has_proxy()
True
*(Pdb) p req.origin_req_host
'127.0.0.1'
*(Pdb) p req.port          
None
*(Pdb) p req.get_full_url()
'http://127.0.0.1:37441'
(Pdb) p host
'localhost:3128'
(Pdb) p headers
{'Host': '127.0.0.1:37441', 'User-Agent': 'Python-urllib/2.7', 'Connection': 'close'}
(Pdb) p h
<vcr.patch.VCRHTTPConnection/tmp/pytest-of-yoh/pytest-10/test_patched_content0/synopsis.yaml object at 0x7f2edfcb0cd0>
  1. but actually it is probably the VCRConnection._uri to somewhat massage here, since that is the one which re-creates a new, that somewhat bogus url, out of protocol/host/port and a ... url ;) apparently what is usually expected to be contained in the url argument seems actually to be the (absolute) path on the server (+all the ?params.). But when we are dealing with proxy, then what is provided from urllib2 is url is the url and host/port refer to the proxy address.

So I wonder what is the logical thing vcr should do here ?

  1. record result of the transaction as if there were no proxy (sounds more useful, so tape could be played without proxy set in env vars etc)?
  2. or record transaction to the proxy (sounds more to the point of vcrpy)?
kevin1024 commented 8 years ago

Thanks for the investigation.

I think VCR should be able to run in an environment with a proxy as well as an environment without a proxy using the same cassettes. I could see a scenario where a developer has a proxy but a CI server does not, for example.

simon-weber commented 8 years ago

I'll take a look at this. One of our integrations requires ip whitelisting, so we need support for this to be able to record from our autoscaled CI hosts (which sit behind a whitelisted proxy).

kevin1024 commented 8 years ago

@simon-weber thanks! This issue has been annoyingly persistent.

simon-weber commented 7 years ago

Huh, this is more complicated than I thought. I was hoping we could just override the connection's host with the contents of the Host header. This works fine for urllib2, but it looks like requests doesn't actually set it.

The host we want is included in url, but we can't just parse it out in the general case since someone could theoretically request the url http://host.com/http://host.com -- ie, we can't distinguish between being proxied and getting a bizarre path.

Here are our options, I think:

Thoughts?

simon-weber commented 7 years ago

Looking at the environment turned out pretty easy, and it looks like windows tends to respect the http_proxy env var as well. I'm going to test this out with our setup; in the meantime, suggestions on how to test this would be good (eg do you want a proxy running during tests to prove these still get proxied?).

simon-weber commented 7 years ago

I ran into a really hairy problem with https proxying, where things will break in the combination of:

For some reason, this causes urllib3/httplib to incorrectly set the proxied connection's host to the tunneled host, resulting in a connection to the tunneled host, with a CONNECT to the same host. In my case, this causes the server to disconnect immediately without a response. I'm looking to see if there's a spot where I can just set it back to work around it, or if I need to actually figure out what's causing this.

simon-weber commented 7 years ago

Ok, I got everything working for our setup. https://github.com/kevin1024/vcrpy/commit/80550dbb64052aec857c391d1654a33495f8c80f was the solution to what I described above; is_connection_dropped() and auto_open == 0 must be true for urllib3 to know that the connection has been mucked with by httplib and tear it down properly.

@kevin1024 want me to send a PR after I rebase? Or did you want to review the approach and talk about tests first?

kevin1024 commented 7 years ago

Wow, thanks for your work on this. Tests are a pain. I wonder how hard it would be to build a pytest fixture that is a simple proxy kind of like pytest-httpbin. Just thinking out loud here.

kevin1024 commented 7 years ago

Ehhh not to excited about adding some kind of pysocks dependency to run tests

kevin1024 commented 7 years ago

Could use an open online proxy to run a couple tests but that will make tests pretty flakey

agriffis commented 7 years ago

Ehhh not to excited about adding some kind of pysocks dependency to run tests

Why is that? IMHO dependencies for devel/tests are totally fine. I only worry about adding deps when it affects install. In particular, I'd rather add dev deps to keep things self-contained than depend on an external server.

colonelpanic8 commented 7 years ago

+1 to @agriffis' sentiment

kevin1024 commented 7 years ago

Why is that? IMHO dependencies for devel/tests are totally fine.

Fair enough. It just makes it harder for the distros that package VCR since they then need to create packages for the dependencies, even though they are only used to run tests, since it some of them seem to run the test suite as part of their build process.

That said maybe we can add a little proxy server but mark the tests as optional or something. Run them in our CI server but if a distro doesn't want to package the extra dependency then they can just ignore those tests.

agriffis commented 7 years ago

It just makes it harder for the distros that package VCR since they then need to create packages for the dependencies, even though they are only used to run tests, since it some of them seem to run the test suite as part of their build process.

Huh, interesting. Despite coming from a distro background, I tend to think of Python packages as something installed in virtualenv and never on the distro, especially testing tools like VCR.

That said maybe we can add a little proxy server but mark the tests as optional or something. Run them in our CI server but if a distro doesn't want to package the extra dependency then they can just ignore those tests.

Either that or wait for a distro to complain. Are you sure this problem is more than just theoretical?

P.S. Sorry for spouting an opinion but not offering to help. I'm busy with other things at the moment but will probably cycle around to VCR again in the future, especially when I'm building tests again using it.

simon-weber commented 7 years ago

I'll poke around for a pure-python proxy we can add as a test dependency.

nysthee commented 7 years ago

Hey some of my cassettes re-record when using on a CI environment could this be related to this?

simon-weber commented 7 years ago

Possibly. You could check the environment when your code runs -- this problem should only occur with proxies set via http_proxy and similar variables.

samuelfekete commented 7 years ago

I have submitted a pull request which fixes this - at least for my use-case, which is boto3 using urllib3.

hartwork commented 7 years ago

This issue also hits with test suites utilizing VCR that are executed from within a Debian package building process (say during debuild -us -uc) since PyBuild of dh-python uses http[s]://127.0.0.1:9/ for http_proxy and https_proxy. The precise error is:

ValueError: invalid literal for int() with base 10: '9http'

For the next one with this problem where patching VCR is not an option, here's a demo of my current workaround:

# Copyright (C) 2017 Sebastian Pipping
# Licensed under CC0 1.0 Public Domain Dedication
from unittest import TestCase

from mock import patch
from vcr_unittest import VCRMixin  # https://github.com/agriffis/vcrpy-unittest

class MyTest(VCRMixin, TestCase):

    # dh-python interferes with VCR, so we need to patch http_proxy away
    # $ fgrep 127.0.0.1:9 /usr/share/dh-python/pybuild
    #         env['http_proxy'] = 'http://127.0.0.1:9/'
    #         env['https_proxy'] = 'https://127.0.0.1:9/'
    _proxy_patcher = patch('six.moves.urllib.request.Request.set_proxy')

    @classmethod
    def setUpClass(cls):
        cls._proxy_patcher.start()
        super(MyTest, cls).setUpClass()

    @classmethod
    def tearDownClass(cls):
        cls._proxy_patcher.stop()
        super(MyTest, cls).tearDownClass()

Best, Sebastian

neozenith commented 4 years ago

A lot of changes have happened to VCRpy since this ticket was opened. As this ticket has become stale and we have a lot of old tickets that need to be groomed, I'm closing this for now to make it easier to see what is still relevant.

However if it is still needed, please feel free to re-open or create a new ticket.

Thanks! 🙏

hartwork commented 4 years ago

I still see tests fail http_proxy when is in the environment, e.g. http_proxy=http://localhost:3128 PYTHONPATH=. pytest. So for something like Debian packaging, https://github.com/kevin1024/vcrpy/issues/214#issuecomment-315445044 still seems to apply. @neozenith should I turn that into a new ticket?

neozenith commented 4 years ago

Hi, I just wanted to update active PRs and issues to let you know I will no longer be an active maintainer on this project.

I recently changed jobs where I no longer use VCRpy or Python at all so I don't have the bandwidth anymore. I won't be monitoring issues or PRs either.

I will have to delegate to other maintainers or ask for some contributors to step up.

I know the Azure CLI uses VCRpy so it would be nice if they could spare some sprint time to maintain this project as it was invaluable in our CI when I was using it.

Kind regards, Josh

avalentino commented 4 months ago

Any new on this issue? I'm currently experienting the same issue.