scopely-devops / skew


[BUG?] Ungraceful handling of blocking IO #181

Open ghost opened 2 years ago

ghost commented 2 years ago

Summary

So I've been playing with skew for a week or so now, and whilst it principally works for what I want it to do at a functional level, I've found that something is causing it to handle blocking IO very poorly (not at all, in fact).

The tests I've been running are to scan my own AWS account for S3 buckets. Skew works perfectly for finding the 3 buckets I expected it to, but then never returns.

At first I thought I was being impatient, so I decided to leave it running overnight and through the morning. The function still never returned. I understand that scanning all of AWS is a potentially non-trivial task, but more than 16 hours? Something must be up.

Investigation

The first thing I noticed while trying to work out what was going on was (when it worked) the dump from a keyboard interrupt when running in the console:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ares/.local/lib/python3.8/site-packages/skew/arn/__init__.py", line 323, in __iter__
    for scheme in self.scheme.enumerate(context, **self.kwargs):
  File "/home/ares/.local/lib/python3.8/site-packages/skew/arn/__init__.py", line 247, in enumerate
    for provider in self._arn.provider.enumerate(
  File "/home/ares/.local/lib/python3.8/site-packages/skew/arn/__init__.py", line 232, in enumerate
    for service in self._arn.service.enumerate(
  File "/home/ares/.local/lib/python3.8/site-packages/skew/arn/__init__.py", line 217, in enumerate
    for region in self._arn.region.enumerate(
  File "/home/ares/.local/lib/python3.8/site-packages/skew/arn/__init__.py", line 198, in enumerate
    for account in self._arn.account.enumerate(
  File "/home/ares/.local/lib/python3.8/site-packages/skew/arn/__init__.py", line 145, in enumerate
    for resource in self._arn.resource.enumerate(
  File "/home/ares/.local/lib/python3.8/site-packages/skew/arn/__init__.py", line 127, in enumerate
    resources.extend(resource_cls.enumerate(
  File "/home/ares/.local/lib/python3.8/site-packages/skew/resources/aws/s3.py", line 27, in enumerate
    resources = super(Bucket, cls).enumerate(arn, region, account,
  File "/home/ares/.local/lib/python3.8/site-packages/skew/resources/resource.py", line 54, in enumerate
    data = client.call(enum_op, query=path, **kwargs)
  File "/home/ares/.local/lib/python3.8/site-packages/skew/awsclient.py", line 127, in call
    data = op(**kwargs)
  File "/home/ares/.local/lib/python3.8/site-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ares/.local/lib/python3.8/site-packages/botocore/client.py", line 691, in _make_api_call
    http, parsed_response = self._make_request(
  File "/home/ares/.local/lib/python3.8/site-packages/botocore/client.py", line 711, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/ares/.local/lib/python3.8/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/ares/.local/lib/python3.8/site-packages/botocore/endpoint.py", line 134, in _send_request    success_response, exception = self._get_response(
  File "/home/ares/.local/lib/python3.8/site-packages/botocore/endpoint.py", line 166, in _get_response    success_response, exception = self._do_get_response(
  File "/home/ares/.local/lib/python3.8/site-packages/botocore/endpoint.py", line 200, in _do_get_response
    http_response = self._send(request)
  File "/home/ares/.local/lib/python3.8/site-packages/botocore/endpoint.py", line 269, in _send
    return self.http_session.send(request)
  File "/home/ares/.local/lib/python3.8/site-packages/botocore/httpsession.py", line 344, in send
    urllib_response = conn.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1344, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.8/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
>>>

Diving into the codebase, skew wraps boto3, which in turn uses (to no great surprise) urllib3 to manage its requests to AWS.

So cool, it's just urllib3 under the hood, and that's where it's blocking. urllib3 doesn't set a default timeout on requests, but you can certainly set one. So I used socket.setdefaulttimeout(60) to set a 60-second timeout (pretty fair, IMO), and suddenly skew (or more specifically boto3) doesn't work at all. Even the original buckets that I expected to see listed are not returned. What makes this slightly more confusing is that it doesn't actually raise a timeout exception; I'm not even convinced the requests are being made.
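
For reference, this is roughly how I was applying the timeout (a sketch only; the scan() call and the S3 ARN pattern are illustrative, the important part is setting the timeout before any connections are opened):

import socket

from skew import scan

# Process-wide socket timeout, set before skew/boto3 open any connections.
# With this in place the scan stopped returning anything at all for me.
socket.setdefaulttimeout(60)

for resource in scan('arn:aws:s3:*:*:bucket/*'):
    print(resource.arn)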

With messing with the timeout not working, I thought: what if I just force the thread to exit? So, taking various forms of inspiration from this thread, I tried forcing skew to exit early.

Context Manager

This context manager didn't work at all; skew just didn't stop:

import signal
from contextlib import contextmanager


@contextmanager
def timeout(duration):
    def timeout_handler(signum, frame):
        raise BlockingIOError(f'Function timed out after {duration} seconds')

    # SIGALRM fires after `duration` seconds and the handler raises in the main thread
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(duration)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel the alarm even if the body raised
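
For completeness, this is roughly how I wrapped the scan with it (usage sketch; scan() and the ARN pattern are illustrative):

from skew import scan

# Hypothetical usage of the timeout() context manager above; in practice the
# alarm never seemed to interrupt the blocked call and skew carried on hanging.
with timeout(300):
    buckets = list(scan('arn:aws:s3:*:*:bucket/*'))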

Thread Decorator

This thread decorator saw better results, in that the quit_function call is actually made (the log.info call fires), but again skew doesn't exit gracefully and just hangs:

import logging
import threading
import _thread as thread

log = logging.getLogger(__name__)


def quit_function(fn_name):
    log.info(f'{fn_name} took too long')
    thread.interrupt_main()  # raises KeyboardInterrupt in the main thread


def exit_after(s):
    """
    Use as a decorator to interrupt the process if the
    decorated function takes longer than s seconds.
    """

    def outer(fn):
        def inner(*args, **kwargs):
            # fire quit_function from a timer thread after s seconds
            timer = threading.Timer(s, quit_function, args=[fn.__name__])
            timer.start()
            try:
                result = fn(*args, **kwargs)
            finally:
                timer.cancel()
            return result

        return inner

    return outer
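
And applied to the scan (again a sketch; the wrapper function is hypothetical):

from skew import scan

# Hypothetical wrapper decorated with exit_after(); quit_function does fire
# (the log line appears), but the scan still hangs rather than unwinding.
@exit_after(300)
def scan_buckets():
    return list(scan('arn:aws:s3:*:*:bucket/*'))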

I'm honestly running out of ideas as to where to go looking for this one. My actual use case is to run skew behind a serverless function endpoint that I can just call, but those have well-defined timeouts that I can't seem to get skew to respect.

Any ideas?

ghost commented 2 years ago

Okay so I've been diving into this one a little more and found part of a root cause.

As skew iterates through the calls required to find each bucket, if it encounters a region that the account has not enabled (such as eu-south-1, which is disabled by default), it begins to loop endlessly. This is probably a failure mode that should be caught and handled.
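
For illustration, this is roughly what that failure looks like at the plain boto3 level, where it surfaces as a ClientError that could simply be caught and the region skipped (sketch; the region and the handling are illustrative):

import boto3
from botocore.exceptions import ClientError

# Calling ListBuckets via an endpoint in a region the account has not enabled
# comes back as an error response rather than a bucket list.
s3 = boto3.client('s3', region_name='ap-east-1')
try:
    s3.list_buckets()
except ClientError as e:
    # e.g. skip the region and move on instead of looping
    print(f"skipping region: {e.response['Error']['Code']}")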

Log sample. Once it hits the regions that are not enabled by default, this piece of the log loops endlessly:

DEBUG:botocore.hooks:Event needs-retry.s3.ListBuckets: calling handler <botocore.retryhandler.RetryHandler object at 0x7fedbc4545e0>
DEBUG:botocore.retryhandler:No retry needed.
DEBUG:botocore.hooks:Event needs-retry.s3.ListBuckets: calling handler <bound method S3RegionRedirector.redirect_from_error of <botocore.utils.S3RegionRedirector object at 0x7fedbc454040>>
DEBUG:botocore.hooks:Event before-parameter-build.s3.ListBuckets: calling handler <function validate_bucket_name at 0x7fedc124d4c0>
DEBUG:botocore.hooks:Event before-parameter-build.s3.ListBuckets: calling handler <bound method S3RegionRedirector.redirect_from_cache of <botocore.utils.S3RegionRedirector object at 0x7fedbc454040>>
DEBUG:botocore.hooks:Event before-parameter-build.s3.ListBuckets: calling handler <bound method S3ArnParamHandler.handle_arn of <botocore.utils.S3ArnParamHandler object at 0x7fedbc454130>>
DEBUG:botocore.hooks:Event before-parameter-build.s3.ListBuckets: calling handler <function generate_idempotent_uuid at 0x7fedc124d310>
DEBUG:botocore.hooks:Event before-call.s3.ListBuckets: calling handler <function add_expect_header at 0x7fedc124d820>
DEBUG:botocore.hooks:Event before-call.s3.ListBuckets: calling handler <bound method S3RegionRedirector.set_request_url of <botocore.utils.S3RegionRedirector object at 0x7fedbc454040>>
DEBUG:botocore.hooks:Event before-call.s3.ListBuckets: calling handler <function inject_api_version_header_if_needed at 0x7fedc124bb80>
DEBUG:botocore.endpoint:Making request for OperationModel(name=ListBuckets) with params: {'url_path': '/', 'query_string': '', 'method': 'GET', 'headers': {'User-Agent': 'Boto3/1.18.51 Python/3.8.10 Linux/4.19.128-microsoft-standard Botocore/1.21.54'}, 'body': b'', 'url': 'https://s3.ap-east-1.amazonaws.com/', 'context': {'client_region': 'ap-east-1', 'client_config': <botocore.config.Config object at 0x7fedbf1ee580>, 'has_streaming_input': False, 'auth_type': None, 'signing': {'bucket': None}}}
DEBUG:botocore.hooks:Event request-created.s3.ListBuckets: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7fedbf1ee160>>
DEBUG:botocore.hooks:Event choose-signer.s3.ListBuckets: calling handler <bound method S3EndpointSetter.set_signer of <botocore.utils.S3EndpointSetter object at 0x7fedbc454370>>
DEBUG:botocore.hooks:Event choose-signer.s3.ListBuckets: calling handler <function set_operation_specific_signer at 0x7fedc124d1f0>
DEBUG:botocore.hooks:Event before-sign.s3.ListBuckets: calling handler <bound method S3EndpointSetter.set_endpoint of <botocore.utils.S3EndpointSetter object at 0x7fedbc454370>>
DEBUG:botocore.auth:Calculating signature using v4 auth.
DEBUG:botocore.auth:CanonicalRequest:
GET
/

host:s3.ap-east-1.amazonaws.com
x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
x-amz-date:20211021T013437Z

host;x-amz-content-sha256;x-amz-date
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
DEBUG:botocore.auth:StringToSign:
AWS4-HMAC-SHA256
20211021T013437Z
20211021/ap-east-1/s3/aws4_request
7bd4e093a22164ff1ce64436fab33b99c85fe70d05dd806716821d481178b3f3
DEBUG:botocore.auth:Signature:
88c3fe15890b7be66d756b72cfe21e6f93d286e786aa4323e92ac94ca0318d25
DEBUG:botocore.endpoint:Sending http request: <AWSPreparedRequest stream_output=False, method=GET, url=https://s3.ap-east-1.amazonaws.com/, headers={'User-Agent': b'Boto3/1.18.51 Python/3.8.10 Linux/4.19.128-microsoft-standard Botocore/1.21.54', 'X-Amz-Date': b'20211021T013437Z', 'X-Amz-Content-SHA256': b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'Authorization': b'AWS4-HMAC-SHA256 Credential=AKIAZTZ53RGMUZAGRQVN/20211021/ap-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=88c3fe15890b7be66d756b72cfe21e6f93d286e786aa4323e92ac94ca0318d25'}>
DEBUG:botocore.httpsession:Certificate path: /etc/ssl/certs/ca-certificates.crt
DEBUG:urllib3.connectionpool:https://s3.ap-east-1.amazonaws.com:443 "GET / HTTP/1.1" 403 None
DEBUG:botocore.parsers:Response headers: {'x-amz-request-id': '4CAFB390YR4VHYSA', 'x-amz-id-2': 'IvRMi0g1/8tKgcbl2NN2s9LNKJZ3rTpckhtsa9QQUOuZFiXo2wk4MHx9f2SEsE0G3URzSfbLWvY=', 'Content-Type': 'application/xml', 'Transfer-Encoding': 'chunked', 'Date': 'Thu, 21 Oct 2021 01:34:37 GMT', 'Server': 'AmazonS3'}
DEBUG:botocore.parsers:Response body:
b'<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>AKIAZTZ53RGMUZAGRQVN</AWSAccessKeyId><RequestId>4CAFB390YR4VHYSA</RequestId><HostId>IvRMi0g1/8tKgcbl2NN2s9LNKJZ3rTpckhtsa9QQUOuZFiXo2wk4MHx9f2SEsE0G3URzSfbLWvY=</HostId></Error>'
DEBUG:botocore.hooks:Event needs-retry.s3.ListBuckets: calling handler <botocore.retryhandler.RetryHandler object at 0x7fedbc4545e0>
DEBUG:botocore.retryhandler:No retry needed.
DEBUG:botocore.hooks:Event needs-retry.s3.ListBuckets: calling handler <bound method S3RegionRedirector.redirect_from_error of <botocore.utils.S3RegionRedirector object at 0x7fedbc454040>>
DEBUG:botocore.hooks:Event before-parameter-build.s3.ListBuckets: calling handler <function validate_bucket_name at 0x7fedc124d4c0>
DEBUG:botocore.hooks:Event before-parameter-build.s3.ListBuckets: calling handler <bound method S3RegionRedirector.redirect_from_cache of <botocore.utils.S3RegionRedirector object at 0x7fedbc454040>>
DEBUG:botocore.hooks:Event before-parameter-build.s3.ListBuckets: calling handler <bound method S3ArnParamHandler.handle_arn of <botocore.utils.S3ArnParamHandler object at 0x7fedbc454130>>
ghost commented 2 years ago

For those that need it, you can monkeypatch out the bad regions with the following snippet:

from skew.arn import Region

# The keys we use don't propagate to all regions, which breaks boto3, so we
# need to patch the set of regions that skew looks up.
bad_regions = ['eu-south-1', 'ap-east-1', 'af-south-1', 'me-south-1', 'cn-north-1', 'cn-northwest-1']
# Assign a list (not a lazy filter object) so the attribute survives repeated iteration.
Region._all_region_names = [r for r in Region._all_region_names if r not in bad_regions]
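
A variation on the same idea (a sketch, assuming the same Region._all_region_names attribute) is to derive the disabled opt-in regions dynamically from EC2's DescribeRegions instead of hardcoding them; note this won't cover the cn-* partition regions from the list above:

import boto3

from skew.arn import Region

# Ask EC2 which regions the account has not opted into; OptInStatus is
# 'not-opted-in' for opt-in regions that are still disabled.
ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.describe_regions(AllRegions=True)
bad_regions = {r['RegionName'] for r in response['Regions']
               if r['OptInStatus'] == 'not-opted-in'}

Region._all_region_names = [r for r in Region._all_region_names
                            if r not in bad_regions]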