gnocchixyz / gnocchi

Timeseries database
Apache License 2.0
300 stars 85 forks source link

Swift connection pool size is too small #509

Open jd opened 6 years ago

jd commented 6 years ago

When Swift is used as an incoming measures driver, the number of connections opened by httplib3 is too small to succeed. On large batches, the error occurs:

[Wed Nov 22 16:17:50.622612 2017] [:error] [pid 31427] 2017-11-22 16:17:50,622 [31427] WARNING  urllib3.connectionpool: Connection pool is full, discarding connection: 10.41.11.12
[Wed Nov 22 16:17:50.625786 2017] [:error] [pid 31427] [remote ::1:80] mod_wsgi (pid=31427): Exception occurred processing WSGI script '/bin/gnocchi-api'.
[Wed Nov 22 16:17:50.625846 2017] [:error] [pid 31427] [remote ::1:80] Traceback (most recent call last):
[Wed Nov 22 16:17:50.625898 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/webob/dec.py", line 131, in __call__
[Wed Nov 22 16:17:50.625950 2017] [:error] [pid 31427] [remote ::1:80]     resp = self.call_func(req, *args, **self.kwargs)
[Wed Nov 22 16:17:50.625961 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/webob/dec.py", line 196, in call_func
[Wed Nov 22 16:17:50.625975 2017] [:error] [pid 31427] [remote ::1:80]     return self.func(req, *args, **kwargs)
[Wed Nov 22 16:17:50.625991 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/oslo_middleware/base.py", line 131, in __call__
[Wed Nov 22 16:17:50.626008 2017] [:error] [pid 31427] [remote ::1:80]     response = req.get_response(self.application)
[Wed Nov 22 16:17:50.626016 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 1327, in send
[Wed Nov 22 16:17:50.626031 2017] [:error] [pid 31427] [remote ::1:80]     application, catch_exc_info=False)
[Wed Nov 22 16:17:50.626039 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 1291, in call_application
[Wed Nov 22 16:17:50.626052 2017] [:error] [pid 31427] [remote ::1:80]     app_iter = application(self.environ, start_response)
[Wed Nov 22 16:17:50.626060 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/paste/urlmap.py", line 216, in __call__
[Wed Nov 22 16:17:50.626075 2017] [:error] [pid 31427] [remote ::1:80]     return app(environ, start_response)
[Wed Nov 22 16:17:50.626109 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/webob/dec.py", line 131, in __call__
[Wed Nov 22 16:17:50.626127 2017] [:error] [pid 31427] [remote ::1:80]     resp = self.call_func(req, *args, **self.kwargs)
[Wed Nov 22 16:17:50.626145 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/webob/dec.py", line 196, in call_func
[Wed Nov 22 16:17:50.626159 2017] [:error] [pid 31427] [remote ::1:80]     return self.func(req, *args, **kwargs)
[Wed Nov 22 16:17:50.626168 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/oslo_middleware/base.py", line 131, in __call__
[Wed Nov 22 16:17:50.626182 2017] [:error] [pid 31427] [remote ::1:80]     response = req.get_response(self.application)
[Wed Nov 22 16:17:50.626190 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 1327, in send
[Wed Nov 22 16:17:50.626204 2017] [:error] [pid 31427] [remote ::1:80]     application, catch_exc_info=False)
[Wed Nov 22 16:17:50.626212 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 1291, in call_application
[Wed Nov 22 16:17:50.626226 2017] [:error] [pid 31427] [remote ::1:80]     app_iter = application(self.environ, start_response)
[Wed Nov 22 16:17:50.626235 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/webob/exc.py", line 1162, in __call__
[Wed Nov 22 16:17:50.626263 2017] [:error] [pid 31427] [remote ::1:80]     return self.application(environ, start_response)
[Wed Nov 22 16:17:50.626274 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/app.py", line 71, in __call__
[Wed Nov 22 16:17:50.626290 2017] [:error] [pid 31427] [remote ::1:80]     return self.app(environ, start_response)
[Wed Nov 22 16:17:50.626300 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/pecan/middleware/recursive.py", line 56, in __call__
[Wed Nov 22 16:17:50.626316 2017] [:error] [pid 31427] [remote ::1:80]     return self.application(environ, start_response)
[Wed Nov 22 16:17:50.626325 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/pecan/core.py", line 840, in __call__
[Wed Nov 22 16:17:50.626339 2017] [:error] [pid 31427] [remote ::1:80]     return super(Pecan, self).__call__(environ, start_response)
[Wed Nov 22 16:17:50.626348 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/pecan/core.py", line 683, in __call__
[Wed Nov 22 16:17:50.626361 2017] [:error] [pid 31427] [remote ::1:80]     self.invoke_controller(controller, args, kwargs, state)
[Wed Nov 22 16:17:50.626376 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/pecan/core.py", line 574, in invoke_controller
[Wed Nov 22 16:17:50.626401 2017] [:error] [pid 31427] [remote ::1:80]     result = controller(*args, **kwargs)
[Wed Nov 22 16:17:50.626409 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/gnocchi/rest/api.py", line 1588, in post
[Wed Nov 22 16:17:50.626422 2017] [:error] [pid 31427] [remote ::1:80]     for metric in known_metrics))
[Wed Nov 22 16:17:50.626435 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/gnocchi/incoming/__init__.py", line 128, in add_measures_batch
[Wed Nov 22 16:17:50.626450 2017] [:error] [pid 31427] [remote ::1:80]     in six.iteritems(metrics_and_measures)))
[Wed Nov 22 16:17:50.626458 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/gnocchi/utils.py", line 312, in parallel_map
[Wed Nov 22 16:17:50.626471 2017] [:error] [pid 31427] [remote ::1:80]     return list(executor.map(lambda args: fn(*args), list_of_args))
[Wed Nov 22 16:17:50.626479 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/concurrent/futures/_base.py", line 605, in result_iterator
[Wed Nov 22 16:17:50.626493 2017] [:error] [pid 31427] [remote ::1:80]     yield future.result()
[Wed Nov 22 16:17:50.626507 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/concurrent/futures/_base.py", line 429, in result
[Wed Nov 22 16:17:50.626519 2017] [:error] [pid 31427] [remote ::1:80]     return self.__get_result()
[Wed Nov 22 16:17:50.626527 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/concurrent/futures/thread.py", line 62, in run
[Wed Nov 22 16:17:50.626540 2017] [:error] [pid 31427] [remote ::1:80]     result = self.fn(*self.args, **self.kwargs)
[Wed Nov 22 16:17:50.626548 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/gnocchi/utils.py", line 312, in <lambda>
[Wed Nov 22 16:17:50.626560 2017] [:error] [pid 31427] [remote ::1:80]     return list(executor.map(lambda args: fn(*args), list_of_args))
[Wed Nov 22 16:17:50.626567 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/gnocchi/incoming/swift.py", line 58, in _store_new_measures
[Wed Nov 22 16:17:50.626580 2017] [:error] [pid 31427] [remote ::1:80]     data)
[Wed Nov 22 16:17:50.626588 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1829, in put_object
[Wed Nov 22 16:17:50.626602 2017] [:error] [pid 31427] [remote ::1:80]     response_dict=response_dict)
[Wed Nov 22 16:17:50.626610 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1678, in _retry
[Wed Nov 22 16:17:50.626627 2017] [:error] [pid 31427] [remote ::1:80]     service_token=self.service_token, **kwargs)
[Wed Nov 22 16:17:50.626635 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1317, in put_object
[Wed Nov 22 16:17:50.626647 2017] [:error] [pid 31427] [remote ::1:80]     conn.request('PUT', path, contents, headers)
[Wed Nov 22 16:17:50.626654 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 439, in request
[Wed Nov 22 16:17:50.626666 2017] [:error] [pid 31427] [remote ::1:80]     files=files, **self.requests_args)
[Wed Nov 22 16:17:50.626673 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 422, in _request
[Wed Nov 22 16:17:50.626685 2017] [:error] [pid 31427] [remote ::1:80]     return self.request_session.request(*arg, **kwarg)
[Wed Nov 22 16:17:50.626693 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
[Wed Nov 22 16:17:50.626707 2017] [:error] [pid 31427] [remote ::1:80]     resp = self.send(prep, **send_kwargs)
[Wed Nov 22 16:17:50.626714 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
[Wed Nov 22 16:17:50.626726 2017] [:error] [pid 31427] [remote ::1:80]     r = adapter.send(request, **kwargs)
[Wed Nov 22 16:17:50.626733 2017] [:error] [pid 31427] [remote ::1:80]   File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 490, in send
[Wed Nov 22 16:17:50.626746 2017] [:error] [pid 31427] [remote ::1:80]     raise ConnectionError(err, request=request)
[Wed Nov 22 16:17:50.626771 2017] [:error] [pid 31427] [remote ::1:80] ConnectionError: ('Connection aborted.', BadStatusLine("''",))
sileht commented 6 years ago

urllib3 connection pool size is handled by python-request and can be configured like.

a = requests.adapters.HTTPAdapter(pool_maxsize=N)
s = requests.Session()
s.mount('https://', a)
s.mount('https://', a)

But the request.Session() used by swiftclient is not customisable: https://github.com/openstack/python-swiftclient/blob/master/swiftclient/client.py#L384

swiftclient have to be fixed first...

jd commented 6 years ago

There is an upstream bug opened here https://bugs.launchpad.net/python-swiftclient/+bug/1671949

zaitcev commented 6 years ago

Please pardon my lack of understanding, but I do not follow how the underlying symptom ended with a desire to make the connection pool larger. I say that because, I think, BadStatusLine('') means that the client opens the TCP connection, possibly establishes TLS, sends PUT down, but a read() from the socket ends with EOF. Either Swift's own proxy or its LB (HAproxy in case of RHOS) closed the connection without returning anything.

If we look into urllib3/connectionpool.py:HTTPConnectionPool.urlopen(), then indeed there's no telling who's raising the HTTPException - _get_conn() or _make_request(). Both of them end with "Connection aborted." But it seems to me that only _make_request() can produce BadStatusLine('').

But if so, sizing the pool isn't going to do Gnocchi any good. The Swift or HAproxy which it is using must be tuned for larger number of connections. Heck I'd have a look if someone's segfaulting. As long as it's only a traceback, Swift is good at returning at least 500. But in this case, see the scenario above.

Julien, is there something that am I missing here?

jd commented 6 years ago

Either Swift's own proxy or its LB (HAproxy in case of RHOS) closed the connection without returning anything.

I did not have time to dig into that back then, but the swift CLI tool was working properly, and gnocchi-upgrade ran properly. So Swift was functioning.

I don't know what kind of deployment it was and @Prophidys could give us a hint here. We may be able to access the system to take a look

The thing is that as you say, there is a high number of connections opened by Gnocchi due to the way it parallelizes the tasks. Maybe one the source of the problem is that Swift proxy was not configured to handle that many connections (though I kinda doubt it), though the connection pool not being configurable is still a problem :)