jebena / jebena-python-client

Simple Python Client for the Jebena API Server
Mozilla Public License 2.0

Request timeout hard-coded. #5

Closed: this-vidor closed this issue 3 years ago

this-vidor commented 3 years ago

https://github.com/jebena/jebena-python-client/blob/1f3ab185dd817ee057b2b2e20b98a553d2f98b56/jebenaclient/jebenaclient.py#L324


Many requests take longer than 10 seconds! Especially for a generic tool, this should be configurable by argument.
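
Something like this, say (the timeout_in_seconds keyword is hypothetical, just to illustrate the request; it is not in the current client):

import jebenaclient

query = "query { apiVersion }"  # any GQL query; apiVersion is a placeholder field

# Hypothetical keyword argument letting callers raise the 10-second
# default for long-running queries:
result = jebenaclient.run_query(
    query,
    return_instead_of_raise_on_errors=True,
    timeout_in_seconds=120,
)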

jpotter commented 3 years ago

@this-vidor What's the actual error you get? I.e. stack trace or a STDERR message that the client prints?

IIRC, this should only be setting a timeout on opening the socket connection, but I can investigate! We should certainly support queries that themselves take longer than 10 seconds.

this-vidor commented 3 years ago
In [31]: result = jebenaclient.run_query(query, return_instead_of_raise_on_errors=True)                                       
Socket Timeout error
Failed to fetch response from https://api.enveritas.org/v1/; retrying in 14 seconds; 2 attempts left.
Socket Timeout error
Failed to fetch response from https://api.enveritas.org/v1/; retrying in 32 seconds; 1 attempts left.
Socket Timeout error
---------------------------------------------------------------------------
timeout                                   Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/jebenaclient/jebenaclient.py in _execute_gql_query(api_endpoint, query, variables, allow_insecure_https, api_key_name, api_secret_key, retries_allowed, skip_logging_transient_errors)
    329                 context=context,
--> 330                 timeout=connection_timeout_in_seconds
    331             )  # nosec

/usr/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 

/usr/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    525 
--> 526         response = self._open(req, data)
    527 

/usr/lib/python3.6/urllib/request.py in _open(self, req, data)
    543         result = self._call_chain(self.handle_open, protocol, protocol +
--> 544                                   '_open', req)
    545         if result:

/usr/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    503             func = getattr(handler, meth_name)
--> 504             result = func(*args)
    505             if result is not None:

/usr/lib/python3.6/urllib/request.py in https_open(self, req)
   1360             return self.do_open(http.client.HTTPSConnection, req,
-> 1361                 context=self._context, check_hostname=self._check_hostname)
   1362 

/usr/lib/python3.6/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1320                 raise URLError(err)
-> 1321             r = h.getresponse()
   1322         except:

/usr/lib/python3.6/http/client.py in getresponse(self)
   1330             try:
-> 1331                 response.begin()
   1332             except ConnectionError:

/usr/lib/python3.6/http/client.py in begin(self)
    296         while True:
--> 297             version, status, reason = self._read_status()
    298             if status != CONTINUE:

/usr/lib/python3.6/http/client.py in _read_status(self)
    257     def _read_status(self):
--> 258         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    259         if len(line) > _MAXLINE:

/usr/lib/python3.6/socket.py in readinto(self, b)
    585             try:
--> 586                 return self._sock.recv_into(b)
    587             except timeout:

/usr/lib/python3.6/ssl.py in recv_into(self, buffer, nbytes, flags)
   1011                   self.__class__)
-> 1012             return self.read(nbytes, buffer)
   1013         else:

/usr/lib/python3.6/ssl.py in read(self, len, buffer)
    873         try:
--> 874             return self._sslobj.read(len, buffer)
    875         except SSLError as x:

/usr/lib/python3.6/ssl.py in read(self, len, buffer)
    630         if buffer is not None:
--> 631             v = self._sslobj.read(len, buffer)
    632         else:

timeout: The read operation timed out

jpotter commented 3 years ago

@this-vidor Thanks. This seems really strange, given you're connecting from a high-speed connection on the east coast of the US to an AWS setup that's also US East coast. Can we keep this open for a while and see if you encounter it again? A 10-second timeout for connecting to an ELB is an eternity in this scenario and suggests something else is amiss. I'd rather figure out what's taking longer than 10 seconds elsewhere and resolve that, instead of paving over the issue with a longer timeout here.

this-vidor commented 3 years ago

@jpotter That's fine as long as this is a timeout on the server acknowledging the request from the client, and not a timeout on the client receiving a response from the server.

My highest concern, as with previous issues, is to make sure that when the client makes a request to the server that could mutate data on it, nothing in the client code gets in the way of being able to read the returned information, be it by swallowing error messages, timing out "early", etc.

jpotter commented 3 years ago

@this-vidor I had a chance to look more into this, and here's what I've learned:

1) urllib2 has ... a history? ... around the timeout issue. See https://bugs.python.org/issue4079 but the gist of it is: the API design for these functions is funky and not so well thought out. Setting the timeout as our code does sets it for the whole transaction, not just the open() part (see the sketch after this list). My mistake, and clearly not what we want. We should accept GQL queries that take longer than 10 seconds.

2) We're using urllib2 specifically so that Python 2.7 environments can use the client library. (Yes, I know it's 2021.) We could migrate to newer, better libraries, but for what we're doing, it's not worth it (yet).

3) Without any timeout set, the socket connections can block forever. We want some timeout, so that processes that hit the (very, very small) probability of a socket failing to open or close correctly don't hang forever. Something like 5 minutes seems like a reasonable "crazy upper bound" here ... no GQL query should be taking that long, and yet 5 minutes is short enough that anyone running a thing via cron won't see process stack-ups. So my proposal here is to set this to 300 seconds and go on with things. This isn't a parameter that a consumer of this library should have to know to twiddle; the reason we set it is to prevent stale processes -- i.e. making sure our library has some safeguard to clean up after network hiccups.
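
To make 1) and 3) concrete, a minimal sketch (the request construction here is illustrative; the real call lives in _execute_gql_query, and connection_timeout_in_seconds is the variable visible in the traceback above):

import urllib.request

# Point 3: a generous "crazy upper bound", not a knob callers should tune.
connection_timeout_in_seconds = 300

req = urllib.request.Request(
    "https://api.enveritas.org/v1/",            # endpoint from the logs above
    data=b'{"query": "{ apiVersion }"}',        # placeholder GQL payload
    headers={"Content-Type": "application/json"},
)

# Point 1: urlopen applies this timeout to the underlying socket, so it bounds
# every blocking operation -- the connect *and* each read of the response --
# not just opening the connection. That's why a slow query tripped the old
# 10-second value with "timeout: The read operation timed out".
response = urllib.request.urlopen(req, timeout=connection_timeout_in_seconds)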

Regarding your concern about the failure scenario where a client sends a mutation to the server, it executes, and yet the client fails to receive a reply (say, a shuttered connection): this is essentially impossible to guard against completely. The solutions to this scenario are either 1) some sort of async worker queue (i.e. a mutation to the server enqueues a job with a UUID of the client's choosing, and then the client can poll for updates on that enqueued job); or 2) making the mutation requests replay-safe (i.e. "set value to X" as opposed to "increment value by 1"); or 3) some sort of only-if-modified-match (i.e. "increment value by 1 if last-modified-time="). Each of these setups has its pros and cons.

We're keeping our systems simple (a single read-write point in Postgres), and we don't have any (current) needs that get into fancy distributed data stores. Because of this, it's probably easiest for us as API designers to make sure we don't expose API calls that are "increment by 1", and, for any call where a replay would be dangerous, to provide an only-if-modified-equals parameter. Nutshell: it's really hard to guarantee a response to a mutation is received in a single-transaction setup (see https://en.wikipedia.org/wiki/Two-phase_commit_protocol for rabbit-hole material).
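
For illustration, here's what those options look like as GQL strings (these mutation names and arguments are invented for the sketch; they are not part of the Jebena API):

# Replay-UNSAFE: if the reply is lost and the client retries, this double-counts.
unsafe_mutation = 'mutation { incrementCounter(by: 1) { value } }'

# Replay-safe: absolute writes are idempotent, so a blind retry is harmless.
safe_mutation = 'mutation { setCounter(value: 42) { value } }'

# Only-if-modified-match: the server refuses the write if the row changed
# since the client last read it, so a replay can't clobber newer data.
guarded_mutation = (
    'mutation { setCounter(value: 42, '
    'onlyIfLastModifiedEquals: "2021-06-01T12:00:00Z") { value } }'
)

# Any of these would be sent the usual way, e.g.:
# result = jebenaclient.run_query(safe_mutation)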