Hi @c0c0n3, if no one is working on this, then please assign this issue to me. Thanks
Hi @c0c0n3, I am trying to reproduce this issue: when CrateDB is not available, QL retries the connection with a warning and gives a 500 response.
As per my understanding, if CrateDB is not available it will retry connecting to the DB until a connection is available. Please confirm my understanding. Thanks :)
@daminichopra thanks for this. Looking at the trace you posted, I suppose you're right: it looks like the Crate HTTP client already implements retries through urllib3. I could see no timestamps in the trace though; can you check how far apart those retries are?
Hi @c0c0n3, the retry timeout is set to socket._GLOBAL_DEFAULT_TIMEOUT, which has the value None, and a value of None means new socket objects have an infinite timeout. Please see the URLs below for the details:
https://github.com/urllib3/urllib3/blob/4903840bf36a05bcc8299f6553ff7a1816d4aa63/src/urllib3/connection.py#L236
https://github.com/urllib3/urllib3/blob/d369183f60749d735f0676efb22ecd64fc694768/src/urllib3/util/timeout.py#L93
https://www.cmi.ac.in/~madhavan/courses/prog2-2015/docs/python-3.4.2-docs-html/library/socket.html
https://kite.com/python/docs/urllib3.Timeout
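You can check the "None means infinite" part with the standard library alone; socket.getdefaulttimeout() returns the global default, and None leaves new sockets in blocking mode:

```python
import socket

# Unless the application overrides it, the global default timeout is None,
# and a None timeout puts new sockets in blocking mode: they wait forever.
print(socket.getdefaulttimeout())  # -> None
```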
Hi @daminichopra, thanks for the detailed analysis! :-)
Going to collect all we know about the issue at this stage down below.
1. QL uses urllib3 under the bonnet to connect to Crate.
...
5. urllib3 waits indefinitely to establish a TCP connection with Crate---I think this refers to the 3-way TCP handshake; if the server process isn't listening on the port, the lib returns immediately with an error.
6. urllib3 waits indefinitely between reading two TCP packets.

(5) and (6) mean a slow server in the pool would slow the client too, since the client wouldn't try connecting to the next server in the HA pool. While this will probably require some tuning in the future, I'd leave (5) and (6) out for now since it doesn't look like those timeout settings could possibly have caused those 500s we experienced in prod.
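For reference, bounding (5) and (6) in urllib3 would look something like the sketch below---not something this issue needs, just to show which knobs are involved (the 2s/10s values are arbitrary):

```python
import urllib3

# Cap the TCP handshake (point 5) at 2s and the wait between reading two
# packets (point 6) at 10s, instead of the default "wait indefinitely".
pool = urllib3.PoolManager(timeout=urllib3.Timeout(connect=2.0, read=10.0))
```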
Regarding (3) though, do you know how far apart those retries are? Looking at the trace you posted earlier, this should happen in the Crate client?
Hi @c0c0n3, I have investigated this further. Initially the client retries the connection (10 times in total), which takes approximately 12 milliseconds.
The time between the retries is controlled by backoff_factor, which determines the interval between consecutive retry attempts. It is set where the Retry object is created: kwargs['retries'] = Retry(read=0, backoff_factor=0.1)
By default, backoff_factor is disabled (set to 0).
If I set backoff_factor = 0.1, these are the retry intervals I observe:
If I set backoff_factor = 0.2, these are the retry intervals I observe:
Please refer to the links below:
https://github.com/urllib3/urllib3/blob/a9776d15013a7a4f2b92e4d7d1be2b5fe18d43d4/src/urllib3/util/retry.py#L114
https://github.com/urllib3/urllib3/blob/a9776d15013a7a4f2b92e4d7d1be2b5fe18d43d4/src/urllib3/util/retry.py#L170
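To make those intervals concrete, here is a plain-Python restatement of the urllib3 1.x backoff rule documented at the first link (sleep = backoff_factor * 2^(n-1), capped at Retry.BACKOFF_MAX, i.e. 120s):

```python
# urllib3 1.x sleeps backoff_factor * (2 ** (n - 1)) seconds before retry
# number n, capped at 120s; the first retry fires with no delay at all.
def backoff_time(backoff_factor: float, consecutive_errors: int) -> float:
    if consecutive_errors <= 1:
        return 0.0
    return min(120.0, backoff_factor * (2 ** (consecutive_errors - 1)))

# With backoff_factor = 0.1 the waits grow as 0s, 0.2s, 0.4s, 0.8s, 1.6s.
print([backoff_time(0.1, n) for n in range(1, 6)])  # [0.0, 0.2, 0.4, 0.8, 1.6]
```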
Hi @daminichopra this is great! Excellent investigation, you're a star. I suppose then all we need to do for this issue is tweak the backoff param. Do you think we could make this configurable, e.g. through an env variable, so that different QL deployments could specify different backoff values?
Had a quick look at the code base; we've got util classes to read env vars---see util/cfgreader. Perhaps you could do something similar to PostgresConnectionData (in translators/timescale), which uses EnvReader from util/cfgreader.
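Something along these lines, sketched here with plain os.environ since I haven't pinned down the exact EnvReader signature; the CRATE_BACKOFF_FACTOR variable name is made up for illustration:

```python
import os

def crate_backoff_factor(default: float = 0.0) -> float:
    # Read the hypothetical CRATE_BACKOFF_FACTOR env var, falling back to
    # urllib3's default of 0 (i.e. no backoff) if it's unset or malformed.
    try:
        return float(os.environ.get('CRATE_BACKOFF_FACTOR', default))
    except (TypeError, ValueError):
        return default
```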
Hi @c0c0n3, as per the flow of the code, a backoff_factor parameter would need to be added to the Crate client code. QL makes the connection to the server at this location:
https://github.com/smartsdk/ngsi-timeseries-api/blob/dc565af24b303a94f7c298b2567e62487088de3b/src/translators/crate.py#L64
After that it calls methods in http.py (which is part of the crate library), where no parameter is defined for 'backoff_factor'. Please refer to the link below:
https://github.com/crate/crate-python/blob/a2794ac0a7ce358b576eae0b0629e9e974e5472d/src/crate/client/http.py#L397
Only if backoff_factor is added to the kwargs['retries'] = Retry(read=0) call will QL wait a certain time between retries, and for that the crate code needs to be changed.
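To make the ask concrete, the upstream change to crate-python's http.py might amount to something like this sketch (the backoff_factor parameter is proposed here, not existing crate-python API):

```python
from urllib3.util.retry import Retry

def retry_kwargs(backoff_factor: float = 0.0) -> dict:
    # crate-python currently hardcodes Retry(read=0); exposing backoff_factor
    # as a parameter would let clients like QL wait between retries.
    return {'retries': Retry(read=0, backoff_factor=backoff_factor)}
```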
So, as per my understanding, we need to change the Crate library code to implement this; that means raising an issue with the Crate GitHub community so we can then use it in the QL code.
"for that crate code needs to be changed"

Sorry, I missed that. Well, then I suppose there's nothing more we can do about this issue, unless you have other ideas?
Thank you a stack for the great sleuthing!
Hi @c0c0n3, as per my understanding, I suggest we raise this issue in the CrateDB community and fix it there. If that looks OK, please confirm. Thank you :)
Hi @daminichopra, yes agreed, good idea. I'm going to keep this issue open so that when a fix becomes available in Crate we can implement the configuration in our translator. If you could be so kind as to add a link to the Crate issue. Thank you so much!
@daminichopra super duper, thank you soooo much! see also #256 and PR #300
@chicco785, please close this issue as the PR is merged and closed. Thank you
Is your feature request related to a problem? Please describe.
On receiving a notification, if the DB isn't immediately available, QL bombs out since we can't do the insert.

Describe the solution you'd like
Implement retries when the DB isn't available (e.g. overloaded) or we can't connect to it.

Describe alternatives you've considered
Use message queues as a back-pressure mechanism to safely buffer data and implement retries.
Additional context
We get quite a number of 500s from Crate when trying to insert data under heavy load.