orchestracities / ngsi-timeseries-api

QuantumLeap: a FIWARE Generic Enabler to support the usage of NGSIv2 (and NGSI-LD experimentally) data in time-series databases
https://quantumleap.rtfd.io/
MIT License

Retry DB inserts when DB isn't available #242

Closed c0c0n3 closed 3 years ago

c0c0n3 commented 5 years ago

Is your feature request related to a problem? Please describe.
On receiving a notification, if the DB isn't immediately available, QL bombs out since we can't do the insert.

Describe the solution you'd like
Implement retries when the DB isn't available (e.g. overloaded) or we can't connect to it.

Describe alternatives you've considered
Use message queues as a back-pressure mechanism to safely buffer data and implement retries.

Additional context
We get quite a number of 500s from Crate when trying to insert data under heavy load.

daminichopra commented 4 years ago

Hi @c0c0n3 , if no one is working on this then please assign this issue to me. Thanks

daminichopra commented 4 years ago

Hi @c0c0n3, I am trying to reproduce this issue: when CrateDB is not available, the client logs a warning, retries the connection, and gives a 500 response.

As per my understanding, if CrateDB is not available it will keep retrying to connect to the DB until a connection is available. Please confirm my understanding. Thanks :)

c0c0n3 commented 4 years ago

@daminichopra thanks for this. Looking at the trace you posted, I suppose you're right. It looks like the Crate HTTP client already implements retries through urllib3. I could see no timestamps in the trace; can you check how far apart those retries are?

daminichopra commented 4 years ago

Hi @c0c0n3 , the retry timeout is set to socket._GLOBAL_DEFAULT_TIMEOUT, which resolves to None, and a value of None means that new socket objects have an infinite timeout. Please check the URLs below for the same.
https://github.com/urllib3/urllib3/blob/4903840bf36a05bcc8299f6553ff7a1816d4aa63/src/urllib3/connection.py#L236
https://github.com/urllib3/urllib3/blob/d369183f60749d735f0676efb22ecd64fc694768/src/urllib3/util/timeout.py#L93
https://www.cmi.ac.in/~madhavan/courses/prog2-2015/docs/python-3.4.2-docs-html/library/socket.html
https://kite.com/python/docs/urllib3.Timeout
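
Just to make that default concrete, here is a tiny sketch (assuming the urllib3 1.x code linked above; the explicit values are made up for illustration) of what it means and what a bounded timeout would look like:

```python
import socket
from urllib3.util.timeout import Timeout

# socket._GLOBAL_DEFAULT_TIMEOUT tells urllib3 to fall back to the socket
# module's default timeout, which is None unless overridden, i.e. new
# sockets block indefinitely on connect and read.
print(socket.getdefaulttimeout())  # None

# An explicit Timeout bounds the TCP handshake and socket reads instead.
bounded = Timeout(connect=2.0, read=10.0)
print(bounded)  # Timeout(connect=2.0, read=10.0, total=None)
```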

c0c0n3 commented 4 years ago

Hi @daminichopra, thanks for the detailed analysis! :-)

Going to collect all we know about the issue at this stage down below.

Crate client

  1. The client keeps a pool of Crate servers.
  2. Each server in the pool gets a chance to establish a connection.
  3. On failure, the client retries 10 times per server.

Connection & socket timeouts

  4. The Crate client uses urllib3 under the bonnet to connect to Crate.
  5. By default urllib3 waits indefinitely to establish a TCP connection with Crate. (I think this refers to the three-way TCP handshake: if the server process isn't listening on the port, the lib returns immediately with an error.)
  6. By default urllib3 waits indefinitely between reading two TCP packets. A sketch of how these waits could be bounded follows below.
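
For illustration only, a minimal sketch (plain urllib3 rather than the Crate client's own wiring; host name, port and values are made up) of how those waits and retries could be bounded:

```python
import urllib3
from urllib3.util.retry import Retry
from urllib3.util.timeout import Timeout

# A retry budget in the spirit of (3) plus explicit caps on the
# otherwise-indefinite connect/read waits from (5) and (6).
retries = Retry(connect=10, read=0, backoff_factor=0.1)
timeout = Timeout(connect=5.0, read=30.0)

# Hypothetical pool for a single CrateDB HTTP endpoint; the real client
# keeps a pool of servers and rotates through them on failure.
pool = urllib3.HTTPConnectionPool("crate-1.example.org", port=4200,
                                  retries=retries, timeout=timeout)
```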

Mitigating Crate DB unavailability

(5) and (6) mean a slow server in the pool would slow the client down too, since the client wouldn't try connecting to the next server in the HA pool. While this will probably require some tuning in the future, I'd leave (5) and (6) out for now, since it doesn't look like those timeout settings could have caused the 500s we experienced in prod.

Regarding (3) though, do you know how far apart those retries are? E.g.

  1. client first attempt to connect fails
  2. wait 100ms
  3. client second attempt to connect fails
  4. wait 200ms
  5. ...

Looking at the trace you posted earlier, this should happen in the Crate client?

daminichopra commented 4 years ago

Hi @c0c0n3 , I have investigated this further. Initially the client retries connecting to the server (10 times in total), which takes approximately 12 milliseconds.

Crate Client

  1. The time between retries is controlled by backoff_factor, which sets the interval between consecutive retry attempts. It is passed where the retry policy is built: kwargs['retries'] = Retry(read=0, backoff_factor=0.1)

  2. By default, backoff_factor is disabled (set to 0).

  3. If I set backoff_factor = 0.1, the observed retry intervals are:

    • client first attempt to connect fails
    • wait 0.0 s
    • client second attempt to connect fails
    • wait 0.2 s
    • client third attempt to connect fails
    • wait 0.4 s

  4. If I set backoff_factor = 0.2, the observed retry intervals are:

    • client first attempt to connect fails
    • wait 0.0 s
    • client second attempt to connect fails
    • wait 0.3 s
    • client third attempt to connect fails
    • wait 0.6 s
    • ...

Please refer to the links below:
https://github.com/urllib3/urllib3/blob/a9776d15013a7a4f2b92e4d7d1be2b5fe18d43d4/src/urllib3/util/retry.py#L114
https://github.com/urllib3/urllib3/blob/a9776d15013a7a4f2b92e4d7d1be2b5fe18d43d4/src/urllib3/util/retry.py#L170

  5. The computed backoff is capped at a maximum of 120 seconds (Retry.BACKOFF_MAX). A minimal sketch of the resulting schedule follows below.
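
To make the schedule concrete, here is a minimal sketch of the backoff formula urllib3 documents for Retry (the numbers are computed from the formula, so they may differ slightly from the measured intervals above, and the exact behaviour can vary between urllib3 versions):

```python
# Sketch of urllib3's documented backoff schedule for Retry:
#   sleep = backoff_factor * (2 ** (consecutive_failures - 1))
# with no sleep before the first retry and a hard cap of BACKOFF_MAX.

BACKOFF_MAX = 120  # urllib3 caps any computed backoff at 120 seconds


def backoff_time(backoff_factor, failures_so_far):
    """Seconds to sleep before the next retry attempt."""
    if failures_so_far <= 1:
        return 0.0
    return min(BACKOFF_MAX, backoff_factor * (2 ** (failures_so_far - 1)))


for factor in (0.1, 0.2):
    waits = [backoff_time(factor, n) for n in range(1, 5)]
    print(f"backoff_factor={factor}: {waits}")
# backoff_factor=0.1: [0.0, 0.2, 0.4, 0.8]
# backoff_factor=0.2: [0.0, 0.4, 0.8, 1.6]
```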

c0c0n3 commented 4 years ago

Hi @daminichopra this is great! Excellent investigation, you're a star. I suppose then all we need to do for this issue is to tweak the backoff param. Do you think we could make this configurable, e.g. through an env variable, so that different QL deployments could specify different backoff values?

c0c0n3 commented 4 years ago

Had a quick look at the code base: we've got util classes to read env vars (see util/cfgreader). Perhaps you could do something similar to PostgresConnectionData (in translators/timescale), which uses EnvReader from util/cfgreader.
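
Something along these lines could work (a rough sketch only: the CRATE_BACKOFF_FACTOR variable name is made up, and the real implementation would go through EnvReader in util/cfgreader rather than raw os.environ):

```python
import os


def crate_backoff_factor(default=0.0):
    """Read a retry backoff factor for the Crate translator from the environment.

    CRATE_BACKOFF_FACTOR is a hypothetical variable name used for illustration;
    the actual code would read it with the EnvReader helper from util/cfgreader.
    """
    try:
        return float(os.environ.get('CRATE_BACKOFF_FACTOR', default))
    except ValueError:
        return default
```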

daminichopra commented 4 years ago

Hi @c0c0n3 , as per the flow of the code, a backoff_factor parameter would need to be added to the Crate client code. QL makes the connection to the server here:
https://github.com/smartsdk/ngsi-timeseries-api/blob/dc565af24b303a94f7c298b2567e62487088de3b/src/translators/crate.py#L64
which in turn calls methods of http.py (part of the crate library), where no 'backoff_factor' parameter is defined. Please refer to the links below:
https://github.com/crate/crate-python/blob/a2794ac0a7ce358b576eae0b0629e9e974e5472d/src/crate/client/http.py#L397

https://github.com/crate/crate-python/blob/a2794ac0a7ce358b576eae0b0629e9e974e5472d/src/crate/client/http.py#L95

QL will only wait between retries if backoff_factor is passed where the client builds kwargs['retries'] = Retry(read=0), and for that the Crate code needs to be changed.
https://github.com/crate/crate-python/blob/a2794ac0a7ce358b576eae0b0629e9e974e5472d/src/crate/client/http.py#L397

So, as per my understanding, we need to change the Crate library code to implement this, which means raising an issue with the Crate GitHub community before it can be wired into the QL code.
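
For what it's worth, the change on the Crate client side would be roughly along these lines (a sketch only; the helper name below is made up, and backoff_factor would still have to be threaded through from the client's constructor down to where the pool kwargs are built in http.py):

```python
from urllib3.util.retry import Retry


def build_pool_kwargs(backoff_factor=0.0, **kwargs):
    # Today the Crate client effectively hard-codes Retry(read=0); passing
    # backoff_factor through would make the spacing between retries tunable.
    kwargs['retries'] = Retry(read=0, backoff_factor=backoff_factor)
    return kwargs
```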

c0c0n3 commented 4 years ago

for that the Crate code needs to be changed

Sorry I missed that. Well, then I suppose there's nothing more we can do about this issue, unless you have other ideas?

Thank you a stack for the great sleuthing!

daminichopra commented 4 years ago

Hi @c0c0n3 , as per my understanding, I suggest we raise this issue with the CrateDB community and fix it there. If that looks OK, please confirm. Thank you :)

c0c0n3 commented 4 years ago

Hi @daminichopra, yes, agreed, good idea. I'm going to keep this issue open so that when a fix becomes available in Crate we can implement the configuration in our translator. If you could be so kind as to add a link to the Crate issue, that would be great. Thank you so much!

daminichopra commented 4 years ago

Hi @c0c0n3 , I have fixed the code for the above issue in the CrateDB community and it has now been merged into their code base, but the fix targets version 4.0.2 while QL is using version 3.1.2. So, once version 4.0 of CrateDB is included in QL, I'll raise the PR for issue #242. Thank you :)

c0c0n3 commented 4 years ago

@daminichopra super duper, thank you soooo much! see also #256 and PR #300

github-actions[bot] commented 3 years ago

Stale issue message

daminichopra commented 3 years ago

@chicco785, please close this issue as the PR has been merged and closed. Thank you