golemfactory / concent

Repository for Concent Service sources

SCI Callback to Middleman connection retry #850

Closed PaweuB closed 5 years ago

PaweuB commented 6 years ago

First, investigate what happens when gunicorn terminates the request due to a timeout, i.e. does it kill the Django process, or does it let it finish but just terminate the connection and not send the response to the client? After doing this, please add a comment in this issue with the result.

Then add a timeout to the SCI callback when sending a request to Middleman, defined as a new Django setting.

In case of a timeout, the operation should be retried SCI_CALLBACK_RETRIES times, where SCI_CALLBACK_RETRIES is a Django setting. Setting it to zero means there should be no retry. Setting it to one or more means that the whole operation should be retried that many times in case of an exception. Of course, each retry should also be subject to the timeout.

If the timeout is still reached on the last retry, raise a custom exception.

Please remember to log info that the call is being retried. Please also remember that only sending the request to Middleman counts towards the timeout and only this should be retried.
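
A minimal sketch of the behaviour described above. Only SCI_CALLBACK_RETRIES comes from this description; send_request, SCI_CALLBACK_MIDDLEMAN_TIMEOUT, the exception class and the socket.timeout error type are placeholders and depend on how the Middleman client is actually implemented:

import logging
import socket

from django.conf import settings

logger = logging.getLogger(__name__)


class SCICallbackMiddlemanTimeoutError(Exception):
    """Hypothetical custom exception raised when all retries time out."""


def call_middleman_with_retries(send_request, payload):
    # One initial attempt plus SCI_CALLBACK_RETRIES retries; zero means no retry.
    attempts = 1 + settings.SCI_CALLBACK_RETRIES
    for attempt in range(1, attempts + 1):
        try:
            # Only this call is covered by the timeout and only it is retried.
            return send_request(payload, timeout=settings.SCI_CALLBACK_MIDDLEMAN_TIMEOUT)
        except socket.timeout:
            if attempt < attempts:
                logger.info('Middleman call timed out, retrying (attempt %d of %d).', attempt, attempts)
            else:
                raise SCICallbackMiddlemanTimeoutError('Middleman did not respond within the configured timeout.')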

PaweuB commented 6 years ago

Pivotal: https://www.pivotaltracker.com/story/show/160754318

cameel commented 6 years ago

Please don't create "placeholder" issues. We have Pivotal for that. GitHub is for when the issue needs a longer description and only when that description can already be provided. If the only thing we know is that a task needs to be done, but we're yet to decide when or how, adding it in Pivotal is all we need.

rwrzesien commented 5 years ago

@cameel Should reaching SigningService and increasing its daily threshold info be counted as "irreversible operation"?

cameel commented 5 years ago

@rwrzesien

1)

> In case of those exceptions, the operation should be retried SCI_CALLBACK_RETRIES times

I think that having a timeout would be better, i.e. keep retrying until the time runs out. This is because we have a timeout on requests and after that time the request is killed. It would be better to be able to set the callback timeout to the request timeout minus one second, to avoid having the whole request killed (which results in the client not even receiving a response).

2) If I remember correctly, when we talked about it, we agreed that you'll investigate what happens when gunicorn terminates the request due to a timeout. I.e. does it kill the Django process or does it let it finish but just terminates the connection and does not send the response to the client?

This should be included here too.

3)

> Should reaching SigningService and increasing its daily threshold info be counted as "irreversible operation"?

Yes. Anything that contacts a remote service that's not read-only should be considered irreversible.

In this particular situation I'm more worried about middleman's internal state or extensions we might add in SS than the counter. Increasing the counter and then not actually making the payment is not a big deal. The limit is not an exact value anyway. But I think communication with SS should be treated as irreversible because it might start doing other things.

rwrzesien commented 5 years ago

@cameel 2.

If I remember correctly, when we talked about it, we agreed that you'll investigate what happens when gunicorn terminates the request due to a timeout. I.e. does it kill the Django process or does it let it finish but just terminates the connection and does not send the response to the client?

I have added to the issue description that this should be investigated first, and then, depending on the outcome of this investigation, the rest of the issue might be redefined. Is that ok?

3.

In this particular situation I'm more worried about middleman's internal state or extensions we might add in SS than the counter. Increasing the counter and then not actually making the payment is not a big deal. The limit is not an exact value anyway. But I think communication with SS should be treated as irreversible because it might start doing other things.

But we are not able to determine if the request reached SigningService or not. This means that every request sent by the Callback should be treated as irreversible, so we should never retry.

cameel commented 5 years ago

I have added to the issue description that this should be investigated first, and then, depending on the outcome of this investigation, the rest of the issue might be redefined. Is that ok?

I don't think we'll need to redefine it based on the outcome. We should retry either way. If we discover behavior that could break stuff we were not aware of until now, that's a separate concern. Create separate issues for that.

But we are not able to determine if the request reached SigningService or not. This means that every request sent by the Callback should be treated as irreversible, so we should never retry.

You're right. We cannot treat it that way. Concent should treat SS as (mostly) stateless because otherwise it would not be safe to retry. So let's say that operations are not reversible but since from Concent's point of view they leave no trace, there's no need to reverse them.

rwrzesien commented 5 years ago

@cameel Description updated. I have left the possibility to define the number of retries too; it should not be difficult and will give an extra option to increase/decrease the number of retries through settings.

cameel commented 5 years ago

OK. Having both options is fine as long as you can decide not to use one of them by setting it to None.

cameel commented 5 years ago

Then add a timeout to the SCI callback when sending a request to Middleman. The timeout value should be calculated as timeout_on_requests - 1 (check with the deployment team if the value is available in Django code; if not, we might think of a way to use the same value in both places without the need to update it in both places).

I'd just create a simple setting for the timeout. The fact that we prefer to set it to request timeout - 1 is up to the admin. We'll set it to that in our deployment configs. It should still be possible to set a different timeout. I don't think we should try to do magic like inspecting gunicorn options.
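
For illustration only, a deployment config might then contain something like the snippet below. The timeout setting name and all values here are placeholders, not decisions:

# Hypothetical example values; actual names and numbers are up to the deployment.
SCI_CALLBACK_RETRIES = 3               # 0 disables retries entirely
SCI_CALLBACK_MIDDLEMAN_TIMEOUT = 29    # seconds; e.g. gunicorn --timeout 30, minus 1

As agreed above, either mechanism could also be disabled by setting it to None.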

I'm adding the "sync with deployment" label because after implementing it you should let @bartoszbetka know that he has to provide values for the new settings.

rwrzesien commented 5 years ago

Updated the description: the timeout is now defined as a new Django setting.

dybi commented 5 years ago

@cameel, @rwrzesien I still see the SCI_CALLBACK_RETRIES in the main description and still don't see the timeout definition. Has the main description actually been updated? ;)

rwrzesien commented 5 years ago

@dybi

I still see the SCI_CALLBACK_RETRIES

As discussed above, we agreed to keep both.

still don't see the timeout definition

It is already there: "Then add a timeout to the SCI callback when sending a request to Middleman, defined as a new Django setting."

The setting name is up to the implementer.

Jakub89 commented 5 years ago

@cameel @rwrzesien

Gunicorn just restarts the Concent worker with:

[2019-01-15 17:28:13 +0100] [20935] [CRITICAL] WORKER TIMEOUT (pid:20938)
[2019-01-15 16:28:13 +0000] [20938] [INFO] Worker exiting (pid: 20938)
[2019-01-15 17:28:13 +0100] [20976] [INFO] Booting worker with pid: 20976

So the rest of the code after time.sleep() was not executed. From the client side, we get:

ERROR: Failed connect to the server.

('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
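
For reference, a minimal way to reproduce this (hypothetical view, not part of any PR) is a handler that sleeps past gunicorn's --timeout and logs whatever interrupts the sleep:

import logging
import time

from django.http import HttpResponse

logger = logging.getLogger(__name__)


def slow_view(request):
    try:
        # Sleep longer than gunicorn's --timeout so the worker is killed mid-request.
        time.sleep(120)
    except BaseException as exception:
        # Under gunicorn the sleep is interrupted; this log line shows by what.
        logger.info('sleep interrupted by %r', exception)
        raise
    return HttpResponse('reached only if the worker was not killed')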

cameel commented 5 years ago

@Jakub89 You did not reveal a key piece of information. You determined that the sleep() instruction gets interrupted by SystemExit(1). That's the mechanism that gunicorn uses to kill the worker.

So it probably just sends it SIGTERM. This means that it's not a "hard" kill. When you caught the exception the worker did not exit. This means that finally blocks and any other cleanup mechanisms will work just fine.

So we simply need to ensure that we're prepared for SystemExit anywhere inside the callback, and that we leave the database in a consistent state if that happens.
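
A minimal sketch of that idea, assuming a hypothetical wrapper around the callback's processing function; transaction.atomic() rolls back the database changes if SystemExit interrupts the block:

import logging

from django.db import transaction

logger = logging.getLogger(__name__)


def run_callback_atomically(process, payload):
    # `process` stands in for the actual callback logic (hypothetical here).
    try:
        with transaction.atomic():
            # All database writes happen inside the atomic block, so if gunicorn
            # raises SystemExit to stop the worker, the transaction is rolled back
            # and the database stays consistent.
            return process(payload)
    except SystemExit:
        # SystemExit is a normal exception, not a hard kill, so cleanup code like
        # this still runs before the worker exits.
        logger.warning('SCI callback interrupted by worker shutdown; database changes rolled back.')
        raise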

Jakub89 commented 5 years ago

I've implemented this task a little differently than the issue suggested: PR

In case of a timeout, the operation should be retried SCI_CALLBACK_RETRIES times, where SCI_CALLBACK_RETRIES is a Django setting. Setting it to zero means there should be no retry. Setting it to one or more means that the whole operation should be retried that many times in case of an exception. Of course, each retry should also be subject to the timeout.

The way @dybi explained it to me is that we already have a timeout in core/constants.py, in the form of:

# Defines how many seconds should SCI callback wait for response from MiddleMan.
SCI_CALLBACK_MAXIMUM_TIMEOUT = 30

And now, how many retries happen depends on how long SCI_CALLBACK_RETRIES_TIME is; in other words, on how many times SCI_CALLBACK_MAXIMUM_TIMEOUT fits inside SCI_CALLBACK_RETRIES_TIME. So with those two time settings we can control the number of retries. We should probably lower SCI_CALLBACK_MAXIMUM_TIMEOUT to something like 4-5 seconds and have SCI_CALLBACK_RETRIES_TIME = 30, which gives 6-8 retries. What do you guys think?
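
A rough sketch of that scheme, assuming SCI_CALLBACK_RETRIES_TIME becomes a Django setting holding the overall budget in seconds; the Middleman client call and the socket.timeout error type are again placeholders:

import logging
import socket
import time

from django.conf import settings

from core.constants import SCI_CALLBACK_MAXIMUM_TIMEOUT

logger = logging.getLogger(__name__)


def call_middleman_within_budget(send_request, payload):
    # The number of retries is implicit: roughly how many times
    # SCI_CALLBACK_MAXIMUM_TIMEOUT fits into SCI_CALLBACK_RETRIES_TIME.
    deadline = time.monotonic() + settings.SCI_CALLBACK_RETRIES_TIME
    attempt = 0
    while True:
        attempt += 1
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError('Could not reach MiddleMan within SCI_CALLBACK_RETRIES_TIME.')
        try:
            # A single attempt waits at most SCI_CALLBACK_MAXIMUM_TIMEOUT and never
            # past the overall deadline.
            return send_request(payload, timeout=min(SCI_CALLBACK_MAXIMUM_TIMEOUT, remaining))
        except socket.timeout:
            logger.info('MiddleMan call timed out on attempt %d, retrying.', attempt)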