[Blueprint] Concent state machine

cameel commented 6 years ago

Concent states

Concent server can be thought of as a state machine with regard to any specific subtask. Upon receiving a message in the send/ endpoint the processing depends only the current state and the content of the message submitted by the client.

Here's a diagram that shows all possible Concent states and transitions between them.

concent-states-detailed

Note that the states we're talking about here are Concent states, not subtask/task states. This means that the state reflects Concent's knowledge about the subtask and not necessarily what is happening to it between the provider and requestor. For example if the provider triggers the 'force report computed task' use case, requestor acknowledges it and none of them interact with Concent after that, the subtask stays in the REPORTED state even though it may have already been accepted or rejected.

To reflect this we can classify the states into two categories:

active states (blue) - states in which the subtask can remain only for a limited time. Concent always eventually exits such a state. Either in response to a message from a client or after a timeout.
passive states (yellow) - states that indicate that Concent is not processing any use case for the subtask. A subtask can remain in such a state indefinitely.

Another thing of note is that there are no states for 'forced payment' use case. This use case operates on multiple subtasks and the response is returned immediately so there's no state change here.

Subtask table

To keep track of subtask states Concent uses database tables with the following structure:

`Subtasks`

Column name	Type	Remarks
`task_id`	string	All (`requestor_public_key`, `task_id`) pairs must be globally unique. Must not be `NULL` or blank. A task can have multiple subtasks. Concent always operates on individual subtasks separately. The task ID is kept only for reference.
`subtask_id`	string	All (`requestor_public_key`, `subtask_id`) pairs must be globally unique. Must not be `NULL` or blank.
`requestor`	foreign key to `Client`	Must not be `NULL`
`provider`	foreign key to `Client`	Must not be `NULL`
`state`	enum (see state diagram)	The current Concent state. There is no enumeration value for `UNKNOWN` because it represents a state in which Concent does not know about subtask's existence.
`next_deadline`	timestamp	If in an active state, it's the time at which Concent automatically transitions to a passive state if it does not get the information it expects from the client (a timeout). Must be `NULL` in a passive state.
`computation_deadline`	timestamp	Same as `task_to_compute.deadline`. Must not be `NULL`.
`result_package_size`	int	Same as `report_computed_task.package_size`. `NULL` if and only if `report_computed_task` is `NULL`.
`task_to_compute`	foreign key to `StoredTaskToCompute`	Must not be `NULL`. A subtask always has a `TaskToCompute` associated with it.
`report_computed_task`	foreign key to `StoredReportComputedTask`	Computation report created by the provider. Must not be `NULL`.
`ack_report_computed_task`	foreign key to `StoredAckReportComputedTask`	Subtask acknowledgement to be passed to the provider, either submitted by the requestor or generated by Concent. May be `NULL` if not submitted or generated yet.
`reject_report_computed_task`	foreign key to `StoredRejectReportComputedTask`	Subtask report rejection to be passed to the provider, either submitted by the requestor or generated by Concent. May be `NULL` if not submitted or generated yet.
`subtask_results_accepted`	foreign key to `StoredSubtaskResultsAccepted`	Subtask acceptance to be passed to the provider, either submitted by the requestor or generated by Concent. May be `NULL` if not submitted or generated yet.
`subtask_results_rejected`	foreign key to `StoredSubtaskResultsRejected`	Subtask rejection to be passed to the provider, either submitted by the requestor or generated by Concent. May be `NULL` if not submitted or generated yet.

`Clients`

Column name	Type	Remarks
`public_key`	string (base64-encoded public key)	Must not be `NULL` or blank.

StoredTaskToCompute, StoredReportComputedTask, StoredAckReportComputedTask, StoredRejectReportComputedTask, StoredSubtaskResultsAccepted, StoredSubtaskResultsRejected are all tables storing the content of the corresponding Golem messages. Each table has columns for timestamp, signature and all payload fields. Nested data is represented in a relational way (e.g. nested messages should be foreign keys to the corresponding tables). We want to be able to easily and efficiently use message data in queries. We don't want to have to fetch and decode each message individually. We also want to strictly enforce the schema so that we can easily detect changes and provide data migrations for old data.

State transitions

State transitions can happen in two ways:

In response to a message submitted to the send/ endpoint.
Automatically, after a timeout.

Automatic transitions require Concent to actively check subtask states in the background and take actions when the time is up. Note however that even if Concent notices a timeout, the client won't until it receives a response. Due to the nature of the current implementation (Concent is passive and clients query the receive/ endpoint to get responses) it's possible to achieve the same effect (from client's standpoint) by postponing timeout checks until the client actually asks. That's the approach we're taking for now.

So this means that automatic transitions can happen whenever the client contacts one of the three message-related endpoints: send/, receive/ or receive-out-of-band/. Before doing any processing in these endpoints Concent runs a function that checks timeouts for all subtasks in which the client is involved, updates the subtask table and adds responses to the receive queues.

When a timeout is detected Concent may have to perform additional steps to determine the outcome. For example it may have to check if all necessary files have been uploaded. The response added to the queue can depend on the result but once it's in the queue it won't be changed or be automatically discarded.

Receive queues

Each client has two queues associated with it:

receive queue
receive-out-of-band queue

Each of them stores information about messages to be returned from the corresponding endpoint. Messages are generated from that data on the fly so that the timestamp is always current. This applies only to the outer messages generated by Concent. If there are nested messages submitted from the outside, their timestamps and signatures are guaranteed to be preserved.

Every queue is associated with a single client and can contain messages associated with any of its subtasks.

Messages are added to the queue as a result of state transitions (see above). Once queued they do not undergo any further processing. They just sit in the queue until the authorized client fetches them.

Messages are returned in the order of their timestamps (oldest first). Note that, due to the nature of the HTTP protocol, if a client fetches multiple messages in parallel the order in which they reach it is not in any way guaranteed even if the server returns them in a specific order.

Messages are never discarded from the queue, even after state transitions. It's the responsibility of the client to decide whether a message is still relevant or not.

A message is returned only once and then marked as delivered. Delivered messages may remain stored on the server for use in further use cases or for audit purposes but there's no way for the client to fetch them again.

Queue internals

The queues are implemented using a single database table with locks to ensure that the process of retrieving a message is atomic.

The structure of the table is as follows:

`ReceiveQueue`

Column name	Type	Remarks
`response_type`	`ConcentResponseType` enum
`client`	foreign key to `Client`
`queue`	enum: `receive` or `receive-out-of-band`
`delivered`	bool	`TRUE` if the client has already fetched the message.
`subtask`	foreign key to `Subtask`
`created_at`	timestamp	The time at which the message has been added to the queue. This is not the timestamp of the final message. The timestamp should be the same as the time at which the Golem message object is created, just before delivery.

All the columns must not be NULL.

`ConcentResponseType` enumeration

ForceReportComputedTask
ForceReportComputedTaskResponse
VerdictReportComputedTask
ForceGetTaskResultFailed
ForceGetTaskResultUpload
ForgetGetTaskResultDownload
ForceSubtaskResults
SubtaskResultsSettled
ForceSubtaskResultsResponse
SubtaskResultsRejected
ForcePaymentCommitted

Note that this list contains only those messages that can be added to the receive queue, not all possible messages. Messages that are always returned synchronously from the send/ endpoint are not included here.

The type and the data stored in the Subtask table is everything that's needed to create a response. For example if the response is a wrapper for the actual response passed from the requestor, the link to requestor's response is stored in the Subtask table. If the content of the response depends on the result of some external checks done after a timeout, this result should be stored in Subtasks. The Subtasks table (along with the related tables) is the central place to store everything we need to create a complete response. The information should be stored in a relational form - in database columns - rather than inside serialized Golem message objects.

Message storage

Eventually every Golem message handled by Concent should have a corresponding database table. In short term only the most often used messages will get their own tables and the rarely used ones will still be stored as binary blobs in the StoredMessage table.

Message tables are meant to store all messages no matter what their purpose is. Other mechanisms like subtask table, receive queues or audit storage can build on that by referencing them with foreign keys. None of them should make an assumption that stored messages are for their exclusive use.

rwrzesien commented 6 years ago

@cameel Few quick questions after first reading:

Currently when message is immediately rejected and there is a rejection message send back to the client in same HTTP request we don't store this message in any way. Should this now be stored ?
I think that adding column for a message type in Subtask table (ie. report_computed_task as foreign key to StoredReportComputedTask) is not scalable and normalized. Instead of that we may add subtask_id column on StoredReportComputedTask table etc (actually to their abstract).
"Another thing of note is that there are no states for 'forced payment' use case. This use case operates on multiple subtasks and the response is returned immediately so there's no state change here." - just a note, in full flow it returns two messages, first for provider that is returned as response for request, second is waiting for requestor in receive-out-of-band. But I don't think that changes anything in this blueprint.

cameel commented 6 years ago

Currently when message is immediately rejected and there is a rejection message send back to the client in same HTTP request we don't store this message in any way. Should this now be stored ?

Eventually yes. Right now - maybe.

In the code as implemented currently we're storing them but that's because the table we have was meant to be the receive queue and contain everything sent to the client (marked as delivered). As you see above, the design changed a bit - now the receive queue is not storing actual messages but the information needed to create a messages when a client asks for it. It also only covers messages returned from receive/ and not all possible messages. So we're not storing them in the receive queue any more.

We do want to store them for audit purposes (and maybe for debugging). We'll eventually want to add all sent and received messages to StoredMessage (or a table for specific mesage type if one exists) and have another table with foreign keys to them that says who received/sent it and maybe some other metadata.

We don't have such a table right now and there's no task to do it yet (there's only #47 which is only about logging) so right now you could store them in StoredMessage if you think this will help us with debugging when something goes wrong in production but it's not absolutely required.

I think that adding column for a message type in Subtask table (ie. report_computed_task as foreign key to StoredReportComputedTask) is not scalable and normalized. Instead of that we may add subtask_id column on StoredReportComputedTask table etc (actually to their abstract).

Each task is associated with a small and strictly defined number of messages that have a special meaning. One TaskToCompute, one ReportComputedTask, one or zero AckReportComputedTask and so on. Say, for example, the requestor sends us two different AckReportComputedTask instances. Concent obviously rejects the second one but we're still going to want to store it for audit purposes. The subtask table needs to indicate which one is the one that was accepted by Concent. That's what this column is for. It's not saying "this message is associated with this task". It says "this message is the one valid and accepted AckReportComputedTask that will be resent to the provider and used in all further use cases".

Also note that we're not going to add a column for every message associated with the subtask. Some messages are more important than others. The columns I added there are not just examples. These are the only messages that a subtask needs to know about. So basically TaskToCompute and ReportComputedTask (because they're used for auth and to determine deadlines) plus the confirmations/rejections that we need to pass between the provider and requestor. That's all.

It's not a scalability problem because it's not a replacement for the subtask_id column in stored messages (which should be be added there too). For the vast majority subtask_id is enough. For some specific messages we have to choose one of many and that's why we have those 1:1 columns (they should be OneToOneField of course).

And to be completely clear - I see StoredMessage (and the other message-specific tables like StoredTaskToCompute) as a dumb message table where the message object does not know why it's being stored. The mutated super subtask-receive-queue table that we have now needs to go. A dumb message storage makes the system more extensible because this way we can tack on various independent special-purpose tables (subtask table, receive queue, audit table) that just refer to already stored messages and give them meaning.

"Another thing of note is that there are no states for 'forced payment' use case. This use case operates on multiple subtasks and the response is returned immediately so there's no state change here." - just a note, in full flow it returns two messages, first for provider that is returned as response for request, second is waiting for requestor in receive-out-of-band. But I don't think that changes anything in this blueprint.

Yeah, the blueprint skims over the payments because they don't need a separate state. I assume that payment requests can come at any moment and are processed immediately. Concent just responds synchronously and maybe adds messages to the receive queue. It does not need to do any long calculations or wait for any client to send or upload anything so there's no state associated with that.

One change is needed though. ForcePaymentCommitted needs to be added to the list of message types supported by the receive queue. I'm adding it to the description. This list is meant to be exhaustive. The receive queue should not support messages other than those listed there.

BTW: if there are any fields that need to be added or changed in the tables please consult with me first. I think the ones listed here should be enough but if you find something that's missing please let me know before you implement it.

cameel commented 6 years ago

One more thing. Remember that subtask_id in the message table(s) cannot be a foreign key. We can always receive a message for which a subtask does not exist and we may still want to store that message.

rwrzesien commented 6 years ago

@cameel Ad 1) I think we are talking about two different things here. What I have meant is ie. this piece of code:

    if client_message.report_computed_task.task_to_compute.compute_task_def['deadline'] < current_time:
        logging.log_timeout(
            client_message,
            request.META['HTTP_CONCENT_CLIENT_PUBLIC_KEY'],
            client_message.report_computed_task.task_to_compute.compute_task_def['deadline'],
        )
        reject_force_report_computed_task                 = message.RejectReportComputedTask(
            header = MessageHeader(
                type_     = message.RejectReportComputedTask.TYPE,
                timestamp = client_message.timestamp,
                encrypted = False,
            )
        )
        reject_force_report_computed_task.reason          = message.RejectReportComputedTask.REASON.TaskTimeLimitExceeded
        reject_force_report_computed_task.task_to_compute = client_message.report_computed_task.task_to_compute
        return message.concents.ForceReportComputedTaskResponse(
            reject_report_computed_task = reject_force_report_computed_task
        )

This happens on communication start, so there is no Subtask created yet. And we currently don't store anything when this flow happens. So the question is - should we store it now, like in this example, create Subtask object and related StoredRejectReportComputedTask object ?

Jakub89 commented 6 years ago

@cameel In Subtask Table, You've said that report_computed_task allow Null value if Subtask is in "FORCING REPORT" State.

cameel commented 6 years ago

State diagram updated.

Timeout for the FORCING ACCEPTANCE state was missing
There were two ADDITIONAL VERIFICATION states. The one above should have been called VERIFICATION FILE TRANSFER

Here's the old diagram for reference: concent-states-detailed

cameel commented 6 years ago

@cameel In Subtask Table, You've said that report_computed_task allow Null value if Subtask is in "FORCING REPORT" State.

Yes. It can be NULL in the FORCING REPORT state. Description updated.

cameel commented 6 years ago

State diagram updated. I have added a missing state transition for the case when Concent overrules requestor's decision in the FORCING REPORT state.

cameel commented 6 years ago

@rwrzesien

Ad 1) I think we are talking about two different things here.

No, we're not. I was referring specifically to this. I originally wanted everything to be stored in the queue. Including the messages created in send/. But at the beginning there were no synchronous responses so it was satisfied in a trivial way - there just weren't any ;) Later they should have been stored but this just was not changed because the implementation went its own way.

This happens on communication start, so there is no Subtask created yet. And we currently don't store anything when this flow happens. So the question is - should we store it now, like in this example, create Subtask object and related StoredRejectReportComputedTask object ?

As I said, the design has changed. Now we do not want to store them in the queue. Queue is only for stuff returned from receive/. But we may want to store them for audit or debugging. That's a separate thing. Those messages whould be just inserted into the message table without anything linking to them. You may do this right now if you find it useful. If not, we'll probably do it at some point.

Jakub89 commented 6 years ago

@cameel I think we made a mistake with report_computed_task being Null in FORCING_REPORT state. Looks like it should be Null in FORCING_ACCEPTANCE instead. At least that's the case in golem-messages v1.12.

cameel commented 6 years ago

No, you can't force the requestor to accept the task if you have not finished it. And if you finished it you must have the report to prove it.

cameel commented 6 years ago

Actually now I see why I initially said it must not be NULL. I confused it with the ack. We don't have the ack yet in the FORCING REPORT phase but we do have the report and it must not be NULL. I'm changing the spec back.

dybi commented 6 years ago

@cameel, please update the Subtask table (due to: https://github.com/golemfactory/concent/issues/329):

dedline (int)
result_package_size (int, nullable) from ReportComputedTask

cameel commented 6 years ago

Updated, but with some alterations:

computation_deadline instead of deadline to prevent confusion with next_deadline. Otherwise one might think that it's the "current" (as opposed to the "next") deadline which is not true. It's just a deadline for the computation phase while next_deadline is the deadline of the current state. Actually, now that I think about it, we would be better off calling next_deadline state_deadline but it's too late to change it now.
result_package_size can be NULL if and only if report_computed_task is NULL. Don't forget to add a validation for that in the model. In views they're not be necessary though because it's an invariant that should never be violated as long as the code is corrrect.

cameel commented 6 years ago

State diagram updated after applying suggestions from #432:

Conditions on arrows no longer overlap.
Time constraints updated to match #165.
Added explanation of notation used on the diagram.
The arrow between VERIFICATION FILE TRANSFER and ADDITIONAL VERIFICATION states is now green.
Fixed the name of ForceSubtaskResultsResponse (was SubtaskResultsResponse).
Added missing conditions for SubtaskResultsVerify.

cameel commented 6 years ago

Another update of state diagram. Only small tweaks. See https://github.com/golemfactory/concent/pull/432#issuecomment-399028222

wyspiansky87 commented 6 years ago

Please add computation_deadline and result_package_size to Subtask

cameel commented 6 years ago

Update: ForceGetTaskResultRejected is no longer a valid value in ConcentResponseType. It's not needed and should have never been there. Bug reported by @rwrzesien in #823.

kbeker commented 5 years ago

@cameel Subtask model now has additional field protocol_version which contains version of messages related to given subtask.

golemfactory / concent