golemfactory / concent

Repository for Concent Service sources

[Blueprint] Concent Signing Service #597

Open cameel opened 6 years ago

cameel commented 6 years ago

Concent Signing Service

Components

[Diagram: signing-service-components]

Wire protocol

Signing Service, Middleman and Concent communicate by sending data over TCP connections.

TCP is a stream-oriented protocol which means that the client simply receives a stream of bytes and has to interpret it as messages on its own. We're going to build a message-oriented protocol on top of it. We need the following properties:

  1. Framing: we need to know when one message ends and another starts.
  2. Error detection: we need to be able to detect malformed data.
  3. Error recovery: one malformed message should not disrupt all the others that come after it.
  4. Matching responses to requests: when one side skips a message the other should detect it and not pair a request with the wrong response.
  5. Payload: we need to be able to serialize our application-level data and put it in a data frame.
  6. Security: the data does not need to be secret but any modification in transit should be detected.

One important assumption that greatly simplifies things is that the protocol is not meant to connect random parties. Both parties are expected to have a key pair and to know each other's public keys.

Data frames

To get framing we'll use separators. A separator will be an arbitrary, unique and constant string of bytes. Encountering it means that the current message has just ended - even if it's incomplete - and a new one has started. The data inside the frame should be escaped so that if it happens to contain the string of bytes we use as a separator, the frame does not end prematurely.

For ease of implementation, it's best to use a single byte as the separator. Escaping a multi-byte separator has many corner cases and may be hard to implement correctly.

We're using separators rather than a field with the overall length of the message because otherwise a frame with malformed length could "eat" all the data following it. This would result in the other side losing track of where subsequent messages start and end. Using a separator ensures that in case of such an error only a single message is damaged and the communication can continue without restarting the connection.
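
The framing rules above can be sketched with byte stuffing. This is only an illustration: the separator and escape values (0x7E and 0x7D) and the XOR-with-0x20 rule are borrowed from HDLC-style framing and are assumptions, since the blueprint does not pick specific bytes. All function names are hypothetical.

```python
SEPARATOR = 0x7E    # assumed value; the blueprint only requires a constant byte
ESCAPE = 0x7D       # assumed escape byte
ESCAPE_MASK = 0x20  # escaped bytes are XOR-ed with this mask


def escape(payload: bytes) -> bytes:
    """Replace separator/escape bytes so they never appear inside a frame."""
    out = bytearray()
    for byte in payload:
        if byte in (SEPARATOR, ESCAPE):
            out += bytes([ESCAPE, byte ^ ESCAPE_MASK])
        else:
            out.append(byte)
    return bytes(out)


def unescape(data: bytes) -> bytes:
    """Reverse escape(); raise ValueError if the frame ends mid-escape."""
    out = bytearray()
    stream = iter(data)
    for byte in stream:
        if byte == ESCAPE:
            try:
                out.append(next(stream) ^ ESCAPE_MASK)
            except StopIteration:
                raise ValueError("frame ends with a dangling escape byte") from None
        else:
            out.append(byte)
    return bytes(out)


def split_frames(buffer: bytes):
    """Split received bytes into raw (still escaped) frames.

    Returns (frames, remainder). The remainder is an unfinished frame
    that has to wait for more data or for the next separator.
    """
    *complete, remainder = buffer.split(bytes([SEPARATOR]))
    return [chunk for chunk in complete if chunk], remainder
```

Frames are returned still escaped so that each one can be unescaped and validated on its own. A frame that fails to unescape is then rejected individually while the frames after it are still processed, which is the error recovery property the separators are meant to provide.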

Frame structure

The frame contains the following data:

The header has a constant length. Everything between the end of the header and the beginning of the next separator is the payload. The receiver can't be sure that it has received the whole payload until it either gets a separator or the stream ends.

TCP also provides additional features that increase robustness: checksums, ordering, retransmissions, fragmentation, etc.

Data frames are not encrypted but they contain a signature which protects the data against tampering. Since we do not really need the communication to be secret, this saves us the need to use a more complex solution like an SSL-encrypted connection.

Payload types

Type Payload description
ERROR An error code and an error message.
GOLEM_MESSAGE A single, serialized Golem message.
AUTHENTICATION_CHALLENGE A random string of bytes.
AUTHENTICATION_RESPONSE A digital signature of the content sent as authentication challenge.

Error codes

Code Description
InvalidFrame A malformed or incomplete protocol frame. This does not cover invalid data in the payload.
InvalidFrameSignature Frame signature does not match the content of the frame.
DuplicateFrame Another frame with the same request ID has been received over this connection before.
InvalidPayload A payload does not pass validations. Invalid types or content.
UnexpectedMessage Frame and the payload are valid but this message is not allowed at this point in the protocol. E.g. when a Signing Service receives TransactionRejected which it never should.
AuthenticationFailure Wrong response to an authentication challenge.
ConnectionLimitExceeded There are too many connections of a given type and the current connection has been closed. E.g. more than one authenticated connection with a signing service.
MessageLost It was not possible to deliver a response corresponding to the frame with the same request ID as this response, either because the request did not reach the recipient or because the response from the recipient has been lost. Getting this in a response to a lost message is never guaranteed but it should be sent whenever possible to prevent the sender from waiting unnecessarily.
ConnectionTimeout Connection has been closed because the expected response was not received in time.

Authentication protocol

The authentication is done with a single request-response exchange. The server sends a challenge - a random string of bytes of arbitrary length. The client is expected to sign it with its private key and send the signature as a response.

The protocol is very simple thanks to the fact that we do not have to deal with key exchange. We assume that the server knows the public key of the client ahead of time.

To make it even simpler, the challenge and the response are sent directly in the payload section of a protocol frame. They're not wrapped in a Golem message.

Authentication is necessary only between the Signing Service and the Middleman. Concent always connects on a separate port that's available only from inside the cluster.

Messages

The following Golem messages will be exchanged using the wire protocol described above:

TransactionSigningRequest

Field Type Size
nonce int up to 32 bytes
gasprice int up to 32 bytes
startgas int up to 32 bytes
to string 20 bytes
value int up to 32 bytes
data binary unlimited
from string 20 bytes
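
The size limits in the table above could be checked along these lines. This is a sketch, not the golem-messages definition: it assumes the integers are unsigned, that addresses are handled as raw 20-byte values, and it names the last field `from_address` because `from` is a reserved word in Python (a later comment in this thread asks for the same rename). `validation_errors` is a hypothetical helper.

```python
from dataclasses import dataclass

MAX_INT = 2 ** 256 - 1  # largest unsigned value that fits in 32 bytes
ADDRESS_LENGTH = 20     # size of an Ethereum address in bytes


@dataclass
class TransactionSigningRequest:
    nonce: int
    gasprice: int
    startgas: int
    to: bytes
    value: int
    data: bytes
    from_address: bytes

    def validation_errors(self) -> list:
        """Return a list of problems; an empty list means the sizes are fine."""
        errors = []
        for name in ("nonce", "gasprice", "startgas", "value"):
            number = getattr(self, name)
            is_int = isinstance(number, int) and not isinstance(number, bool)
            if not is_int or not 0 <= number <= MAX_INT:
                errors.append(f"{name} must be an integer of up to 32 bytes")
        for name in ("to", "from_address"):
            address = getattr(self, name)
            if not isinstance(address, bytes) or len(address) != ADDRESS_LENGTH:
                errors.append(f"{name} must be exactly 20 bytes long")
        if not isinstance(self.data, bytes):
            errors.append("data must be a byte string")
        return errors
```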

SignedTransaction

Field Type Size
nonce int up to 32 bytes
gasprice int up to 32 bytes
startgas int up to 32 bytes
to string 20 bytes
value int up to 32 bytes
data binary unlimited
v int 1 byte
r int 32 bytes
s int 32 bytes

TransactionRejected

Field Type Size
nonce int up to 32 bytes
reason TransactionRejectionReason

TransactionRejectionReason enum

Code Description
InvalidTransaction The message itself is valid but does not describe a valid Ethereum transaction. Use this if it passes our validations but the Ethereum library still rejects it for any reason.
UnauthorizedAccount The service is not authorized to transfer funds from the account specified in the transaction.

Sequence of operation

[Diagram: signing-service-cross-functional]

Concent Signing Service

The Signing Service connects to Middleman as a client but then listens for requests coming from Concent via Middleman. The underlying protocol is TCP and data sent over that is expected to conform to the protocol description above.

The service first responds to an authentication challenge and then is ready to receive TransactionSigningRequests. Each signing request ends either with a rejection or a signed transaction being sent back.

Main loop

  1. Open a TCP connection to Middleman
  2. Wait for an authentication challenge
  3. Respond to the challenge
  4. Run the connection handler

Steps above are performed in a loop. Each cycle lasts until the connection ends or drops.

The service only stops when it detects a shutdown signal from the operating system or Ctrl+C from the user.

Authentication

Immediately after establishing a connection the service starts listening for incoming messages. Anything other than AUTHENTICATION_CHALLENGE is treated as an authentication failure. The service waits for the challenge for a limited time and a timeout is treated as a failure as well.

The challenge is a random string of bytes. Its randomness guarantees that an attacker won't be able to predict it and reuse a previously intercepted message for authentication - for that reason a cryptographically secure pseudo-random number generator must be used.

In response to the challenge the service sends an AUTHENTICATION_RESPONSE frame containing a digital signature of the random string. The server is expected to send an ERROR frame and terminate the connection if the authentication fails.

Authentication failures are treated the same way as connection failures. The service waits for a moment and tries again.
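
The main loop and the retry behaviour described above might look roughly like this. It is a sketch under assumptions: `connect`, `authenticate`, `handle_connection` and `should_stop` are hypothetical callables supplied by the caller, and the 5-second delay is an arbitrary placeholder for "waits for a moment".

```python
import time


def run_service(connect, authenticate, handle_connection, should_stop,
                retry_delay=5.0, sleep=time.sleep):
    """Reconnect to Middleman until should_stop() returns True.

    Connection failures and authentication failures are handled the same
    way: close the connection, wait and try again. Exceptions other than
    OSError are not caught, so they crash the service.
    """
    while not should_stop():
        connection = None
        try:
            connection = connect()                 # 1. open a TCP connection
            if authenticate(connection):           # 2-3. challenge and response
                handle_connection(connection)      # 4. run the connection handler
        except OSError:
            pass  # refused, dropped or timed-out connection: retry below
        finally:
            if connection is not None:
                connection.close()
        if not should_stop():
            sleep(retry_delay)
```

Catching only `OSError` (which covers refused connections, resets and socket timeouts in Python) keeps the "any other error should crash the service" rule intact.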

Connection handler

  1. Read a message from socket
  2. Validate the message and decide whether to sign or not
  3. Sign the transaction and respond with SignedTransaction or respond with TransactionRejected.
  4. Write the response to the socket

These steps are performed in a loop. Any error in the handler interrupts the handler and the connection.

Message validation

If any of the following is not true, the message is considered invalid:

An invalid message results in an ERROR frame being sent back.

If the message is valid, the service decides whether it's OK to sign it. The following criteria must be satisfied:

If any of them is not satisfied, the service responds with TransactionRejected.

Error handling

The service should deal with failure in the following way:

Any other error should crash the service. The service should log the exception, send a crash report and exit with an error code. It can expect that it will be automatically restarted.

Middleman

[Diagram: middleman-message-routing-with-queues]

Middleman is a component that routes messages between multiple Concent processes and a single Signing Service process.

There's one, long-lived TCP connection with the Signing Service and a set of short-lived TCP connections with Concent.

For a brief time, during the authentication there may be more than one connection with clients claiming to be the Signing Service but as soon as one of them authenticates successfully, the other connections are terminated.

When Concent wants to communicate with the Signing Service, it establishes a new connection with Middleman and sends a request. Middleman starts a new handler to service the connection (Request Producer). This handler listens until it receives a valid message, adds the message to the Request Queue and goes back to listening. Another handler (Request Consumer) is responsible for sending queued messages to the Signing Service. It also assigns each message a unique number and adds it to the Message Tracker. When a response comes from the Signing Service, Response Producer uses the number to pair it with the corresponding request and puts it in the corresponding Response Queue. Each Response Queue is serviced by a Response Consumer which sends the response over the connection the request originally came from.

Initialization

When Middleman starts, it runs two TCP servers on separate ports:

Queues and Message Tracker are initially empty.

Middleman listens on both ports for incoming connections and when one is established, runs the corresponding connection handler.

When the Signing Service establishes a new connection with Middleman, the TCP server passes control first to the authentication handler and, if authentication is successful, to a connection handler that's now responsible for maintaining the connection. In case of connections with Concent the logic is simpler because there's no authentication.

Authentication

Initially Middleman allows multiple connections with the Signing Service as long as none of them is authenticated. For every such connection Middleman runs Signing Service authentication handler. They can all try to pass the authentication challenge but as soon as one succeeds, the other connections are terminated.

When a connection becomes authenticated, Middleman stops accepting new connections on the external port and starts the Signing Service connection handler.

Signing Service authentication handler

  1. Generate a random string using a cryptographically secure pseudo-random number generator.
  2. Send the string in an AUTHENTICATION_CHALLENGE frame.
  3. Start listening for an AUTHENTICATION_RESPONSE frame from the service.
  4. If the service does not respond in a predefined time, send an ERROR frame and terminate the connection.
  5. Validate the response.
    • The response should contain a signature matching the random string.
    • If the validation fails, send an ERROR frame and terminate the connection. An invalid or unexpected message is interpreted as a failure.
  6. Otherwise return a success.
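
The handler steps above can be sketched as follows. The transport and the cryptography are deliberately left out: `send_frame`, `receive_frame` (assumed to raise `TimeoutError` when the service does not answer in time) and `verify_signature` are hypothetical callables, and the 32-byte challenge size is an assumption because the blueprint allows an arbitrary length.

```python
import secrets

CHALLENGE_SIZE = 32  # assumed; the blueprint allows an arbitrary length


def authenticate_signing_service(send_frame, receive_frame, verify_signature):
    """Run the challenge-response exchange; return True on success.

    On failure an ERROR frame is sent and False is returned, leaving it
    to the caller to terminate the connection.
    """
    # 1. secrets uses the operating system's cryptographically secure RNG.
    challenge = secrets.token_bytes(CHALLENGE_SIZE)
    # 2. Send the challenge.
    send_frame("AUTHENTICATION_CHALLENGE", challenge)
    # 3-4. Wait for the response for a limited time.
    try:
        frame_type, payload = receive_frame()
    except TimeoutError:
        send_frame("ERROR", b"ConnectionTimeout")
        return False
    # 5. Any other frame type or a bad signature counts as a failure.
    if frame_type != "AUTHENTICATION_RESPONSE" or not verify_signature(challenge, payload):
        send_frame("ERROR", b"AuthenticationFailure")
        return False
    # 6. Success.
    return True
```

In a real deployment `verify_signature` would wrap whatever signature check Golem uses for its messages, with the Signing Service's public key known to Middleman ahead of time.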

Signing Service connection handler

The connection handler receives a socket reader and a socket writer. Now it's time to start message handlers:

The handler is responsible for handling errors reported by message handlers and restarting them when they crash.

The handler keeps the connection open until the other side closes it. It only ever closes the connection on its own when the whole application is shutting down.

When the connection ends, the handler stops all the message handlers it has created. Message Tracker and queues are not cleared. Existing connections to Concent are kept open.

Concent connection handler

When Concent establishes a new connection with Middleman, the TCP server passes control to a handler that's now responsible for maintaining it. Middleman keeps track of multiple connections by assigning them unique IDs.

The handler receives a socket reader and a socket writer. It immediately starts a Response Consumer (which gets the writer) and a Request Producer (which gets the reader).

The handler is responsible for handling errors reported by its consumer and producer and restarting them if they crash.

The handler keeps the connection open until the other side closes it. It only ever closes the connection on its own when the whole application is shutting down.

After the connection ends, the handler stops its producer and consumer.

Request Producer

Request Producer receives a socket reader and waits for incoming messages. Each message is added to the Request Queue along with the ID of the connection it came over.

If the producer crashes before it manages to add the message to the queue, the message is lost.

Request Queue

The Request Queue is a synchronization mechanism used for ordering messages coming from multiple Request Producers and storing them in memory.

Every item in the queue consists of:

Request Consumer

Request Consumer receives a socket writer for the connection with the service when it starts.

The handler keeps consuming messages from Request Queue and sending them to the Signing Service.

If the Response Queue corresponding to the connection ID does not exist, the message is dropped silently. Middleman won't be able to deliver the response anyway so there's no point in even sending the request.

Every message is assigned a unique Signing Service request ID and sent along with it. This ID is added to the MessageTracker along with the corresponding connection ID. This will make it possible for Response Producer to know which connection should be used to pass the response to Concent.

A message is not removed from the queue until it's successfully sent over the connection. This way the consumer can retry if it crashes and is restarted by the connection handler.
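
A minimal, single-threaded sketch of this consumer logic, assuming the Request Queue is a `deque` of `(connection_id, message)` pairs, the Message Tracker is a plain dict and `send` is a hypothetical function that writes one frame to the Signing Service connection:

```python
import itertools
from collections import deque


def drain_request_queue(request_queue, response_queues, tracker, send, id_source):
    """Send every queued request, removing it only after a successful send."""
    while request_queue:
        connection_id, message = request_queue[0]  # peek, do not remove yet
        if connection_id in response_queues:
            request_id = next(id_source)
            tracker[request_id] = connection_id
            try:
                send(request_id, message)
            except Exception:
                # Sending failed: forget the ID and leave the item queued
                # so that a restarted consumer can retry it.
                del tracker[request_id]
                raise
        # Either sent successfully or dropped because nobody awaits the response.
        request_queue.popleft()


if __name__ == "__main__":
    queue = deque([("conn-1", b"request A"), ("conn-gone", b"request B")])
    tracker = {}
    drain_request_queue(queue, {"conn-1": deque()}, tracker,
                        send=lambda request_id, message: None,
                        id_source=itertools.count(1))
    print(tracker)
```

The tracker entry is added before sending so that a fast response cannot arrive before Middleman knows where to route it. Whether that ordering matters depends on the concurrency model, which the blueprint leaves open.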

Message Tracker

Message Tracker is a mapping between messages and Concent connections they came over.

Request IDs are used as keys and must be unique. There may be multiple messages coming from the same connection.

Entries are added by Request Consumer and removed by Response Producer.

Each entry has the following information associated with it:

The order in which entries were sent and added to the tracker is preserved. The Signing Service is required to send responses in the same order. If the Middleman receives a response from the Signing Service out of order, all the requests sent before it are considered lost and the corresponding entries are removed from the Message Tracker (more about it below).

Response Producer

Response Producer listens for messages coming from the Signing Service and adds them to the Response Queues.

The message must come along with an ID. This ID is used to look up the connection ID in the Message Tracker. If there's no corresponding entry, the message is ignored and the processing ends.

Otherwise the producer starts by discarding any preceding messages from Message Tracker. For each one, an ERROR frame is added as a response to the Response Queue corresponding to the connection ID.

Then the response is added to the Response Queue corresponding to its connection ID. The entry is removed from the Message Tracker.

If there's no queue corresponding to a response (e.g. the connection has already been closed), the response is silently discarded.

Processing of each entry from the Message Tracker should be atomic - i.e. either both the entry is removed and a response is queued or neither queue nor tracker is modified. This is to prevent situations where Concent gets no response or two responses for a single request.
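
The routing rules above, including the handling of out-of-order responses, can be sketched as follows. This is a simplified, single-threaded illustration: the tracker stores only the connection ID for each request ID, Response Queues are plain lists keyed by connection ID, and using the MessageLost code for skipped requests is an assumption. `MessageTracker` and `route_response` are hypothetical names.

```python
from collections import OrderedDict


class MessageTracker:
    """Maps request IDs to connection IDs and remembers the sending order."""

    def __init__(self):
        self._entries = OrderedDict()

    def add(self, request_id, connection_id):
        if request_id in self._entries:
            raise ValueError(f"duplicate request ID: {request_id}")
        self._entries[request_id] = connection_id

    def resolve(self, request_id):
        """Remove the entry and every older one.

        Returns (connection_id, lost) where lost lists the older
        (request_id, connection_id) pairs, or None for an unknown ID.
        """
        if request_id not in self._entries:
            return None
        lost = []
        while True:
            oldest_id, connection_id = self._entries.popitem(last=False)
            if oldest_id == request_id:
                return connection_id, lost
            lost.append((oldest_id, connection_id))


def route_response(tracker, response_queues, request_id, response):
    resolved = tracker.resolve(request_id)
    if resolved is None:
        return  # no matching entry: ignore the message
    connection_id, lost = resolved
    # Requests sent before this one are considered lost.
    for lost_id, lost_connection in lost:
        queue = response_queues.get(lost_connection)
        if queue is not None:
            queue.append((lost_id, ("ERROR", "MessageLost")))
    # Queue the actual response unless the connection is already gone.
    queue = response_queues.get(connection_id)
    if queue is not None:
        queue.append((request_id, response))
```

In a concurrent implementation the tracker update and the queue writes would additionally need to happen under one lock or in one task step to get the atomicity described above. This sketch gets it only by virtue of being single-threaded.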

Response Queue

Each Response Consumer has its own Response Queue. The queue stores messages that should be sent back over a particular connection with Concent.

Every item in the queue consists of:

Response Consumer

There is a separate instance of Response Consumer for each connection with Concent. The consumer receives a socket writer for the connection and has access to a single Response Queue. It keeps consuming messages from the queue and sending them over its connection until Concent closes it.

The message is sent with request ID matching the one the original request from Concent had.

If the consumer crashes, Middleman closes the connection and all the data that was not yet sent to this particular Concent instance is lost.

SCI transaction signing callback

The callback

The callback is a piece of code that is supplied by Concent and runs inside SCI. It receives an object containing an unsigned transaction and is responsible for putting a signature in the object.

It operates in the following way:

  1. Create a TransactionSigningRequest.
  2. Establish a TCP connection with Middleman.
  3. Send the TransactionSigningRequest.
  4. Read the response.
  5. Close the connection.
  6. If the response comes in time and is a valid SignedTransaction, copy the signature to the transaction object passed to the callback by SCI and return.
  7. Otherwise interrupt the handler and crash the request.
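
The steps above can be sketched as follows. Everything specific here is an assumption: `encode_request` and `decode_response` stand in for building and parsing protocol frames, `decode_response` is expected to return a `(v, r, s)` tuple or `None`, the 10-second timeout is arbitrary, and a single `recv` call replaces proper separator-based frame reading to keep the example short.

```python
import socket


class SigningFailed(Exception):
    """Raised when no valid SignedTransaction comes back in time."""


def connect_to_middleman(address):
    """Open a TCP connection; `address` is a (host, port) tuple."""
    return socket.create_connection(address)


def sign_via_middleman(transaction, open_connection, encode_request,
                       decode_response, timeout=10.0):
    request = encode_request(transaction)            # step 1
    try:
        with open_connection() as connection:        # steps 2 and 5
            connection.settimeout(timeout)
            connection.sendall(request)              # step 3
            raw_response = connection.recv(65536)    # step 4 (simplified)
    except OSError as error:                         # includes socket timeouts
        raise SigningFailed("no response from Middleman") from error
    signature = decode_response(raw_response)
    if signature is None:                            # step 7
        raise SigningFailed("response is not a valid SignedTransaction")
    # Step 6: copy the signature into the object passed in by SCI.
    transaction.v, transaction.r, transaction.s = signature
    return transaction
```

Passing `open_connection` in as a parameter (for example `lambda: connect_to_middleman(address)`) keeps the callback testable without a running Middleman.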

The request that triggers the callback

The callback is going to run in the context of a request from a Golem client submitting or retrieving a protocol message. Exceptions should interrupt the request and result in an HTTP 502 response. This way all the incomplete changes get rolled back and the client hopefully tries again.

Synchronization

It's currently not possible to run two SCI operations in parallel. For that reason each operation should operate inside a critical section. This means that at any given time there will only be a single request and response pair going through Middleman. This is an obvious performance bottleneck and hopefully a better solution will be found soon.

cameel commented 6 years ago

@dybi Diagrams representing the two versions of message routing in Middleman that I showed you yesterday:

With queues

With one change - we do need to keep track of messages after all. But message IDs will not be inside messages.

This one with queues will be added to the blueprint but I'm pasting both here in a comment in case I need to refer to them later.

[Diagram: middleman-message-routing-with-queues]

Without queues

Essentially the same as the one that's in the blueprint now but in a slightly different style.

This one looks simpler but really it's just showing less detail which is why I prefer the one with queues - even if the queues remain just a concept rather than actual implementation.

[Diagram: middleman-message-routing-no-queues]

dybi commented 6 years ago

Open a TCP connection to Concent

@cameel , I guess this is a little bug. You meant MiddleMan, didn't you? ;)

cameel commented 6 years ago

Yeah. Fixed.

cameel commented 6 years ago

Update: I have moved wire protocol description from #618 to the blueprint above.

cameel commented 6 years ago

Update:

cameel commented 6 years ago

The blueprint is now complete. There are still small issues that need to be ironed out but no more major rewrites are needed and all the issues that describe the implementation have been created.

Changes:

And here's a map showing dependencies between the issues :)

Signing Service

          /-----------------> #632
         /
#633 ----
         \
          #625 ---> #623 ---> #599
         /
#631 ---/

Middleman

#629 --------\
              \
               ---- #618 ---> #615
              /
#630 --------/
             \
              \-------------> #616

Callback

#635 -----------------------> #632

#636

cameel commented 6 years ago

Update: MiddlemanError replaced with a special ERROR frame.

kbeker commented 6 years ago

@cameel

Created using the same key that's used for signing Golem messages.

What should it look like? Should it be done the same way as in golem_messages, where you pass a deserialized message with keys and get a serialized one as the function's output? Or should the message be hashed? If it should be hashed, which method should we use?

cameel commented 6 years ago

I'm not sure I understand what you mean. In both cases the message gets hashed because the signature is always computed using a hash of the content.

When you want to sign a frame you should get the whole serialized content, compute a hash and pass it to the same cryptographic functions Golem uses when it creates a signature (with the same key the service uses to sign its Golem messages).

How you organize all this into functions is a question to @rwrzesien.

cameel commented 6 years ago

Updated wire protocol description:

cameel commented 6 years ago

Updated:

dybi commented 6 years ago

@cameel, in case of ErrorFrame, when original request_id is not available, 0 should be used

rwrzesien commented 6 years ago

@cameel Please rename the "from" field to "from_address".