kontena / kontena

The developer friendly container and micro services platform. Works on any cloud, easy to setup, simple to use.
https://www.kontena.io/
Apache License 2.0
1.47k stars 127 forks source link

Agent RPC IDs could theoretically collide #1924

Open SpComb opened 7 years ago

SpComb commented 7 years ago

The server RpcClient choses a random integer between 0..2_147_483_647 as the RPC request ID:

  def request_id
    rand(2_147_483_647)
  end

The agent RpcClient has a more complicated logic where it tracks pending requests, and avoids currently in-flight request IDs:

    def request_id
      id = -1
      until id != -1 && !@requests[id]
        id = rand(2_147_483_647)
      end
      id
    end

See #1855 for some discussion about this:

It is possible for these request IDs to collide, if an RpcClient#request task manages to choose the same request ID as some previous request that is still in flight, waiting for the remote server to respond. Request ID collisions could lead to nasty side-effects, where RPCs requests return unexpected results.

The current stateful agent implementation will work to avoid Request ID collisions, unless it crashes with in-flight RPCs. However, the agent implementation also leaks memory on RPC request timeouts, as it can't re-use those request IDs...

The simplest approach would be to use something similar to TCP sequence numbers... initialize @request_id = rand(1 << 32) and then use @request_id = (@request_id + 1) % (1 << 32).

However, this would not work on the server, where there will be multiple proceses with independent RpcClient states initiating requests... perhaps it would make more sense to use the MongoPubsub ID for the RpcClient implementation, and then have each WebsocketBackend choose their own request IDs?

jnummelin commented 7 years ago

One other approach is to use Twitters snowflake pattern where the id is combined from timestamp, node/worker number and running counter. So something like:

(Time.now.to_i << 22) | (node_num << 12) | counter++
jakolehm commented 7 years ago

I really would like an explanation of these magic numbers ... :)

jnummelin commented 7 years ago

https://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems slides 23&24

SpComb commented 7 years ago

Talking of magic numbers... the current rand(2_147_483_647) is only using half of the uint32 space... it should be rand(1 << 32), or rand(4294967296)?

jakolehm commented 7 years ago

@SpComb probably yes, but we did hit into some weird msgpack issues back in the days.. maybe msgpack updates have fixed this.