primary_replica value does not match raft_leader

riteshsinha-ritz commented 4 years ago

I created a 3-server cluster, shut the instances down, and then brought them back up. I expected the primary_replica value to match the raft_leader value in the table_status system table output, but instead I saw this:

{'db': 'ycsb',
  'id': 'cfda1c0a-5bfd-465f-9299-1a9e95c2ee70',
  'name': 'usertable',
  'raft_leader': 'rethinkdb_first',
  'shards': [{'primary_replicas': ['rethinkdb_third'],
              'replicas': [{'server': 'rethinkdb_third', 'state': 'ready'},
                           {'server': 'rethinkdb_second', 'state': 'ready'},
                           {'server': 'rethinkdb_first', 'state': 'ready'}]}],
  'status': {'all_replicas_ready': True,
             'ready_for_outdated_reads': True,
             'ready_for_reads': True,
             'ready_for_writes': True}}]>

Note that 'raft_leader' is 'rethinkdb_first' while 'primary_replicas' is ['rethinkdb_third'].

The RethinkDB docs mention that all reads and writes to any key in a given shard always get routed to its respective primary, where they're ordered and evaluated.

If this is the case, then when the raft_leader changes, do all client requests still get routed to the primary replica? What happens when the raft_leader value doesn't match the primary_replica value? I was expecting that the primary replica would always be the Raft leader.

srh commented 4 years ago

When there are multiple shards, their primary replicas would typically be different servers, but there is only one raft leader. So in general, the primary replicas aren't necessarily the same as the raft leader. The raft cluster manages table configuration; writes do not get routed through it.
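
To make the distinction concrete, here is a minimal Python sketch that compares each shard's primary replicas with the table's raft leader. It works on a plain dict shaped like the table_status output quoted above. The helper name primaries_vs_raft_leader and the two-shard document are hypothetical, made up for illustration; only the first document's values come from this issue.

```python
# Sketch only: operates on a dict shaped like a row of the table_status
# system table. With the Python driver, such a row would typically come from
# r.db('rethinkdb').table('table_status'), but nothing here needs a server.

def primaries_vs_raft_leader(status):
    """Return (raft_leader, rows), where each row is
    (shard_index, primary_replicas, leader_is_a_primary)."""
    leader = status["raft_leader"]
    rows = []
    for index, shard in enumerate(status["shards"]):
        primaries = shard["primary_replicas"]
        rows.append((index, primaries, leader in primaries))
    return leader, rows


# Values taken from the output reported in this issue (one shard).
reported = {
    "name": "usertable",
    "raft_leader": "rethinkdb_first",
    "shards": [{"primary_replicas": ["rethinkdb_third"]}],
}

# Hypothetical two-shard table: the shards have different primaries,
# so at most one of them can coincide with the single raft leader.
two_shards = {
    "name": "example",
    "raft_leader": "rethinkdb_first",
    "shards": [
        {"primary_replicas": ["rethinkdb_first"]},
        {"primary_replicas": ["rethinkdb_second"]},
    ],
}

for doc in (reported, two_shards):
    leader, rows = primaries_vs_raft_leader(doc)
    for index, primaries, same in rows:
        print(f"{doc['name']} shard {index}: primaries={primaries}, "
              f"raft_leader={leader}, match={same}")
```

A mismatch reported by this check is not an error condition: the primary replica handles the shard's reads and writes, while the raft leader only coordinates table configuration.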

riteshsinha-ritz commented 4 years ago

Thanks Sam. Much appreciated.

rethinkdb / rethinkdb

primary_replica value does not match raft_leader #6833