k3s-io / kine

Run Kubernetes on MySQL, Postgres, sqlite, dqlite, not etcd.
Apache License 2.0

Active-active MySQL cluster primary key conflict: the auto_increment_increment parameter has no effect #71

Closed linyinli closed 1 year ago

linyinli commented 3 years ago

I set up an active-active MySQL cluster as the K3s datastore. To avoid primary key conflicts, I added the following parameters to MySQL:

MySQL 1 my.cnf: auto_increment_offset = 1, auto_increment_increment = 2

MySQL 2 my.cnf: auto_increment_offset = 2, auto_increment_increment = 2

When I test with plain INSERT statements, it works well. But when I use K3s, the parameters do not work. Why does Kine make these parameters not work?
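With these offsets, each node only hands out ids in its own congruence class, so a node that commits two writes back to back burns an id the other node will never use. The toy Go model below (a deliberate simplification of MySQL's behavior, not real server code) shows where the gap that trips up Kine comes from:

```go
package main

import "fmt"

// node models one MySQL server's AUTO_INCREMENT behavior under
// auto_increment_offset / auto_increment_increment (simplified: each node
// hands out ids congruent to its offset modulo the increment, always
// above the highest id it has seen via replication).
type node struct {
	offset, increment, max int64
}

func (n *node) nextID() int64 {
	id := n.max + 1
	// advance to the next id in this node's congruence class
	for (id-n.offset)%n.increment != 0 {
		id++
	}
	n.max = id
	return id
}

// observe simulates replication of a peer's committed id.
func (n *node) observe(id int64) {
	if id > n.max {
		n.max = id
	}
}

func main() {
	a := &node{offset: 1, increment: 2}
	b := &node{offset: 2, increment: 2}
	fmt.Println(a.nextID()) // 1
	fmt.Println(a.nextID()) // 3 -- id 2 is skipped forever
	b.observe(3)
	fmt.Println(b.nextID()) // 4
}
```

The parameters do take effect; the resulting sequence (1, 3, 4, ...) is strictly increasing but not gap-free, which is what Kine objects to, as the next comment explains.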

brandond commented 3 years ago

Kine assumes that there will be no gaps in the auto-increment sequence - that is to say, they are strictly sequential. Any clustering configuration that violates that constraint will most likely cause problems for Kine.

poblin-orange commented 3 years ago

This constraint excludes any natively active-active SQL engine, in particular MySQL Galera. Is this a possible enhancement for Kine, or a design constraint?

brandond commented 3 years ago

This is a design constraint due to how we use the auto-increment pk as the MVCC revision counter. If you need an active-active backend you're probably better off going with etcd.

zqzten commented 3 years ago

That the auto-increment sequence should be strictly increasing is reasonable, but I don't quite get why it must be strictly sequential. Will a sequence like 1->3->5->... cause any problem in Kine? In other words, what's the main purpose of filling gaps?

brandond commented 3 years ago

It needs to be strictly sequential: we use the sequence numbers as MVCC revision counters, which are STRICTLY monotonically increasing. Any gap would indicate that we have missed some changes to the datastore. Under normal circumstances there should not be any gap records; they are created as an attempt to handle unexpected behavior, and when the database is operating as required, their existence indicates that one of the Kine datastore clients missed some events in the watch stream.
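The invariant described here can be sketched as follows (a hypothetical illustration, not kine's actual code): under the strictly-sequential assumption, any jump larger than 1 is read as missed events, and each skipped revision is a candidate for a gap ("fill") record.

```go
package main

import "fmt"

// missedRevisions lists the revisions between the last revision a watcher
// delivered and the next one it sees. Under kine's strictly-sequential
// assumption, a non-empty result means events were missed, and kine would
// write a gap record for each skipped revision. Hypothetical sketch only.
func missedRevisions(last, next int64) []int64 {
	var gaps []int64
	for r := last + 1; r < next; r++ {
		gaps = append(gaps, r)
	}
	return gaps
}

func main() {
	fmt.Println(missedRevisions(6, 7)) // []  -- normal case, no gap
	fmt.Println(missedRevisions(7, 9)) // [8] -- revision 8 looks missed
}
```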

zqzten commented 3 years ago

@brandond Thanks for the detailed explanation. But I don't agree with this part:

Any gap would indicate that we have missed some changes to the datastore.

It is not always the case. For example, a failed INSERT in MySQL will leave a gap. Say we have two Kine clients connected to one MySQL database, and they both receive a create request for the same key at the same time (which will likely happen during multi-instance apiserver bootstrapping). Both try to insert the same key with the same prev_revision (0) into the database, and this results in a gap. Here's what happens in detail:

  1. At first, suppose the auto-increment id of the kine table is 6.
  2. Kine client A succeeds in inserting the key; the id is now 7.
  3. Kine client B then tries to insert the same key. This insertion fails due to the duplicate unique index (name, prev_revision), but it still consumes an auto-increment id.
  4. When Kine client A or B performs the next insertion, the id of the newly inserted row is 9, not 8.

We can see that in the case above, watch will definitely encounter a gap at 8, yet it actually misses nothing, because revision 8 never appeared and will never appear in the database. The same applies to #82: when OB proactively increases the id to a big number (only ever increasing, never decreasing), no data is missed by watch, but the FILL logic busies itself filling the gaps in the database, resulting in a very long watch pause.

In the end, we removed the FILL logic and let watch simply skip gaps. After testing it for a few days, we found that Kine worked as expected, with no data loss or client exceptions. So IMO the real rule for Kine's auto-increment id is: it must increase monotonically, but it does not have to increase with step 1. A sequence of ids like 1->3->5->... is good enough for Kine.
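The relaxed rule proposed here (monotonic, not step-1) amounts to a watcher that only rejects out-of-order or repeated revisions rather than filling gaps. A minimal sketch, assuming this relaxed invariant (hypothetical, not kine's code):

```go
package main

import "fmt"

// watcher accepts any strictly increasing revision stream; gaps such as
// 1 -> 3 -> 5 are tolerated rather than filled. Stale or duplicate
// revisions are still rejected, preserving MVCC ordering.
type watcher struct{ last int64 }

func (w *watcher) observe(rev int64) bool {
	if rev <= w.last {
		return false // stale or duplicate event
	}
	w.last = rev
	return true
}

func main() {
	w := &watcher{}
	for _, rev := range []int64{1, 3, 5, 5, 9, 4} {
		fmt.Println(rev, w.observe(rev)) // gaps pass; 5 (repeat) and 4 (stale) fail
	}
}
```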

kobe2000 commented 2 years ago

I am facing the same problem; any solution yet?

dweomer commented 2 years ago

@brandond I think gaps should be fine as far as the vector-clock stuff is concerned, but I seem to remember we did have a customer exhaust their key space (due to some wild error conditions driving resource churn). Remind me, did we come up with a solution for that?

brandond commented 2 years ago

Maxing out the auto-increment pk column? I think I've only ever seen it the once. Someone suggested changing the column type to something with a larger range, but I have yet to see that be necessary outside of severely broken clusters.
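Some back-of-the-envelope arithmetic on why exhaustion is rare: even at a hypothetical sustained churn of 1,000 revisions per second (an aggressive assumption; real clusters are usually far below this), a signed 32-bit id column lasts under a month, while a signed 64-bit column lasts hundreds of millions of years:

```go
package main

import (
	"fmt"
	"math"
)

// yearsToExhaust returns how many years it takes to burn through maxID
// auto-increment values at a sustained rate of revsPerSec revisions/sec.
func yearsToExhaust(maxID, revsPerSec float64) float64 {
	return maxID / revsPerSec / (365.25 * 24 * 3600)
}

func main() {
	const churn = 1000 // revisions/sec; a deliberately aggressive assumption
	fmt.Printf("int32 column: %.3f years\n", yearsToExhaust(math.MaxInt32, churn))
	fmt.Printf("int64 column: %.0f years\n", yearsToExhaust(math.MaxInt64, churn))
}
```

This is why widening the column type is only ever needed in clusters whose error loops drive extreme churn, as described above.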