tikv / sig-transaction


Adapt async commit/1pc with multi data center #63

Open · MyonKeminta opened this issue 3 years ago

MyonKeminta commented 3 years ago

Background

We are going to finish async commit in 5.0, and there's going to be another important feature in 5.0: Local/Global transactions in cross-DC deployments.

In that feature, there may be multiple PDs in a cluster allocating timestamps simultaneously. There are two kinds of timestamps: local timestamps and global timestamps. A local timestamp is guaranteed to be globally unique (although this is not yet implemented), but it is not guaranteed to be globally monotonic. A global timestamp is guaranteed to be greater than all previously allocated timestamps (global or local) and less than all timestamps (global or local) that will be allocated later. A transaction can be a local transaction (which uses a local timestamp) or a global transaction (which uses a global timestamp).

The problem

The problem is that if multi-DC deployment is used with async commit/1pc enabled, the commit_ts calculation becomes complicated: we can't record only one max_ts anymore. For example, consider this case:

  1. Read k1 from DC1, start_ts (allocated from local allocator 1) = 100, then the max_ts is updated to 100;
  2. Write k2 from DC2, start_ts (allocated from local allocator 2) = 50, calculated commit_ts = 101 (max_ts + 1)
  3. Commit k2 from DC2
  4. Read k2 from DC2, start_ts (allocated from local allocator 2) = 51; since start_ts = 51 < commit_ts = 101, the previous transaction's committed result is invisible.
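
For concreteness, here is a minimal sketch of how a single shared max_ts produces this anomaly (the helper names are made up for illustration; this is not TiKV's actual code):

use std::sync::atomic::{AtomicU64, Ordering};

// A single shared max_ts, as TiKV keeps today (simplified).
static MAX_TS: AtomicU64 = AtomicU64::new(0);

// Every read pushes max_ts up to its start_ts.
fn on_read(start_ts: u64) {
    MAX_TS.fetch_max(start_ts, Ordering::SeqCst);
}

// Async commit picks commit_ts = max_ts + 1 (simplified).
fn async_commit_ts() -> u64 {
    MAX_TS.load(Ordering::SeqCst) + 1
}

fn main() {
    on_read(100);                      // step 1: read k1 from DC1, start_ts = 100
    let commit_ts = async_commit_ts(); // step 2: write k2 from DC2, commit_ts = 101
    assert_eq!(commit_ts, 101);
    let reader_start_ts = 51;          // step 4: read k2 from DC2, start_ts = 51
    // The committed write is invisible to this reader even though it
    // committed first: reader_start_ts < commit_ts.
    assert!(reader_start_ts < commit_ts);
}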

Possible solution

Maybe we need to maintain a map (DC -> max_ts) instead of a single max_ts in TiKV. TiDB's requests to TiKV would need to mark which DC they belong to, or whether they are part of a global transaction. TiKV updates (or reads) the max_ts corresponding to that DC, or updates all max_ts entries (or reads the maximum one) for a global transaction. When a leader transfer, region merge, or anything else that requires updating max_ts happens, get a global ts from PD and update all entries.

There might still be many corner cases. For example, the number of DCs may change dynamically, which introduces more complexity in maintaining max_ts. One option is to record both a global max_ts and a local max_ts per DC for ts calculation, like this:

struct MaxTs {
    // Pushed forward by global timestamps.
    global_max_ts: AtomicU64,
    // One slot per DC, pushed forward by local timestamps from that DC.
    local_max_ts: [AtomicU64; MAX_DC_COUNT],
}
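
Continuing the sketch, the update rules from the paragraph above might look like this (MAX_DC_COUNT, the 16-DC bound, and the method names are illustrative assumptions, not a settled design):

use std::sync::atomic::{AtomicU64, Ordering};

const MAX_DC_COUNT: usize = 16; // illustrative bound on the number of DCs

impl MaxTs {
    // A request from a local transaction pushes only its own DC's slot.
    fn update_local(&self, dc: usize, ts: u64) {
        self.local_max_ts[dc].fetch_max(ts, Ordering::SeqCst);
    }

    // A global timestamp is ordered against all allocators, so it pushes
    // every slot. The same path can serve leader transfers and region
    // merges, after fetching a global ts from PD.
    fn update_global(&self, ts: u64) {
        self.global_max_ts.fetch_max(ts, Ordering::SeqCst);
        for slot in &self.local_max_ts {
            slot.fetch_max(ts, Ordering::SeqCst);
        }
    }

    // commit_ts for an async-commit local transaction in `dc`. This is
    // only safe because update_global also pushed this slot.
    fn local_commit_ts(&self, dc: usize) -> u64 {
        self.local_max_ts[dc].load(Ordering::SeqCst) + 1
    }
}

Note the cost: a global update touches every slot, and resizing the DC set (the dynamic-DC corner case above) still needs care.
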
MyonKeminta commented 3 years ago

cc @sticnarf @cfzjywxk @disksing

sticnarf commented 3 years ago

Per-DC max_ts seems like the simplest solution.

I wish we had vector clocks so much

nrc commented 3 years ago

Is async commit disabled for global transactions at the moment?

sticnarf commented 3 years ago

Is async commit disabled for global transactions at the moment?

AFAIK it's not implemented yet.

disksing commented 3 years ago

We're still working on the TiDB side. Global/local transactions can't be used yet.

youjiali1995 commented 3 years ago

Read k1 from DC1, start_ts (allocated from local allocator 1) = 100, then the max_ts is updated to 100; Write k2 from DC2, start_ts (allocated from local allocator 2) = 50, calculated commit_ts = 101 (max_ts + 1)

Querying data in a different DC should be a global transaction.

sticnarf commented 3 years ago

Read k1 from DC1, start_ts (allocated from local allocator 1) = 100, then the max_ts is updated to 100; Write k2 from DC2, start_ts (allocated from local allocator 2) = 50, calculated commit_ts = 101 (max_ts + 1)

Querying data in a different DC should be a global transaction.

Oh, I think I misunderstood it before. So the problem described in the issue description is not valid.

The actual problem, I think, is the ordering between global and local TSOs. If we allow async commit for global transactions and the global TSO runs ahead, a global TS pushes max_ts forward, and then a committed transaction may be invisible to a local reader whose start_ts comes from a slower local allocator.
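
A concrete instance of this ordering issue, with made-up numbers (assume the global TSO is well ahead of DC2's local TSO):

fn main() {
    let mut max_ts: u64 = 0;

    // A global transaction reads with global start_ts = 200, which is
    // ahead of DC2's local TSO (currently around 60); it pushes max_ts.
    max_ts = max_ts.max(200);

    // The global transaction async-commits with commit_ts = max_ts + 1.
    let commit_ts = max_ts + 1; // 201

    // A local reader in DC2 then starts with local start_ts = 60 and
    // cannot see the already-committed write.
    let local_start_ts: u64 = 60;
    assert!(local_start_ts < commit_ts);
}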

MyonKeminta commented 3 years ago

According to @disksing, the DC that data belongs to is usually defined table by table, and it's not guaranteed that a region will never span multiple tables. As I understood it, a local timestamp is preferred whenever it's sufficient to guarantee consistency, no matter where the data is actually located šŸ¤” @disksing Can you clarify this for us?

disksing commented 3 years ago

Consider the scenario where the configuration is updated to change the DC that some data belongs to. The data needs some time to be scheduled before it can actually be transferred to another DC, and before the scheduling completes, the binding between the data and the DC has already switched, so there will be a window where the data belongs to DC1 but its actual location is in DC2.

So a TiKV node (or even a region) may contain keys that belong to different DCs. Instead of maintaining a single max_ts, we need to maintain a max_ts for each DC.

sticnarf commented 3 years ago

The data needs some time to be scheduled before it can actually be transferred to another DC, and before the scheduling completes, the binding between the data and the DC has already switched, so there will be a window where the data belongs to DC1 but its actual location is in DC2.

Can we consider the DC binding not switched until the data transfer finishes? If there is indeed some data in the remote DC, I think it's reasonable to still use a global transaction.

nrc commented 3 years ago

IMO, we should only support async commit + local txns in 5.0 and add async commit + global later. These two features are already large and significant, and I don't think we should add more risk to the 5.0 release.

cfzjywxk commented 3 years ago

IMO, we should only support async commit + local txns in 5.0 and add async commit + global later. These two features are already large and significant, and I don't think we should add more risk to the 5.0 release.

I agree. There will be many details to consider, and maintaining max_ts at the DC level doesn't seem that easy.

disksing commented 3 years ago

I think switching the data binding ahead of time is better because global transactions are heavy and slow. We can benefit from local transactions when reading/writing data that has already been transferred.

nolouch commented 3 years ago

The data needs some time to be scheduled before it can actually be transferred to another DC, and before the scheduling completes, the binding between the data and the DC has already switched, so there will be a window where the data belongs to DC1 but its actual location is in DC2.

Can we consider the DC binding not switched until the data transfer finishes? If there is indeed some data in the remote DC, I think it's reasonable to still use a global transaction.

If max_ts were maintained at the region level, we would only need to split the cross-DC regions before switching. But right now max_ts is at the store level, which means we need to transfer the data to the corresponding DC (a heavy operator) before switching.

Since Local/Global transactions already basically work (https://github.com/pingcap/tidb/issues/20448), we need a solution to this problem. Will max_ts be supported at the region level?

nrc commented 3 years ago

I think the solution for now is that async commit is only used for local transactions and we use 2PC for global transactions. Does that solve the data transfer issue?

djshow832 commented 3 years ago

IMHO, there are some possible solutions:

  1. Local transactions don't use async commit, but global transactions do. It's practical to maintain only one max_ts in each store because a global TSO is comparable with a local TSO (see the sketch after this list).
  2. Global transactions don't use async commit, but local transactions do. It's impractical because each store may still have more than one max_ts to maintain, each corresponding to one DC. Besides, global transactions are the common case, because the local/global transaction feature is rarely enabled.
  3. Forbid users from altering placement policies online when the async-commit feature is enabled. This is also impractical because it's very hard to confirm that no data is being transferred before async commit is enabled.

So I choose the first way.
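
A minimal sketch of option 1's dispatch rule (the names are hypothetical; the point is that only globally-timestamped transactions take the async commit path, so a single max_ts suffices):

// Hypothetical sketch of option 1: only global transactions async-commit.
enum TsSource {
    Global,
    Local { dc: usize },
}

fn use_async_commit(source: &TsSource) -> bool {
    // Local transactions fall back to ordinary 2PC, so no per-DC
    // max_ts bookkeeping is needed for commit_ts calculation.
    matches!(source, TsSource::Global)
}

fn main() {
    assert!(use_async_commit(&TsSource::Global));
    assert!(!use_async_commit(&TsSource::Local { dc: 1 }));
}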

MyonKeminta commented 3 years ago

Is there any problem with the idea mentioned at the beginning of this issue, which makes async commit work for both local and global transactions?

nrc commented 3 years ago

Is there any problem with the idea mentioned at the beginning of this issue, which makes async commit work for both local and global transactions?

Complexity. Async commit is already way too complex for my liking, and we can't add more at this stage of the project; we need to leave it to a follow-up.