mongodb-partners / mongo-rocks

MongoDB storage integration layer for the Rocks storage engine

Support MongoDB 4.2 #155

Closed wolfkdy closed 3 years ago

wolfkdy commented 4 years ago

1) Add the SetPrepareTimestamp and Prepare APIs to totdb.
1.1) Prepared writes are globally visible (in the SI sense) to all txns started after the prepared txn. The code below should describe the rule correctly:

struct Txn {
    uint64_t start_snapshot;
    uint64_t prep_ts;
    uint64_t read_ts;
    uint64_t commit_ts;
} a, b;

a is visible to b iff:
b.start_snapshot > a.start_snapshot and
b.read_ts > a.prep_ts
In particular, when the rule above holds but a has not yet committed, b should wait and return a PREPARE_CONFLICT error, because b cannot yet tell whether its read_ts is greater than a's commit_ts.
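
The rule above can be sketched as a small C++ check. This is only an illustration under stated assumptions: the `committed` flag, the `ReadResult` enum, and the `CheckVisibility` helper are hypothetical names that do not come from totdb, and the strict `>` against `commit_ts` simply follows the wording of the comment.

```cpp
#include <cstdint>

// Same fields as the Txn struct in the comment. The `committed` flag is an
// assumption of this sketch: it separates "prepared" from "committed".
struct Txn {
    uint64_t start_snapshot;
    uint64_t prep_ts;
    uint64_t read_ts;
    uint64_t commit_ts;
    bool committed;
};

// Hypothetical outcome of reader b looking at a write made by txn a.
enum class ReadResult { kInvisible, kPrepareConflict, kVisible };

// Hypothetical helper: applies the visibility rule for a prepared txn a
// as seen by a reading txn b.
inline ReadResult CheckVisibility(const Txn& a, const Txn& b) {
    const bool rule_holds =
        b.start_snapshot > a.start_snapshot && b.read_ts > a.prep_ts;
    if (!rule_holds) {
        return ReadResult::kInvisible;  // b never needs to look at a's writes
    }
    if (!a.committed) {
        // commit_ts is unknown, so b cannot decide yet and must wait/retry.
        return ReadResult::kPrepareConflict;
    }
    // Assumption: strict comparison, following the wording of the comment.
    return b.read_ts > a.commit_ts ? ReadResult::kVisible
                                   : ReadResult::kInvisible;
}
```

With a prepared at timestamp 100 and b reading at 150, the helper reports a prepare conflict until a commits. After that the answer depends only on whether a's commit_ts is below 150.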

2) Prepare does not guarantee durability of data.
2.1) This is because MongoDB writes the prepared data into the oplog in a separate transaction.
2.2) After 2.1, mongo negotiates max(prepareTs) as the commitTs, which mainly implements the algorithm described in the Clock-SI paper.
2.3) Commit the prepared data with the timestamp negotiated in 2.2.
2.4) Write a commit tag into the oplog to finish the process.
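
Step 2.2 can be illustrated with a few lines of C++. This is a minimal sketch, assuming each participant reports one prepare timestamp. `NegotiateCommitTs` is a hypothetical helper, not a MongoDB or totdb function.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: choose the commit timestamp as the maximum of the
// prepare timestamps reported by the participants (step 2.2).
// Returns 0 when no participant reported a timestamp (an assumption of
// this sketch; a real coordinator would treat that as an error).
inline uint64_t NegotiateCommitTs(const std::vector<uint64_t>& prepare_ts) {
    if (prepare_ts.empty()) {
        return 0;
    }
    return *std::max_element(prepare_ts.begin(), prepare_ts.end());
}
```

Taking the maximum guarantees that the commit timestamp is not earlier than any participant's prepare timestamp, so every participant can commit at it in step 2.3.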

wolfkdy commented 4 years ago

To keep things simple, I'd like the prepared data to enter the LSM tree only when the transaction commits. This is the so-called "write-committed" strategy.

wolfkdy commented 4 years ago

PrepareMap

When a txn is prepared, its data should be referenced in a global map ordered by key, which we call prepareMap, so that other transactions can see the data of all prepared transactions. An individual transaction can see data from prepareMap, from its own writeBatch, or from the lsmTree.
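
One way to picture prepareMap is a mutex-protected `std::map`. The sketch below is an assumption-laden illustration, not the actual implementation: the `PreparedEntry` fields, the method names, and the simplification of at most one prepared writer per key are all hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Hypothetical record kept for each prepared key.
struct PreparedEntry {
    uint64_t txn_id;
    uint64_t prep_ts;
    std::string value;
};

// Sketch of a global, key-ordered map of prepared writes.
// Assumption: at most one prepared txn holds a given key at a time.
class PrepareMap {
public:
    // Called when a txn is prepared: publish its write for other readers.
    void Add(const std::string& key, const PreparedEntry& entry) {
        std::lock_guard<std::mutex> lock(mu_);
        entries_[key] = entry;
    }

    // Called on commit or rollback: the key no longer has a pending write.
    void Remove(const std::string& key) {
        std::lock_guard<std::mutex> lock(mu_);
        entries_.erase(key);
    }

    // Returns true and fills *out if some prepared txn has written `key`.
    bool Lookup(const std::string& key, PreparedEntry* out) const {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = entries_.find(key);
        if (it == entries_.end()) {
            return false;
        }
        *out = it->second;
        return true;
    }

private:
    mutable std::mutex mu_;
    std::map<std::string, PreparedEntry> entries_;  // ordered by key
};
```

Because `std::map` keeps its keys sorted, such a map can be scanned in key order alongside the LSM tree, which is what a merging iterator needs.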

PrepareMergingIterator

PrepareMergingIterator merges WriteBatchWithIndexIterator (itself a merge of the lsmTree and the transaction's writeBatch) with PrepareMapIterator. If the same key exists in both WriteBatchWithIndex and PrepareMap, we should return both values to the upper-level PrepareFilterIterator, which decides which one to choose.

PrepareFilterIterator
  |
  --PrepareMergingIterator
       |
       -- PrepareMapIterator
       |
       -- BaseIterator
            |
            -- WriteBatchWithIndexIterator
            |
            -- Normal LsmTree Iterator
                 |
                 -- ......
//
// 1) `PrepareFilterIterator` checks the prepare status of an input entry and
// decides whether to return it, wait, or advance to the next record.
// 2) `PrepareMergingIterator` arranges PrepareMapIterator and BaseIterator into
// a total order. If PrepareMapIterator and BaseIterator have the same key, both
// values are returned via `ShadowValue`. The same key cannot come from my own
// transaction (or WriteBatchWithIndexIterator), because the only operations
// allowed after `prepare` are rollback and commit.
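
The merge step described in 2) can be sketched as a two-way merge over sorted inputs. This is a simplified illustration: it materializes the result eagerly instead of exposing a lazy iterator, it uses `std::map` as a stand-in for both child iterators, and `MergedEntry` and `MergePrepared` are hypothetical names (`MergedEntry` plays the role of `ShadowValue`).

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for ShadowValue: carries the base value, the
// prepared value, or both when the two sources share a key.
struct MergedEntry {
    std::string key;
    bool has_base = false;
    std::string base_value;
    bool has_prepared = false;
    std::string prepared_value;
};

// Merge two key-ordered sources into one total order. On equal keys, both
// values are kept so that an upper-level filter can choose between them.
inline std::vector<MergedEntry> MergePrepared(
    const std::map<std::string, std::string>& base,
    const std::map<std::string, std::string>& prepared) {
    std::vector<MergedEntry> out;
    auto b = base.begin();
    auto p = prepared.begin();
    while (b != base.end() || p != prepared.end()) {
        // Take from a side when the other is exhausted or its key is not larger.
        const bool take_base =
            p == prepared.end() || (b != base.end() && b->first <= p->first);
        const bool take_prepared =
            b == base.end() || (p != prepared.end() && p->first <= b->first);
        MergedEntry e;
        if (take_base) {
            e.key = b->first;
            e.has_base = true;
            e.base_value = b->second;
        }
        if (take_prepared) {
            e.key = p->first;
            e.has_prepared = true;
            e.prepared_value = p->second;
        }
        if (take_base) ++b;
        if (take_prepared) ++p;
        out.push_back(e);
    }
    return out;
}
```

Entries with both flags set are the ones a filter layer has to resolve, for example by applying the visibility rule to the prepared value's timestamps.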

wolfkdy commented 4 years ago

The interfaces for Clock-SI are ready: https://github.com/wolfkdy/rocksdb/commit/3a12c627e5dc9081827c4e7fecbaf0f5673d802c. They are similar to WiredTiger's prepare API.

wolfkdy commented 4 years ago

https://github.com/mongodb-partners/mongo-rocks/tree/v4.2.5_rc1 The core_txns and concurrency_replication_multi_statement_txn suites pass.

Tsunaou commented 3 years ago
1) Add the SetPrepareTimestamp and Prepare APIs to totdb.
1.1) Prepared writes are globally visible (in the SI sense) to all txns started after the prepared txn. The code below should describe the rule correctly:

struct Txn {
    uint64_t start_snapshot;
    uint64_t prep_ts;
    uint64_t read_ts;
    uint64_t commit_ts;
} a, b;

a is visible to b iff:
b.start_snapshot > a.start_snapshot and
b.read_ts > a.prep_ts
In particular, when the rule above holds but a has not yet committed, b should wait and return a PREPARE_CONFLICT error, because b cannot yet tell whether its read_ts is greater than a's commit_ts.

2) Prepare does not guarantee durability of data.
2.1) This is because MongoDB writes the prepared data into the oplog in a separate transaction.
2.2) After 2.1, mongo negotiates max(prepareTs) as the commitTs, which mainly implements the algorithm described in the Clock-SI paper.
2.3) Commit the prepared data with the timestamp negotiated in 2.2.
2.4) Write a commit tag into the oplog to finish the process.

Based on this issue, is it correct to say that MongoDB 4.2's cross-shard multi-document transactions rely on the storage engine (RocksDB or WiredTiger) providing the capabilities needed to implement Clock-SI?

wolfkdy commented 3 years ago

@Tsunaou Yes, you are right. There is a Chinese blog post describing the details: https://mongoing.com/archives/38461

wolfkdy commented 3 years ago

Done with the release of v4.2.5_rc2. Further findings or fixes will be tracked in new issues.