pmem / pmemkv

Key/Value Datastore for Persistent Memory
https://pmem.io

FEAT: transactions #810

Open igchor opened 4 years ago

igchor commented 4 years ago

FEAT: transactions

Rationale

This is an alternative solution to batch_put (https://github.com/pmem/pmemkv/issues/708), more similar to what RocksDB offers.

Description

The idea is to expose a transaction class which can be used to group puts (and maybe also removes and updates?) into one atomic action.

API Changes

C++


/* For pmemobj transactions we can only have one tx per
 * thread, but having this as an explicit structure
 * (instead of relying on TLS and a closure-like tx) is a
 * more generic solution imho. Also, this is quite similar
 * to the RocksDB API.
 */
class transaction { // or class batch
    status put(string_view key, string_view value);

    /* (Optional? what about isolation?) */
    status remove(string_view key);

    status commit();
    void abort();
};

class db {
    transaction begin_transaction(void* options); // TODO: should options be struct or config?
};
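To make the intended semantics concrete, here is a minimal volatile-memory sketch of how the proposed class could behave: puts are buffered in the transaction and only become visible in the db after commit(). All of the types below (the enum, the std::map backing store) are illustrative stand-ins, not the real pmemkv implementation.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for pmem::kv::status.
enum class status { OK, NOT_FOUND };

class db;

// Buffers puts until commit(), matching the semantics proposed above.
class transaction {
public:
    explicit transaction(db &database) : db_(database) {}
    status put(const std::string &key, const std::string &value)
    {
        redo_log_.emplace_back(key, value);
        return status::OK;
    }
    status commit();
    void abort() { redo_log_.clear(); }

private:
    db &db_;
    std::vector<std::pair<std::string, std::string>> redo_log_;
};

class db {
public:
    transaction begin_transaction() { return transaction(*this); }
    status get(const std::string &key, std::string *value)
    {
        auto it = map_.find(key);
        if (it == map_.end())
            return status::NOT_FOUND;
        *value = it->second;
        return status::OK;
    }
    std::map<std::string, std::string> map_;
};

status transaction::commit()
{
    // In the real engine this would run inside a pmemobj transaction.
    for (auto &e : redo_log_)
        db_.map_[e.first] = e.second;
    redo_log_.clear();
    return status::OK;
}
```

Note that a get() on the db before commit() does not observe the buffered puts, which is the visibility question raised in the implementation discussion.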

C

pmemkv_tx* pmemkv_tx_begin(pmemkv_db *db, void *options);
status pmemkv_tx_put(pmemkv_tx* tx, const char* k, size_t kb, const char* v, size_t vb);
status pmemkv_tx_remove(pmemkv_tx *tx, const char *k, size_t kb); // OPTIONAL
status pmemkv_tx_commit(pmemkv_tx *tx);
void pmemkv_tx_abort(pmemkv_tx *tx);

Implementation details


/* There are several ways we can implement put for transactions.
 * We can perform in-place updates using undo logs - this is pretty
 * simple for single-threaded engines (for concurrent ones we
 * would need to delay modifications until commit - only once we have
 * all the locks can we start the obj transaction). However, this would mean
 * that all non-transactional operations (e.g. from the same thread) would
 * see the non-committed state (which is inconsistent with RocksDB) unless we ban such operations...
 * Moreover, in this approach allocations for non-existent elements are done in a
 * separate transaction (since we cannot call cmap.insert() in a tx), which
 * carries a performance penalty.
 *
 * An alternative solution would be to start an obj tx in the pmem::kv::transaction constructor.
 * On put, we can allocate the appropriate node type and put it in a persistent_tls
 * (which will act as a redo log). On pmem::kv::transaction::commit() we first commit
 * the changes to the redo log and finish the transaction. After that, we iterate over
 * the redo log and insert the elements into the map (we could also acquire all the locks
 * first, but since we do not have snapshot isolation for reads I believe this is not
 * necessary) using something like insert(node_type&&), which will just replace the pointers.
 *
 * For certain engines (e.g. csmap) we cannot just allocate nodes and replace the existing
 * ones in the data structure (since we do not have garbage collection). However, we can still
 * allocate the value (pmem::obj::string) and just call insert_or_assign(std::move(key), std::move(value));
 *
 * This approach means more allocations (even if the node already exists we will
 * allocate anyway), but all allocations happen in the same tx, so the cost should be
 * acceptable.
 */
transaction::transaction() {
    this->obj_tx = new pmem::obj::transaction::manual();

    /* allocate a per-transaction redo log */
    persistent_tls[tx.id] = make_persistent();
}

status transaction::put(string_view key, string_view value) {
    persistent_tls[tx.id].push_back(make_persistent<map::node_type>(key, value));
}

status transaction::commit() {
    pmem::obj::transaction::commit();
    delete this->obj_tx;

    // Optional? We do not need full isolation since we do not have range reads nor
    // snapshot isolation.
    // std::vector<lock> locks;
    // for (auto &e : persistent_tls[tx.id]) {
    //     auto it = map.find(e.key);
    //     if (it != map.end())
    //         locks.push_back(it->second.lock);
    // }

    for (auto &e : persistent_tls[tx.id]) {
        auto ret = map.insert(std::move(e));
        // Optional?
        // if (ret.second) // element inserted
        //   locks.push_back(ret.first->second.lock)
    }

    pmem::obj::transaction::run(pop, [&]{
        delete_persistent(persistent_tls[tx.id]);
    });
}
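A volatile-memory sketch of the commit sequence described above (buffer entries in a per-transaction redo log, make the log durable, then splice entries into the live map, then free the log). The names here are illustrative, and the pmemobj persistence steps are reduced to comments since they cannot run outside persistent memory.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative per-transaction redo log; in the real design this would live
// in a persistent_tls and hold persistent node allocations.
struct redo_log {
    std::vector<std::pair<std::string, std::string>> entries;
};

// put(): allocate the entry up front and append it to the redo log only,
// so non-transactional readers of the map never see uncommitted state.
inline void tx_put(redo_log &log, std::string key, std::string value)
{
    log.entries.emplace_back(std::move(key), std::move(value));
}

// commit(): first the redo log itself would be made durable (pmemobj tx
// commit); only then are the entries moved into the live map, and finally
// the log would be freed in a separate small transaction.
inline void tx_commit(redo_log &log, std::map<std::string, std::string> &map)
{
    // 1. pmem::obj::transaction::commit();  // redo log now durable
    for (auto &e : log.entries)
        map.insert_or_assign(std::move(e.first), std::move(e.second));
    // 2. pmem::obj::transaction::run(pop, [&]{ delete_persistent(log); });
    log.entries.clear();
}
```

The key ordering property is that the redo log is committed before any entry reaches the map, so a crash mid-apply can be recovered by replaying the durable log.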
karczex commented 4 years ago

I suggest expanding this API with transactional operations inside a lambda expression, plus a get() of a not-yet-committed key:

...
auto *kv = new pmem::kv::db();
kv->open("engine", std::move(cfg));
auto s = kv->transaction([&](pmem::kv::transaction &t) {
        t.put(key, value); // commit when the lambda expression ends
        kv->get(key);      // returns the value as seen outside the transaction
        t.get(key);        // returns the value submitted by the transactional put;
                           // very useful if you update part of a value and want
                           // to abort on some condition
    }); // commit
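A runnable sketch of this closure-style API, assuming buffered writes as in the discussion above: `t.get()` sees the uncommitted put, `kv.get()` does not until the lambda returns. The class and method names are hypothetical stand-ins, not the real pmemkv API.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>

// Illustrative closure-style transaction: puts are buffered in the tx and
// committed to the db when the user's lambda returns.
class closure_db {
public:
    class tx {
    public:
        explicit tx(closure_db &database) : db_(database) {}
        void put(const std::string &k, const std::string &v) { buf_[k] = v; }
        // Reads the uncommitted value first, falling back to committed state.
        std::optional<std::string> get(const std::string &k) const
        {
            auto it = buf_.find(k);
            if (it != buf_.end())
                return it->second;
            return db_.get(k);
        }

    private:
        friend class closure_db;
        closure_db &db_;
        std::map<std::string, std::string> buf_;
    };

    std::optional<std::string> get(const std::string &k) const
    {
        auto it = map_.find(k);
        if (it == map_.end())
            return std::nullopt;
        return it->second;
    }

    // Commit happens when the user's lambda returns.
    void transaction(const std::function<void(tx &)> &fn)
    {
        tx t(*this);
        fn(t);
        for (auto &e : t.buf_)
            map_[e.first] = e.second;
    }

private:
    std::map<std::string, std::string> map_;
};
```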
igchor commented 4 years ago

@karczex one thing to consider for closure transactions is nesting. For the current engines I think we should just allow a single tx per thread, but we can relax this in the future.

When we have an explicit transaction class I would like to be able to do something like this:

auto tx1 = db.begin_transaction();
tx1.put("k1", "v1");

auto tx2 = db.begin_transaction();
tx2.put("k2", "v2");

tx2.commit();

// tx1 is not committed yet, "k1" is not visible

tx1.commit();

For closure-like tx we could either do the same thing (i.e. the inner tx is independent) or flatten the transactions like in C++ (so that the commit is actually delayed until the outer tx commits). I would prefer the first approach, but it would not be consistent with libpmemobj-cpp...

igchor commented 4 years ago

As discussed in the meeting, we can simply ban nested transactions in the lambda case. This is probably the simplest approach.

pbalcer commented 4 years ago

What you should think about is read or read-for-write type operations for transactions. If those aren't possible for technical reasons, I'm not sure there's much benefit to a transactional API over a simple batch put.

igchor commented 4 years ago

@pbalcer I was thinking about exposing read/read_for_update via iterators (see read_range in https://github.com/pmem/pmemkv/issues/809). So, there would be no explicit get() method on a transaction, but you could mix puts and updates (via iterators) in a single tx. Moreover, with a transactional API we can also mix puts and removes (we don't have to implement that now, but we can extend it in the future).

Anyway, I think batch_put (in the C API at least) would actually look pretty similar. We would have to expose some batch structure (instead of a transaction structure) and allow users to call put on it.