pmem / pmdk

Persistent Memory Development Kit
https://pmem.io

FEAT: cross pool transactions #4198

Closed. seghcder closed this issue 1 year ago.

seghcder commented 5 years ago

Scenario: An application manages multiple discrete pools for performance or NUMA reasons.

Issue: At the moment the library does not support commit/rollback where data is being updated in two or more pools, e.g. via nested transactions. Also, per pmem/pmdk#4123, the transactions may not all be committing in the same pool.

The workaround is for the application to manage some form of two-phase commit across two or more threads (see also pmem/pmdk#4127).

krzycz commented 5 years ago

This was discussed a long time ago. The key problem is guaranteeing transaction atomicity across multiple pools. I mean, if the transaction is torn by a power failure, how do we recover from a transaction that has been committed in one pool but has not been fully committed in the second one? The old data in the undo/redo logs of the first pool could already have been overwritten, so there is no way to roll back. If we decide to postpone undo/redo log cleanup until the transaction is committed on both sides, then effectively those two pools become one logical pool and must always be opened together, since potential recovery in one pool may affect the second one and vice versa.

The problem described in pmem/pmdk#4127 is different. This is when the application (or the library this app links to) is using more than one pool and runs two independent transactions on different pools simultaneously. If one TX commits, but the second one does not, it does not lead to any serious problem.

seghcder commented 5 years ago

My reference to pmem/pmdk#4127 was that, as I understand it, it wouldn't be possible for an application to do a two-phase commit in the same thread while issue pmem/pmdk#4127 exists. We'd need to coordinate between two separate threads. Or is the issue at the process level?

If the case is closed on library support for cross pool transactions, then that's also ok :-)

pbalcer commented 5 years ago

@seghcder Yes, to properly implement this functionality, we'd need a way to do a group transaction commit. And now that I think about how the API would look, we would also need pmem/pmdk#4127, since there needs to be a way to actually run a second transaction before we do the group commit. We would also need a handle to the transaction (struct pobj_tx_ctx *) that we could provide to the group commit function.

@krzycz Right, but let's look at an example:

/* sketch of a hypothetical cross-pool TX_BEGIN taking two pools */
struct root *rootA = ...;
TX_BEGIN(popA, popB) {
  /* snapshot the destination field in popA, allocate the object in popB */
  pmemobj_tx_add_range_direct(popA, &rootA->oid, 16);
  rootA->oid = pmemobj_tx_alloc(popB, ...);
} TX_END

We just created a one-way reference. To make this work, we would need to synchronize the on_commit phases of those two pools, and only commit if we are 100% sure that both pools will proceed with the commit, just like you say. We would need to forbid cross-pool snapshots though... each pool needs to be responsible for its own logs. But those two pools CAN be used standalone. In the case of popB, there's no issue. In the case of popA, we will get NULL from pmemobj_direct, so the application can account for the fact that a certain amount of data is currently not "connected".
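
For illustration, a minimal sketch of that behaviour using the C API; the rootA_layout struct, its field and the function are hypothetical:

    /* Sketch only: rootA_layout and its `other` field are hypothetical. */
    #include <libpmemobj.h>

    struct rootA_layout {
        PMEMoid other; /* OID that may point into a different pool (popB) */
    };

    void visit(PMEMobjpool *popA)
    {
        PMEMoid root = pmemobj_root(popA, sizeof(struct rootA_layout));
        struct rootA_layout *r =
            (struct rootA_layout *)pmemobj_direct(root);

        /* pmemobj_direct() resolves an OID only if the pool that owns it
         * is currently open; otherwise it returns NULL. */
        if (pmemobj_direct(r->other) == NULL) {
            /* popB is not open (or `other` is OID_NULL): this part of the
             * data is currently "not connected". */
        }
    }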

A better question would be: why? The API would look bad... The only reason I can think of right now is having separate pools for NUMA nodes, so that the application can optimally manage and allocate memory on a multi-socket platform. But that could be solved much more easily at the allocator level by just adding a hint to palloc to allocate from a certain address range...

Anything else?

seghcder commented 5 years ago

Just adding a few more potential use cases -

marcinslusarz commented 5 years ago

  1. Can you explain how you measured this?

  2. and 3. look more like a request for dynamic/advanced space management in obj...

  3. You can use allocation classes to solve some problems related to fragmentation.

seghcder commented 4 years ago

Adding a cross-reference to https://github.com/pmem/libpmemobj-cpp/issues/752, re having to limit functionality due to the lack of support for cross-pool commits.

@marcinslusarz I've found having multiple separate pools and striping objects across them gives higher multi-threaded commit throughput than having a single large pool. However, it may also be related to how I'm splitting work across the threads and pools. Will try to come up with a demo case.

Another use case for multi-pool systems is an Entity-Component-System (ECS) pattern using a separate pool(set) for each Component type. This means one can update one Component implementation without affecting the others.
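
As a rough illustration of that pattern (the component types, layout strings and paths below are made up), each component type lives in its own pool with its own layout, so a layout change only forces a migration of that one pool:

    // Sketch only: component types, layout names and paths are hypothetical.
    #include <libpmemobj++/p.hpp>
    #include <libpmemobj++/pool.hpp>

    struct position_component {          // component type 1
        pmem::obj::p<double> x, y, z;
    };

    struct velocity_component {          // component type 2
        pmem::obj::p<double> dx, dy, dz;
    };

    struct position_root { /* e.g. persistent array of position_component */ };
    struct velocity_root { /* e.g. persistent array of velocity_component */ };

    int main()
    {
        // One pool (or poolset) per component type, each with its own layout.
        auto positions = pmem::obj::pool<position_root>::open(
            "/mnt/pmem0/positions.pool", "ecs_position_v1");
        auto velocities = pmem::obj::pool<velocity_root>::open(
            "/mnt/pmem1/velocities.pool", "ecs_velocity_v1");

        // Changing velocity_component later only requires transforming and
        // reloading the velocities pool; the positions pool is untouched.

        positions.close();
        velocities.close();
        return 0;
    }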

Also, for NUMA-aware apps that need to manage their own pools, cross pool transactions are useful.

It does seem like some work was done on this a while back in https://github.com/wojtuss/pmdk/commit/da573040d8551e06b663b5cfbced24001e0e99dc, but it was never merged.

marcinslusarz commented 4 years ago

If I understand correctly, your performance improvement comes from parallelizing transaction commit and this means that cross-pool transactions wouldn't help you. Parallelizing transaction commit is a feature on its own - I think you should create a new feature request and describe your use case.

Again, ECS needs should be fulfilled by allocation classes (already available).

NUMA is something that needs to be investigated.

wojtuss@da57304 was about having multiple completely independent transactions, not about cross-pool transactions.

seghcder commented 4 years ago

I've created a multipool test utility to test some of my assumptions/issues re multipool performance.

Some initial results are here: https://github.com/axomem/multipool-test/issues/5. Running write transactions in 48 threads is ~4x faster when splitting into a pool file per thread, versus having all threads write into the same pool file on the same socket/region.

The cross-pool issue comes when eventually you need to do something like:

  1. Create an object in Pool A
  2. Add it to a pmem::obj::concurrent_hash_map in Pool B

If the map insert fails, you need to roll back A. If you lose power between steps 1 and 2, you have a leak. Ideally, these two steps would be wrapped in a single transaction.
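
A minimal libpmemobj-cpp sketch of those two steps and where they can break; the record struct, the root layouts and the key handling here are made up for the example:

    // Sketch only: `record`, the root layouts and the key handling are
    // hypothetical; error/exception handling is omitted.
    #include <libpmemobj++/container/concurrent_hash_map.hpp>
    #include <libpmemobj++/make_persistent.hpp>
    #include <libpmemobj++/p.hpp>
    #include <libpmemobj++/persistent_ptr.hpp>
    #include <libpmemobj++/pool.hpp>
    #include <libpmemobj++/transaction.hpp>

    using namespace pmem::obj;

    struct record {
        p<int> value;
    };

    struct root_a {};   // Pool A just owns the record objects themselves.

    using map_t = concurrent_hash_map<int, persistent_ptr<record>>;
    struct root_b {
        persistent_ptr<map_t> index; // assumed allocated at pool creation and
                                     // runtime_initialize()d after each open
    };

    void create_and_index(pool<root_a> &popA, pool<root_b> &popB, int key)
    {
        persistent_ptr<record> obj;

        // Step 1: allocate the object in Pool A, in Pool A's own transaction.
        transaction::run(popA, [&] {
            obj = make_persistent<record>();
            obj->value = key;
        });

        // A power failure here leaks `obj`: it is committed in Pool A but not
        // yet reachable from Pool B, and no single transaction spans both.

        // Step 2: publish it in Pool B's map (the container handles its own
        // on-media consistency, outside of any enclosing transaction).
        if (!popB.root()->index->insert(map_t::value_type(key, obj))) {
            // Insert failed (e.g. duplicate key): roll back Step 1 by hand.
            transaction::run(popA, [&] { delete_persistent<record>(obj); });
        }
    }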

Re ECS/Components - I discuss this in more detail here: https://github.com/axomem/nucleus/issues/22. At the moment we're using libpmemobj-c++, including several of the containers. If I change a single class implementation, it generally invalidates the layout (with the exception mentioned by @pbalcer). If the data sits in 4 x 1.5 TB pools, this means I need a way to dump and reload 6 TB of data. If we use a pool (layout) per Component, then only that Component needs a transform/reload.

If there is a way to use allocation classes with libpmemobj-c++ to avoid this issue, perhaps you could add some thoughts to https://github.com/axomem/nucleus/issues/22?

Re https://github.com/wojtuss/pmdk/commit/da57304 - I saw the description

This lets nesting transactions on different pools.

... and assumed this is something like what we are looking for.

pbalcer commented 4 years ago

It's interesting that you observe higher performance using multiple pools. What's the size of the allocations? The one reason I can think of for the behavior you observed is that very large allocations are serialized for fragmentation reasons. You could, however, use allocation classes with a large number of units per block to achieve a similar result.
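
For reference, a rough sketch of that suggestion using the heap.alloc_class ctl namespace from the C API; the unit size, units per block and type number below are illustrative, not tuned values:

    /* Sketch only: sizes and the type number are illustrative. */
    #include <libpmemobj.h>

    int register_class_and_alloc(PMEMobjpool *pop)
    {
        /* Register an allocation class with many units per block, so that
         * allocations of this size class are served from large runs. */
        struct pobj_alloc_class_desc desc;
        desc.unit_size = 128;
        desc.alignment = 0;
        desc.units_per_block = 4096;
        desc.header_type = POBJ_HEADER_COMPACT;
        desc.class_id = 0; /* chosen by the library on success */

        if (pmemobj_ctl_set(pop, "heap.alloc_class.new.desc", &desc) != 0)
            return -1;

        /* Allocate from that class explicitly, inside a transaction. */
        int ret = 0;
        TX_BEGIN(pop) {
            PMEMoid oid = pmemobj_tx_xalloc(desc.unit_size, 0 /* type_num */,
                                            POBJ_CLASS_ID(desc.class_id));
            (void)oid;
        } TX_ONABORT {
            ret = -1;
        } TX_END

        return ret;
    }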

The transactional context feature would allow you to have multiple parallel transactions in the same thread, including nested (but not related) transactions on separate pools.

I agree with you that this would be a useful functionality. But I'm not sure if it wouldn't be better to solve those performance problems (NUMA, allocation scaling) within a single pool rather than optimizing what is effectively a workaround for those problems.

seghcder commented 4 years ago

The arrays are 2.5 million items of a struct with a char member; however, during writing (where there is the biggest difference), only a single char is written in each transaction...

        // one transaction per array element, each writing a single char
        for (auto i = 0; i < array_size; i++) {
            transaction::run(pop, [&] {
                my_array[i].c = 'a';
            });
        }

Code is here.

If we get similar transaction throughput in a single pool vs multiple pools, I agree it would be simpler to have a single pool spanning sockets. One additional plus for multiple pools is that I could move the data from a 2-socket server to a 4-socket server and just rearrange the pool files (assuming I started with 16, for example). Using NUMA-aware addressing would mean you'd now have a bunch of objects starting from socket 0 and 2 addresses but none (or overlaps) at 1 and 3. A downside, though, is that space management becomes a bigger issue.

The other (non-performance) use case for separate pools is the layout invalidation issue (https://github.com/axomem/nucleus/issues/22). I think this is more of a libpmemobj++ issue though? This is probably a bigger topic to resolve (for me :-) ) vs NUMA/performance, i.e. performance is already extremely good... I don't think we have a clear way forward yet on "layout updates" though.

janekmi commented 1 year ago

If you still consider this question important, please reopen the issue and provide more context for your request so we can reassess its priority.