Is there scan (iterator) support?

thompsonbry commented 2 days ago

Looking at the API, I do not see support for ordered traversal? Am I missing something?

laurynas-biveinis commented 1 day ago

It's on my TODO. For the single-threaded version, it's easy to implement. For the concurrent version, what semantics do you expect in the case of iterated node jumping around the tree?

The version of this data structure with STL-like API supports iterators: https://github.com/justinasvd/art_map

thompsonbry commented 1 day ago

My use case is ART providing a secondary index in a hybrid columnar database which uses a variant of fast mvcc (per a different Neumann publication) internally to have a coherent view of tuples stored in a heap. As a secondary index, it is not responsible for transactional isolation. That occurs when we indirect to the tuples and look at the version information (if any) on the tuple and decide visibility for the transaction.

In this scenario, the secondary index merely needs to provide read after write. Any changes by another transaction would not be visible when we look at the version metadata on the tuples. If a transaction is using multiple threads to apply updates, then there is clearly a danger to observing only partial information, but I think that this question needs to be resolved by appropriate barriers at a different level in a database system.

So to me the question comes down to how ordered traversal is being achieved. There I am not yet familiar enough with ART or with your implementation to know the answer, but I am happy to discuss what might be required to achieve a suitable traversal semantics.

As I understand it, you are using what I would call an epoch drive garbage collector (QSBR). So if there is a node replacement, then the old data remains visible to other threads until they exit the epoch, correct? So one case (structural modification) is handled by that, which is that a thread can still see the older state of the world during its traversal. So I am not sure how "the node jumping around the tree" causes an issue?

I lack clarity about how ordering is maintained within the nodes within the concurrent ART implementation. Can you explain that in more depth or reference me into the appropriate logic in the code? Is this more or less the same as described in the "syncart" publication?

laurynas-biveinis commented 1 day ago

A quick partial reply: naively I'd implement ordered traversal by storing a pointer to the current node and the current key. At a higher level this wouldn't differ much (or at all) from how one would iterate a B-tree, or a trie. QSBR would indeed ensure that the iterator node pointer would point to a valid node, but potentially not the latest version of it.

I need to work out an example of "jumping around the tree", or confirm its absence.

Concurrent ART maintains ordering in the nodes straightfowardly as the paper describes. The optimistic lock primitive works as a regular mutex as far as writes are concerned: you lock, modify the node from one consistent state to another, & unlock

thompsonbry commented 1 day ago

Ok. That sounds straightforward.

I could also see issues if you want to reconstruct the prefix for a specific key. That needs some coherent path. Maintaining a stack of references for the current path would be sufficient, but might be overkill.

laurynas-biveinis commented 1 day ago

This reminds me, nodes don't have parent nor sibling pointers, thus iterators must maintain the whole path from the root instead of the current node only anyway. I suppose you could re-traverse using the current iteration key from the root every time you need to change the node, but I am not sure it's a good trade-off

thompsonbry commented 1 day ago

If you have the stack of the traversal, then everything for the path to the current leaf should be pinned by the epoch system and prefix reconstruction for the current visited key should be trivial.

I agree that re-traversal from the root for next() is a Bad Idea.

Do you have version information in the nodes? The main use of that I think would be if you have multiple threads that were coordinating a read/write workload for a single transaction and you wanted to see the most recent writes in the secondary index. To me, that sort of fine-gained multi-writer / multi-reader visibility in the traversal is not a requirement and I wonder about the additional synchronization overhead that might result.

laurynas-biveinis commented 1 day ago

Wait, I'm hallucinating "multiple versions" of the nodes. There are no multiple versions, this is not COW. Nodes are guaranteed to exist and be always OK to read between Q states, but the reads might fail in that the node has been updated meanwhile, as indicated by the version bump in the read "critical section". Point queries, both in the paper and in the implementation, restart upon this. The paper discusses upgrading read locks to write locks if there are too many restarts to prevent starving.

The optimistic lock, which every OLC node has, has the version counter and marked-for-deletion bit.

thompsonbry commented 1 day ago

During a structural mutation in which a node is replaced, you already have a node and it can be visible to any thread which has entered epoch protection, right? Or I am off base here.

If this is true, then the old node remains visible to threads which have already dereferenced it until all such threads have exited the epoch and epoch deferred GC can reclaim the allocation.

If that is not what is going on, then I am off base and need to look at things more closely. And I wonder how replacing nodes is safe when a reader may already have a pointer to the node and be looking at its data.

The pattern is only locking for writers? Or am I off base there?

thompsonbry commented 1 day ago

Sure. You can fail if the version is bumped, but does a reader need to fail for a version bump or just a writer?

That is, is the read view protected through the epoch mechanism?

What actions bump the node version? Is that done for mutation in the node? In which case a reader would need to restart within the node to find the next key in sequence?

thompsonbry commented 1 day ago

One other possibility is to make a copy of the node into a stack allocation. Load acquire the version tag before the copy and after the copy. If unchanged, it is a valid point in time copy of the data.

But this introduces more data movement than I would like. I suspect there is a better way and I just need to get further into what is actually going on in the implementation.

thompsonbry commented 1 day ago

I was not suggesting that any mutation forced a copy. But a structural modification creates a new node and the data are copied into the new tree structure, right? But that is also the case where epoch protection would ensure that a reader looking at the old node would see valid (stale, but valid) data. Because the GC of that old node must be deferred until there is no longer any thread in the critical section — with active access to the old node allocation.

laurynas-biveinis commented 1 day ago

At times there are multiple nodes: when insert or deletion causes the node size change to the next type (4 <-> 16 <-> 48 <-> 256). The old node is guaranteed to be accessible until Q state. Other inserts and deletes are in-place. So structural modifications are safe in regard to node access.

Readers fail if a writer has bumped the version in the middle of a read. Writers don't fail because they lock exclusively. Node modifications are finished with a write unlock, which bumps the version. Reads must restart if writes happened meanwhile, it's hard to describe this less vaguely without the actual algorithm, basically, check for restart in every place at which you are no longer consistent if a write happened to the same node in parallel.

thompsonbry commented 1 day ago

Ok. That all makes sense. So the question comes down to "what does a reader restart mean" during a traversal? And how often does a reader need to check for a restart?

If we copy out the node contents and ensure that the version was not bumped, then we have a valid copy and can just use it. But this is more expensive than would be ideal since the copy is always made.

You definitely need to load the version tag before reading on a node. Since that is required to detect a version change.

So I suspect the pattern is to load the version tag again after each read (of a key or a child pointer). If the version tag is unchanged, the read is valid.

For the scan, you can proceed normally until the version tag changes under you. At that point, you need to "restart" in some fashion. That restart needs to be at the next key after the last key read by the scan. So a simple algorithm would be to do a search for the last key (which we might have to reconstruct from the path to the current slot), load the version tag in whatever node you land on, then look for the next slot in key order in that node, read it, check the version tag, and so forth.

thompsonbry commented 1 day ago

Without buffering the visitation output, we can't proceed speculatively to read more data before rechecking the version tag. So we need to recheck the version tag after reach read operation which has a side effect visible to the consumer of the scan. So we could find the next key and then get the child pointer and only then recheck the version tag because nothing in between was visible to the consumer of the iterator.

laurynas-biveinis commented 1 day ago

For point lookups, reader restart means full restart from the top. For scans, if we save path from the top, a partial restart to the lowest unchanged level might suffice. Your algorithm outline makes sense too. I don't think it's worth copying nodes - directly accessing the needed parts, checking if they are consistent and proceeding should be faster anyway than copying, checking if the copy was consistent, and then proceeding.

As for how often the version needs to be checked during reads, for a representative example check Figure 3 https://db.in.tum.de/~leis/papers/artsync.pdf right hand side. The implementation matches this pseudocode (which is a compliment for the paper, as the research paper pseudocode usually omits crucial details. Not here.) Perhaps a late parent version check there might be counterintuitive, but I can confirm it's needed for correctness.

thompsonbry commented 1 day ago

You mean this line?

nextNode = node.findChild(key[level])
checkOrRestart(node, version) <====

That makes sense to me. I would expect to see a version check there. We have found a child pointer. We need to make sure that the information remains valid by checking the version tag.

laurynas-biveinis commented 1 day ago

No, line 4: readUnlockOrRestart(parent, versionParent)

thompsonbry commented 1 day ago

That's curious. I guess it is protecting against your having traversed to a child node when the parent node was modified and hence the child node might not be the right node anymore.

I would have thought that epoch protection would have made that Ok without the check.

laurynas-biveinis / unodb

Is there scan (iterator) support? #601