janelia-flyem / dvid

Distributed, Versioned, Image-oriented Dataservice
http://dvid.io

Versions, Tombstones, and Interfaces #250

Closed. rivo closed this issue 4 years ago.

rivo commented 6 years ago

I have a bunch of questions. I hope it's ok to post them here as an issue.

I was tasked with adding a new storage engine to DVID (OpenStack Swift). From the documentation and the existing source code, I was able to gather most of the information I needed. However, I'm still not clear on a few things, particularly:

Questions about interfaces:

I know it's a lot of questions. If you let me know how we can get in touch directly, maybe we can clarify these more quickly in a call or something. Your answers would really help me to make progress here. Thanks in advance!

DocSavage commented 6 years ago

That’s great that you want to add a storage engine. We are kind of focused on our particular needs for Connectomics research but I’m hoping to get back to improving the engine infrastructure as well as the metadata handling. I plan on adding the Badger storage system pretty soon, as well as OrpheusDB, a versioned PostgreSQL. Could you tell me what company is sponsoring your work?

When a versioned context is passed to a Put() function, is it my job to create a new version for the given key? If so, how? It seems that the basholeveldb implementation, for example, simply uses ctx.ConstructKey() even for a versioned context. Won't that overwrite existing keys?

Note that leveldb.Put() accepts a storage.Context, which is an interface and happens to be fulfilled by a datastore.VersionedCtx. The datastore.VersionedCtx gets created via the datastore.NewVersionedCtx(data, version) function within the datatype implementations. At some point, dvid has to translate the various key components (the type-specific key, the version, any tombstone marker, etc.) into a byte slice suitable to be a storage.Key. This is done by storage.DataContext, which is embedded in datastore.VersionedCtx; in particular, the ConstructKey(TKey) function of a DataContext does the job. When your storage driver calls ctx.ConstructKey(), if the context is a versioned implementation of the interface (as storage.DataContext is), the key you get back has the version ID appended at the end, so keys written under different versions never collide.
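
To make that concrete, here is a minimal sketch of what a driver’s Put() can look like under that scheme. Only ctx.ConstructKey() is from the description above; swiftStore and rawPut are hypothetical names for the new driver and its low-level write helper, and tombstone handling is covered further below.

```go
// Minimal sketch of a driver Put(). The driver never versions keys itself:
// when ctx wraps a versioned context, ctx.ConstructKey(tk) appends the version
// ID, so the same type-specific key written under different versions maps to
// different stored keys and nothing is overwritten across versions.
// swiftStore and rawPut are hypothetical.
func (db *swiftStore) Put(ctx storage.Context, tk storage.TKey, value []byte) error {
	key := ctx.ConstructKey(tk) // full key, including the version ID suffix
	return db.rawPut(key, value)
}
```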

My understanding is that I will write a tombstone key (with an empty value payload) in case of a versioned Delete(). Also, the key (from ctx.ConstructKey()) is actually deleted. Doesn't this mean we can't reverse the delete? In a versioned context, don't we want to keep the deleted value around?

The key-value pairs from other versions will still exist since the real key (produced by ConstructKey) is different across versions. Within a version, if you delete a key-value, that key-value is gone. It’s like a working directory. Only committed versions are immutable and safe.
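
Translated into a driver sketch, a versioned Delete() then does two things: write an empty-payload tombstone for this version and remove this version’s value. The raw* helpers are hypothetical, and the TombstoneKey() helper on the versioned context is an assumption patterned on the existing drivers.

```go
// Sketch of a versioned Delete(): mark the key as deleted in this version with
// an empty-payload tombstone, then remove this version's value. Committed
// ancestor versions keep their own key-value pairs, since their keys differ.
func (db *swiftStore) Delete(ctx storage.Context, tk storage.TKey) error {
	if vctx, ok := ctx.(storage.VersionedCtx); ok {
		// Tombstone hides the key for this version (assumed helper name).
		if err := db.rawPut(vctx.TombstoneKey(tk), []byte{}); err != nil {
			return err
		}
	}
	return db.rawDelete(ctx.ConstructKey(tk))
}
```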

In a versioned Get(), don't I have to check for the existence of a tombstone key and return nil if one exists? Or will VersionedKeyValue() take care of that?

Correct, the datastore.VersionedCtx has a function VersionedKeyValue() which eventually calls the datastore manager’s findMatch() function. That function recursively looks through a version’s ancestors and, if it finds a tombstone, stops there and returns nil.
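
As an illustration only (this is not DVID’s actual findMatch() code, and the types and helpers here are invented), the ancestor walk behaves roughly like this:

```go
// Illustrative pseudocode of the ancestor walk: starting at the requested
// version and moving toward the root, the first version that holds either a
// tombstone or a value for the key decides the outcome.
type versionNode struct {
	tombstones map[string]bool   // type-specific keys deleted at this version
	values     map[string][]byte // type-specific keys written at this version
}

func findMatch(ancestors []versionNode, tk storage.TKey) []byte {
	for _, v := range ancestors { // ordered from requested version to root
		if v.tombstones[string(tk)] {
			return nil // deleted at this version, so the key is hidden
		}
		if val, ok := v.values[string(tk)]; ok {
			return val
		}
	}
	return nil // no ancestor ever stored this key
}
```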

In a versioned Put(), the existing implementations delete the tombstone. Is this to "undelete" a key if it was previously deleted?

For a particular version and type-specific key, you can only have either a tombstone or a valid key-value. So when doing a Put() and storing a key-value, you must make sure any tombstone marker is deleted.
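
So the Put() sketch from above gains one step when the context is versioned (same hypothetical helpers as before):

```go
// Revised Put() sketch: clear any tombstone for this version before storing the
// value, so a previously deleted key becomes visible again ("undelete").
func (db *swiftStore) Put(ctx storage.Context, tk storage.TKey, value []byte) error {
	if vctx, ok := ctx.(storage.VersionedCtx); ok {
		if err := db.rawDelete(vctx.TombstoneKey(tk)); err != nil {
			return err
		}
	}
	return db.rawPut(ctx.ConstructKey(tk), value)
}
```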

I'm not clear on which interfaces are required for normal operation of a storage backend. The Wiki talks about OrderedKeyValueDB but is this enough? Specifically, do I need to implement storage.Batch?

This really depends on the data types you plan to use, e.g., keyvalue, labelmap, etc. Generally, implementing an OrderedKeyValueDB will be good enough for any data type. At one point I thought storage.Batch would get used a lot, but it’s actually used very little and could probably be removed from some data type implementations. Right now we recommend implementing it (e.g., see the gbucket driver for Google Cloud Storage), but depending on the data types you want to use, this is something we might explore further to see whether that dependency can be removed.
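
One convenient way to document which interfaces the new driver claims to satisfy is a set of compile-time assertions; swiftStore is hypothetical, and the name of the batching interface here is an assumption patterned on the other drivers.

```go
// Compile-time checks of what the backend implements: OrderedKeyValueDB is the
// baseline, and the batching interface is optional depending on the data types.
var (
	_ storage.OrderedKeyValueDB = (*swiftStore)(nil)
	_ storage.KeyValueBatcher   = (*swiftStore)(nil) // only if Batch is implemented
)
```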

What is storage.MetadataContext? Engine.NewStore() requires me to return a flag called initMetadata but I'm not sure what this value should be. Is there a definition somewhere of what "metadata" is in this context, how it is used, and how it relates to a storage engine?

Some metadata needs to be stored that tells dvid about the repos, DAGs, data instances (names, local IDs, etc). All this data gets stored in unversioned key-value pairs in one of the storage engines. The storage.MetadataContext is a way for dvid to describe this key space that is separate from the storage.DataContext key space.

When dvid starts up, it either creates a new instance of the storage backend (e.g., if a leveldb didn’t exist before dvid started, it will after it starts) or opens the pre-existing database. For a cloud service like Google Cloud Storage (the gbucket driver), this is somewhat different, and I imagine it’s similar with OpenStack Swift. The initMetadata bool tells dvid whether the store is newly created and needs to have metadata stored in it. Currently, all our storage engines can act as metadata stores, but it doesn’t have to be that way. In the future, we may actually set up a log-based method of storing metadata. There are some advantages, though, to putting all the data, including metadata, in a store.
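
A rough sketch of how that flag can be derived, with an illustrative NewStore() signature and hypothetical Swift helpers (not DVID’s actual API):

```go
// Rough sketch: initMetadata just reports whether the backing store was newly
// created (empty) and therefore still needs dvid's metadata written into it.
func (e Engine) NewStore(config dvid.StoreConfig) (dvid.Store, bool, error) {
	db, err := openSwiftContainer(config) // hypothetical: open or create the container
	if err != nil {
		return nil, false, err
	}
	freshlyCreated, err := db.isEmpty() // hypothetical: no objects yet means a new store
	if err != nil {
		return nil, false, err
	}
	return db, freshlyCreated, nil // the bool is the initMetadata flag
}
```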

All the storage files have a build option at the top, e.g. // +build basholeveldb. My understanding is that this excludes these files from some builds. Is there a specific reason for this? Is there a downside to having all storage backends in a build?

Just bloat, and possibly some storage engines would only work in certain environments. For example, you could have a super-duper, cluster-only store that wouldn’t build on a MacBook Pro.
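
For a hypothetical Swift driver, the tag would sit at the top of each driver source file, and the driver is then compiled only when that tag is passed, e.g. go build -tags "swift".

```go
// Hypothetical build tag for a Swift driver: without -tags "swift", these files
// (and the Swift client dependencies they pull in) are excluded from the build.

// +build swift

package swift
```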

rivo commented 6 years ago

Thank you so much for this detailed reply! This really helps me figure out how to structure my code.

Regarding versions, this in particular is a crucial point:

Only committed versions are immutable and safe.

Not being a user of DVID myself, my impression was that every Put() would result in a new version of that key. But if there needs to be a commit and DVID itself moves up to a new version, then that makes things much easier.

Or will VersionedKeyValue() take care of that?

Correct

This function actually takes a KeyValue and not just a Key. Does it require the value to be filled with the actual data? Currently, I don't initialize the value but load it later, when the correct key has been chosen. (This seems more efficient to me than to fully load all the versions and throw away the ones that are not needed.)

Right now we recommend implementing it [storage.Batch]

Ok. Is there a requirement that batches are transactional? I.e. if one of the write/delete actions fails during a commit, do I need to ensure that the batch rolls back to its previous state? This would also have implications on running multiple operations/batches at the same time. To my knowledge, Swift doesn't come with transactions built-in so if a batch commit needs to be atomic, I'm facing quite a challenge here.

You can find my code here: https://github.com/HumanBrainProject/dvid/tree/hbp-swift-bucket/storage/swift

They would eventually like to see this become part of DVID so at some point, when testing has finished, I will probably send you a pull request.

DocSavage commented 6 years ago

Only committed versions are immutable and safe.

Not being a user of DVID myself, my impression was that every Put() would result in a new version of that key. But if there needs to be a commit and DVID itself moves up to a new version, then that makes things much easier.

DVID operates more like the git version control system for software, where an uncommitted node is a “working directory”: you can overwrite things until satisfied, then commit the version. This is because of the size of the data; recording provenance for each operation, say writing a voxel, is unnecessary. We’re more interested in capturing the state of a data volume than the individual state of each voxel.

That’s not to say, though, that you can’t build in more granular versioning within a single dvid version. For example, you could store the merge and split history of supervoxels for a given version, so you could rewind any particular operation. We do this for our newer “labelmap” data type where supervoxels are atoms and the history of merges and “cleaves” (splitting of a region based on supervoxel boundaries) is stored in an append-only log.

Or will VersionedKeyValue() take care of that?

Correct

This function actually takes a KeyValue and not just a Key. Does it require the value to be filled with the actual data? Currently, I don't initialize the value but load it later, when the correct key has been chosen. (This seems more efficient to me than to fully load all the versions and throw away the ones that are not needed.)

I think you are correct when dealing with certain kinds of storage engines, so maybe we should rewrite this function. The initial leveldb datastore moves key-value pairs in tandem, so loading and then manipulating a pointer to the pair doesn’t cost much. However, there are other databases, like Badger, that store the keys separately from the values, and in that case it would make sense to just check keys.
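
A hedged sketch of that lazy approach, assuming the KeyValue struct exposes K/V fields as in the existing drivers; keysForAllVersions and loadObject are hypothetical helpers of the Swift driver:

```go
// Sketch of lazy value loading inside a versioned Get(): collect candidate keys
// across versions with nil values, let the versioned context pick the winning
// key, then fetch only that one value from the store.
func (db *swiftStore) versionedGet(vctx storage.VersionedCtx, tk storage.TKey) ([]byte, error) {
	var kvs []*storage.KeyValue
	for _, key := range db.keysForAllVersions(vctx, tk) {
		kvs = append(kvs, &storage.KeyValue{K: key}) // V deliberately left nil
	}
	kv, err := vctx.VersionedKeyValue(kvs)
	if err != nil || kv == nil {
		return nil, err
	}
	return db.loadObject(kv.K) // load only the chosen version's value
}
```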

Right now we recommend implementing it [storage.Batch]

Ok. Is there a requirement that batches are transactional? I.e. if one of the write/delete actions fails during a commit, do I need to ensure that the batch rolls back to its previous state? This would also have implications on running multiple operations/batches at the same time. To my knowledge, Swift doesn't come with transactions built-in so if a batch commit needs to be atomic, I'm facing quite a challenge here.

You can find my code here: https://github.com/HumanBrainProject/dvid/tree/hbp-swift-bucket/storage/swift

They would eventually like to see this become part of DVID so at some point, when testing has finished, I will probably send you a pull request.

The original intention was that storage.Batch be transactional, and for some stores, like leveldb, you can satisfy that. Note that storage.Batch only batches Delete/Put and not Get. If you look at how batches are really used in the datatypes, though (do a search for NewBatch(ctx)), it’s a way of accelerating storage by grouping lots of small changes into a bigger one. Frankly, I think this is one aspect of dvid that could use some refactoring.

For now, you could implement without rollback and we can see which datatypes are fine with that approach. Ideally, each datatype would be able to say which API endpoints are available by checking which interfaces a particular storage engine satisfies. Most of the endpoints don't need storage.Batch interface. This kind of "polyglot persistence" awareness should be baked into the datatypes, and some datatypes might be highly specialized and require a particular class of storage, e.g., a versioned SQL like OrpheusDB.
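
A minimal sketch of such a rollback-free batch; the type and helper names are illustrative, and the method set follows how storage.Batch is described above (batched Put/Delete with a Commit):

```go
// Minimal non-transactional batch: Put/Delete calls are buffered and simply
// replayed on Commit. There is no rollback, so writes applied before a failure
// stay applied, which is the trade-off discussed above.
type swiftBatch struct {
	ctx storage.Context
	db  *swiftStore
	ops []batchOp
}

type batchOp struct {
	del   bool
	tk    storage.TKey
	value []byte
}

func (b *swiftBatch) Put(tk storage.TKey, v []byte) {
	b.ops = append(b.ops, batchOp{tk: tk, value: v})
}

func (b *swiftBatch) Delete(tk storage.TKey) {
	b.ops = append(b.ops, batchOp{del: true, tk: tk})
}

func (b *swiftBatch) Commit() error {
	for _, op := range b.ops {
		var err error
		if op.del {
			err = b.db.Delete(b.ctx, op.tk)
		} else {
			err = b.db.Put(b.ctx, op.tk, op.value)
		}
		if err != nil {
			return err // no rollback: earlier writes in this batch remain
		}
	}
	b.ops = nil
	return nil
}
```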

I would agree that we should explicitly state the properties a storage engine must satisfy to fulfill given interfaces like OrderedKeyValueDB and Batch.

DocSavage commented 6 years ago

Oliver, could you send me an email at katzw at-sign janelia dot hhmi dot org?