single-cell-data / SOMA

A flexible and extensible API for annotated 2D matrix data stored in multiple underlying formats.

Baseline of proposal for durable/ephemeral and everything else #48

Closed thetorpedodog closed 1 year ago

bkmartinjr commented 1 year ago

I like the general direction this is headed. A couple of general questions:

  1. Where/how would the mapping between durable collections and the storage namespace be managed? Would this be part of the DurableCollection.create() method? In particular, I'm interested in the following cases:
    • Specifying the URL for any given object (currently the storage identity of an object and its "name" in the collection are independently specified, and I think you still need that for some storage sub-systems, e.g., TileDB-Cloud).
    • The ability to use relative names at the storage level, so that I can create a "movable" collection of objects (e.g., I can cp the entire collection without renaming anything inside it).
  2. Context - is this also where other "storage engine configuration" would live, e.g., soma.init_buffer_bytes, num threads, etc?
  3. The primary "protocol" enforced by the composed types (experiment, measurement) is existence of certain names, and the data types which may be assigned those names (side note: ideally we would also be able to enforce additional constraints, such as the dimensions of child objects). E.g. experiment.obs must be a dataframe.
    • I am not clear from the proposal how the type constraints are enforced. Can you clarify?
    • It seems like the setitem for Experiment/Measurement needs to "validate" that the key/value pair is legal. Is the intent that there would be hooks to allow this?
  4. Reification:
    • When a durable collection, containing an experiment/measurement, is opened (from persistent storage), would it be automatically promoted to Experiment/Measurement type? Are there any sharp edges around this? Or is it up to the user to open a collection, identify its "specialized type" and cast it?
    • Same question for other durable types (dataframe, ndarray) - if they are "opened", is their type automatically determined/available, or are they presented as some sort of generic object that needs to be composed by the user?

Other minor stuff:

  • it might be simpler for DurableCollection to have a separate create method for each soma data type (rather than consolidating all "array" creators into one type, etc).
  • Am I correct that an implication of this is that we will no longer support stand-alone DataFrame/NdArray outside of a DurableCollection? (that makes some sense, but wanted to confirm & highlight)
  • The model implies that a DataFrame/NdArray must exist in one-and-only-one DurableCollection (and zero-or-more ephemeral collections)?

bkmartinjr commented 1 year ago

Additional questions/thoughts:

johnkerl commented 1 year ago

Thanks @thetorpedodog !! :) I have no additional thoughts beyond @bkmartinjr 's excellent questions. :)

thetorpedodog commented 1 year ago

So: working on open/close has made what I’m going for a lot clearer to me. In the last couple days of the creative process I realized that writing examples of how I expect the API to be used would be more helpful to me (and hopefully better for discussion) than trying to write abstract spec-style descriptions of what a method or object would do.

I would say that this proposal is now more about the separation of concerns between composed collection behavior and collection storage than it is about durable vs. ephemeral (though separating the behavior from the storage enables the creation of ad-hoc ephemeral collections).

I jumped around in answering the questions below so some things might only make sense after reading later parts.

  1. Where/how would the mapping between durable collections and the storage namespace be managed? Would this be part of the DurableCollection.create() method? In particular, I'm interested in the following cases:

    • Specifying the URL for any given object (currently the storage identity of an object and its "name" in the collection are independently specified, and I think you still need that for some storage sub-systems, e.g., TileDB-Cloud).
    • The ability to use relative names at the storage level, so that I can create a "movable" collection of objects (e.g., I can cp the entire collection without renaming anything inside it).

What I'm going for here is that in most cases, object creation is done in a mostly top-down fashion. To create a new stored SOMA object, the user creates it with the "smart" somaimpl.create(...) method, which, given a complex SOMA type, creates the appropriate storage object and gives it the appropriate fields, and wraps this storage object (for TileDB, the Group or Array) in whatever complex somabase type it represents. This complex somabase type then has methods to create and manage its children as needed. Calling one of those in turn calls back into the storage engine to create the backing storage for the child object, then returns a wrapped object.
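
A rough sketch of that flow, assuming a somaimpl.create entry point as described above, illustrative child-creation method names (add_new_dataframe, add_new_collection), and an obs_schema defined elsewhere:

# Hypothetical top-down creation; the child-creation method names and the
# soma_type argument spelling are illustrative, not a settled API.
exp = somaimpl.create("file:///data/my-experiment", soma_type="SOMAExperiment")
# somaimpl.create builds the backing storage (for TileDB, a Group), gives it
# the appropriate fields, and hands back a wrapped Experiment.

obs = exp.add_new_dataframe("obs", schema=obs_schema)
# The Experiment calls back into the storage engine to create the backing
# Array for "obs", then returns the wrapped DataFrame.

ms = exp.add_new_collection("ms", soma_type="SOMACollection")
# Same pattern for child collections: create storage, wrap, return.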

  2. Context - is this also where other "storage engine configuration" would live, e.g., soma.init_buffer_bytes, num threads, etc?

Yes. This would be the thing that you set up once on a per-session basis and is used through multiple queries. The TileDB implementation would probably keep its tiledb.Ctx object loaded on this context. (The contents of a Context are defined entirely by the storage engine.)
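
As an illustration of the per-session setup (the Context constructor and option names below are placeholders; init_buffer_bytes follows the existing config naming):

# Hypothetical per-session context; option names are placeholders.
ctx = somaimpl.Context(
    init_buffer_bytes=4 * 1024**3,  # engine-specific read buffer size
    num_threads=16,                 # engine-specific concurrency
)
# The TileDB implementation would build and cache a tiledb.Ctx from these
# options; other storage engines interpret the contents however they like.

exp = somaimpl.open("file:///data/my-experiment", context=ctx)
# The same context is then reused across all queries in the session.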

  3. The primary "protocol" enforced by the composed types (experiment, measurement) is existence of certain names, and the data types which may be assigned those names (side note: ideally we would also be able to enforce additional constraints, such as the dimensions of child objects). E.g. experiment.obs must be a dataframe.

    • I am not clear from the proposal how the type constraints are enforced. Can you clarify?
    • It seems like the setitem for Experiment/Measurement needs to "validate" that the key/value pair is legal. Is the intent that there would be hooks to allow this?

The assumption I am making is that the stored Collection or Array (DataFrame/NDArray) carries a metadata value indicating its soma_type, which the implementation can read to determine the "specialized type" it should return. This enforces (on some level) that the "specialized type" is correct for things we load from storage.
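
A sketch of how the opener might use that metadata (the registry and the _from_storage hook below are illustrative, not part of the proposal):

# Hypothetical internals of the "smart" opener.
_SOMA_TYPES = {
    "SOMAExperiment": Experiment,
    "SOMAMeasurement": Measurement,
    "SOMACollection": Collection,
    "SOMADataFrame": DataFrame,
    "SOMANDArray": NDArray,
}

def open(uri, mode="r", context=None):
    storage = context.storage_engine.open(uri, mode)     # for TileDB, a Group or Array
    soma_type = storage.metadata["soma_type"]             # written at creation time
    cls = _SOMA_TYPES[soma_type]                           # pick the specialized type
    return cls._from_storage(storage, context=context)     # wrap and return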

For performing __setitem__-style indexing, there are essentially two layers of processing to do:

  1. The complex type (e.g. Experiment, Measurement) has to check that the soma_type of the newly-added child is what it expects it to be. I could call this the "semantic check".
  2. The storage engine needs to verify that it is capable of adding this item to its collection via __setitem__ (for instance, that it is a SOMA object from the same storage engine, and that it meets other requirements, like hypothetically being stored in the same place). This could be the "implementation check". A rough sketch of both layers follows this list.
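
Put together, the two layers might look roughly like this (slot names from the experiment protocol; the methods on the storage handle are illustrative):

# Rough sketch of both validation layers in an Experiment's __setitem__.
class Experiment(Collection):
    _slot_types = {"obs": "SOMADataFrame", "ms": "SOMACollection"}

    def __setitem__(self, key, value):
        # 1. Semantic check (complex type): does the child's soma_type match
        #    what this slot is allowed to hold?
        expected = self._slot_types.get(key)
        if expected is not None and value.soma_type != expected:
            raise TypeError(f"{key} must be {expected}, got {value.soma_type}")

        # 2. Implementation check (storage engine): can this engine actually
        #    attach this object (same engine, compatible location, etc.)?
        self._storage.check_can_add(key, value)
        self._storage.add(key, value)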

What I am anticipating is that the amount of __setitem__ usage will actually be fairly low, since in most cases building out a collection (whether it be an experiment, a measurement, or another collection type) is done with the "top-to-bottom" strategy of creating the complex type root and then creating branches and leaves recursively within that.

This also interacts with the "ownership" structure and how we handle "open" state—__setitem__ might allow a user to pass in an object opened in a different mode or with a different timestamp, and we would have to figure out how to regularize that and also handle object ownership. (This comes back to where I have been considering a protobuf-style ownership model.)

  4. Reification:

    • When a durable collection, containing an experiment/measurement, is opened (from persistent storage), would it be automatically promoted to Experiment/Measurement type? Are there any sharp edges around this? Or is it up to the user to open a collection, identify its "specialized type" and cast it?
    • Same question for other durable types (dataframe, ndarray) - if they are "opened", is their type automatically determined/available, or are they presented as some sort of generic object that needs to be composed by the user?

I think I have covered these with previous (and immediately following) discussion of how somaimpl.open would work. tl;dr an object in storage stores what type it is, and when opened with the "smart" opener, it returns the appropriate object type.

Other minor stuff:

  • it might be simpler for DurableCollection to have a separate create method for each soma data type (rather than consolidating all "array" creators into one type, etc).

This is where the central create method comes into play, since that is where the semantic knowledge of "how do I create a storage object for this complex type" exists.

  • Am I correct that an implication of this is that we will no longer support stand-alone DataFrame/NdArray outside of a DurableCollection? (that makes some sense, but wanted to confirm & highlight)

My intent here is that somaimpl.open(...) will read the object's metadata and return the appropriate complex object type. If the stored object's metadata indicates that its type is DataFrame or NDArray (presumably we store this in the soma_type metadata entry), somaimpl.open would know to return the stored data as a DataFrame/NDArray as needed. This would work the same way as opening an Experiment by URI—the engine would check its metadata, see that it is an experiment, and return it as an Experiment.
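
As a small usage illustration (URIs hypothetical):

df = somaimpl.open("file:///path/to/standalone/dataframe")
# soma_type metadata says it is a dataframe, so df comes back as a DataFrame.

exp = somaimpl.open("file:///path/to/an/experiment")
# soma_type metadata says it is an experiment, so exp comes back as an Experiment.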

  • The model implies that a DataFrame/NdArray must exist in one-and-only-one DurableCollection (and zero-or-more ephemeral collections)?

Essentially, yes—a DataFrame or NDArray is "owned" by some object (which will usually be some Collection, but could be itself), but a user could put it into any number of ephemeral collections since those are not backed by any storage.
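
For example (assuming SimpleCollection, mentioned later in this thread, is the ad-hoc in-memory collection):

# Sketch: the same stored DataFrame placed in several ephemeral collections.
df = somaimpl.open("file:///path/to/some/dataframe")

scratch_a = SimpleCollection()
scratch_b = SimpleCollection()
scratch_a["expression"] = df
scratch_b["to-review"] = df
# Both ephemeral collections hold references to the very same handle; nothing
# is written to storage, and df is still "owned" by its durable parent.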

thetorpedodog commented 1 year ago

I should also note that the way create and open work are not set in stone. I have been thinking of this as something that works similarly to the way that built-in open()-type calls work, where you give it the path (URI, whatever) to open and it gives you the object back. The reason I prefer that to an interface that works like my_exp = Experiment(some_uri, ...); my_exp.create(...) is that it essentially ensures that when you're working with a SOMA object, it is one that you can know is valid.

This is not to preclude that other model—it would not be a huge adjustment and I can see how others might prefer it.
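
For comparison, the two spellings would look roughly like this (both hypothetical):

# Preferred model: creation/open hands back an object known to be valid.
exp = somaimpl.create("file:///path/to/experiment", soma_type="SOMAExperiment")

# Alternative model: construct a handle first, then create the storage.
exp = Experiment("file:///path/to/experiment")
exp.create()  # before this call, exp refers to storage that may not exist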

bkmartinjr commented 1 year ago

now more about the separation of concerns between composed collection behavior and collection storage than it is about durable vs. ephemeral

this resonates (positively) with me - it feels like the core concern.

Reading the notes above - all sounds good. The only edge cases you should consider are where you want to perform a delayed addition to a collection or move a dataframe/ndarray between collections. I believe most of these are easy, but helpful to clarify, e.g.,

  • when can I move an ndarray/dataframe from one collection to another? Is there a hidden "storage manager" object that defines when this is possible?
  • can I put a dataframe/ndarray into more than one collection?
  • can I open a long-existing collection, and add new dataframe/ndarray's to it?

Pretty sure most of these are already considered - just worth clarifying.

it essentially ensures that when you're working with a SOMA object, it is one that you can know is valid.

Big thumbs up. I prefer this over our current model, which requires that the user protect code with exists(), or the equivalent - it essentially adds a hidden bit of state.

That said, I believe the reason it was done this way is to allow for hierarchical collections such as the current "walk the entire tree" repr using the same API. Not hard to work around that as I understand the typical use cases (e.g., open a collection and do an ls -lR on it)

thetorpedodog commented 1 year ago

Reading the notes above - all sounds good. The only edge cases you should consider are where you want to perform a delayed addition to a collection or move a dataframe/ndarray between collections. I believe most of these are easy, but helpful to clarify, e.g.,

  • when can I move an ndarray/dataframe from one collection to another? Is there a hidden "storage manager" object that defines when this is possible?

This would be handled by the assignment process, so doing some_collection[name] = element would perform the action. How the collection would handle being assigned to is implementation-dependent. The collection assigned to could:

  1. reject assignments altogether
  2. add a reference to the object to its backing store (i.e., both now point to the same data on disk)
  3. copy the data from the object to its backing store (i.e. the new element points to a new array/whatever with the same data as the original)

The new element that is at some_collection[name] would likely not be the same object as element, since that would cause confusion with respect to opening status and context dependence. That is, in either of the latter two cases, the result would look something like:

source_dataframe = vsoma.open("file:///path/to/some/dataframe")
# source_dataframe is opened for reading.

some_collection = vsoma.open("file:///path/to/different/collection", mode="w")
# some_collection is opened for writing.

some_collection["new-data"] = source_dataframe
# source_dataframe is added to some_collection as "new-data".

new_data_dataframe = some_collection["new-data"]
# we extract the actual "new-data" element that is available in some_collection.

print(new_data_dataframe is source_dataframe)
# -> False
# new_data_dataframe is a version of the source_dataframe data opened under
# the context of some_collection. Both would contain the same data, though
# whether the two handles point at the same array in storage may be
# implementation-dependent.

print(new_data_dataframe.uri == source_dataframe.uri)
# -> If the implementation supports option 2 (adding by reference), this would
#    be True.
# -> If the implementation only supports option 3 (adding by copy), this would
#    be False.

We could also specify that assigning elements to a collection is necessarily done with a reference rather than a copy, and that if assigning would require a copy (e.g. the shape is filesystem-bound), then the storage engine should reject assignments.

  • can I put a dataframe/ndarray into more than one collection?

Yes, subject to the constraints of the storage engine. In the case I am describing above you will (probably) end up with two separate handle objects on the data itself. You could also (and this is where SimpleCollection comes back into play) add any dataframes/arrays/whatever to an ad-hoc in-memory collection which always references the same objects.

  • can I open a long-existing collection, and add new dataframe/ndarray's to it?

This is hopefully explained above—you can open a collection/dataframe/whatever for writing and use that to modify the contents.

Pretty sure most of these are already considered - just worth clarifying.

it essentially ensures that when you're working with a SOMA object, it is one that you can know is valid.

Big thumbs up. I prefer this over our current model, which requires that the user protect code with exists(), or the equivalent - it essentially adds a hidden bit of state.

That said, I believe the reason it was done this way is to allow for hierarchical collections such as the current "walk the entire tree" repr using the same API. Not hard to work around that as I understand the typical use cases (e.g., open a collection and do an ls -lR on it)

This would be a matter of open()ing the root for reading and walking through its members (if any), thus producing a recursive listing.
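
A sketch of that walk, assuming collections are iterable as name-to-member mappings:

# Hypothetical "ls -lR"-style recursive listing.
def walk(name, soma_object, indent=0):
    print(" " * indent + f"{name}: {soma_object.soma_type} @ {soma_object.uri}")
    if hasattr(soma_object, "items"):  # collection-like: recurse into members
        for child_name, child in soma_object.items():
            walk(child_name, child, indent + 2)

root = somaimpl.open("file:///path/to/collection")  # opened for reading
walk("root", root)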

thetorpedodog commented 1 year ago

Since we've essentially made a lot of the changes proposed here, and this was effectively only ever a scratch pad, I am going to drop this change and separately work on importing our new lifecycle management changes into the spec itself.