Closed johnkerl closed 1 year ago
This got a bit derailed by other things but here is a proposal:
Adding some measure of “state” to the SOMA API can allow us to solve two related problems:
`Array` handle when working with a remote collection saves reopening the array, and in turn one or more round trips for every single operation.
When a SOMA object is opened, it records a token that represents the configuration at the time it was opened. For example, for TileDB this would be the timestamp at which the object was opened. An implementation which stored SOMA data in a version control system might use the current revision as the token. When “child” SOMA objects are accessed through that parent object (e.g. if a collection is opened, and a child dataframe is accessed), the child is then opened with that same token.
The opened SOMA object may also maintain a long-lived handle to its “actual” backing storage, in whatever form that takes. For instance, a TileDB-based backend could keep an array opened, or a filesystem-based backend could keep a file opened.
Using this token may not necessarily guarantee a fully-consistent view of the data, but is intended to provide improved consistency suitable for most use cases. (This is similar to the way TileDB VCF currently works—it uses the same timestamp across multiple arrays to ensure a consistent view as far as the storage backend guarantees it.) It is not intended to guard against adversarial users or pathological cases, and cannot provide any guarantees stronger than those provided by the storage engine.
In these examples, the timestamp listed at the start of the action represents wall time at which an action happened, whereas “with/at a timestamp of X” represents the logical timestamp within the storage engine.
✔ Cooperative readers/writers
In this case, User B will see consistent data in their dataset, as if the write at 10:40 never occurred.
❌ Highly concurrent operation
User B is not guaranteed to see consistent data in their dataset, since User A was writing data timestamped before 11:35 as of 11:40.
❌ Adversarial/broken writers
User B is not guaranteed to see consistent data in their dataset, since User C inserted data that appears to have come from before its open timestamp.
An implementation should provide the strongest consistency guarantees that it is able to. The above examples are a guideline for what a user should be able to expect, intended to provide reasonable and useful levels of consistency without imposing onerous requirements of the implementation itself. However, since this is overall a best-effort system, an implementation which offers zero guarantees and completely ignores these tokens is valid (though not particularly useful).
When a SOMA object is opened, the current timestamp (or some other token representing the current state of the data) is recorded. “Opening” a SOMA collection need not be an active process; an implementation can lazily create the handles to the stored data itself when data is requested, but doing so should be semantically equivalent to having opened the underlying object in the state it was in as of the original open call.
In many programming environments (C/C++, Python, etc.), when a file handle is created it is implicitly opened. Similarly, when a SOMA object is created, it is implicitly considered opened. When accessing a child object through a parent object (e.g. accessing a DataFrame stored in a Collection), the timestamp should be passed through to the child object, so that all the objects in the collection are opened with the same consistent state.
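The parent-to-child token passing and lazy handle creation described above can be sketched roughly as follows. This is a minimal illustration only: `ToyObject`, `ToyCollection`, and `open_collection` are hypothetical names invented for the sketch, the "handle" is just a string, and none of this is the SOMA or TileDB API.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyObject:
    """Hypothetical stand-in for a SOMA object (assumption, not the real API)."""

    uri: str
    open_token: int  # e.g. a millisecond timestamp recorded at open time
    _handle: Optional[str] = field(default=None, repr=False)

    def handle(self) -> str:
        # Lazily create the backing handle, but always "as of" open_token,
        # so a late open is equivalent to having opened at the original call.
        if self._handle is None:
            self._handle = f"{self.uri}@{self.open_token}"
        return self._handle


@dataclass
class ToyCollection(ToyObject):
    children: Dict[str, str] = field(default_factory=dict)  # name -> child URI

    def __getitem__(self, name: str) -> ToyObject:
        # The child inherits the parent's token instead of taking a fresh one.
        return ToyObject(uri=self.children[name], open_token=self.open_token)


def open_collection(
    uri: str, children: Dict[str, str], now_ms: Optional[int] = None
) -> ToyCollection:
    # Record the token once, when the top-level object is opened.
    token = int(time.time() * 1000) if now_ms is None else now_ms
    return ToyCollection(uri=uri, open_token=token, children=dict(children))
```

Under these assumptions, every object reached through the collection carries the same token, regardless of how much wall time passes before its handle is actually created.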
> When a SOMA object is opened, the current timestamp (or some other token representing the current state of the data) is recorded.
Is this the TileDB-provided timestamp that is used to perform TileDB-supported time traveling? Or is it a client-determined value? I'm assuming that since a given Array's timestamp may not be useful across multiple Arrays, that a client-determined value is needed. And I think that's what you're getting at with the "wall time" vs "logical timestamp" distinction in the consistency guarantee examples, above. Either way, it would be helpful for clarity if the proposal nailed that down.
> so that all the objects in the collection are opened with the same consistent state.
That actually seems worth listing under the rationale section, right? Having consistent reads across all grouped TileDB arrays seems important.
I'm not clear on whether TileDB provides a single timestamp that can be used across an entire group of arrays, or if that is array-specific. If the latter, would that imply that the API needs to determine each TileDB array's current timestamp at open() time so that it can be used if/when a given array is accessed?
This looks reasonable from a reader perspective, i.e., read-time consistency via a token/handle/timestamp.
We also need the same thing from the writer viewpoint, to handle the case where a writer wants to perform a multi-write operation, which succeeds all-or-nothing. This requires a slight extension to this proposal:
> Is this the TileDB-provided timestamp that is used to perform TileDB-supported time traveling? Or is it a client-determined value? I'm assuming that since a given Array's timestamp may not be useful across multiple Arrays, that a client-determined value is needed. And I think that's what you're getting at with the "wall time" vs "logical timestamp" distinction in the consistency guarantee examples, above. Either way, it would be helpful for clarity if the proposal nailed that down.
Yes, the “logical timestamp” is a client-provided value. For writing, it determines the point in time at which the write is recorded as having been performed. A single write operation (which can mutate many values on an array) either fully succeeds or fully fails. For reading, the timestamp is client-determined; every write that happens “before” then is included. The reader in this case would choose the current wall time and does not need to look at the array to do so. In the examples I was trying to communicate that different users might have clocks that are slightly off from “real” wall time but that may not have been quite as clear as I had hoped.
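To make the read/write timestamp semantics above concrete, here is a toy in-memory model. It is a sketch under stated assumptions, not TileDB: `ToyArray` is a hypothetical class, and treating a write stamped exactly at the read timestamp as visible is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


class ToyArray:
    """Hypothetical in-memory array with timestamped writes (not TileDB)."""

    def __init__(self) -> None:
        self._writes: List[Tuple[int, Dict[str, int]]] = []

    def write(self, timestamp: int, cells: Dict[str, int]) -> None:
        # One write is recorded whole, at a client-chosen logical timestamp.
        self._writes.append((timestamp, dict(cells)))

    def read(self, timestamp: int) -> Dict[str, int]:
        # Apply writes in logical-timestamp order, including only those
        # stamped at or before the read timestamp (assumption: inclusive).
        result: Dict[str, int] = {}
        for ts, cells in sorted(self._writes, key=lambda w: w[0]):
            if ts <= timestamp:
                result.update(cells)
        return result


arr = ToyArray()
arr.write(10, {"a": 1})
arr.write(30, {"a": 2, "b": 3})
view_at_20 = arr.read(20)  # sees only the write stamped 10
```

Note that in this model a write submitted later in wall time but stamped with an earlier logical timestamp (say 15) would change what a subsequent `read(20)` returns, which mirrors the adversarial/broken writer case from the examples.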
tl;dr: I believe the specification/implementation that I’m proposing does address both of your concerns, just that my explanation wasn’t quite clear on that. It is possible that there is something I missed here so do let me know if you have more questions.
> perform a multi-write operation, which succeeds all-or-nothing
Unfortunately TileDB only supports this on a per-array basis and I don’t see a way to accomplish this in general across multiple arrays (or that we could specify this behavior in general across other SOMA implementations).
- "Open" has a mode - read or write
- an explicit "close" operation, allowing the underlying implementation to commit the results
Oops! This is something I was thinking about when I started drafting this, but it fell out of my brain as I got into the weeds of other stuff. All of this is roughly what I was thinking.
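A rough sketch of what "open with a mode, plus an explicit close that commits" might look like, using a Python context manager. `ToyHandle` and its methods are hypothetical names for illustration; buffering writes until `close()` is one possible reading of "allowing the underlying implementation to commit the results", not a description of any actual implementation.

```python
from typing import Dict


class ToyHandle:
    """Hypothetical object with open(mode)/close semantics (not the SOMA API)."""

    def __init__(self, storage: Dict[str, int], mode: str) -> None:
        if mode not in ("r", "w"):
            raise ValueError(f"mode must be 'r' or 'w', got {mode!r}")
        self._storage = storage
        self.mode = mode
        self._pending: Dict[str, int] = {}
        self.closed = False

    def write(self, key: str, value: int) -> None:
        if self.closed or self.mode != "w":
            raise RuntimeError("object is not open for writing")
        self._pending[key] = value  # buffered until close()

    def read(self, key: str) -> int:
        if self.closed or self.mode != "r":
            raise RuntimeError("object is not open for reading")
        return self._storage[key]

    def close(self) -> None:
        # Commit buffered writes exactly once, then mark the handle closed.
        if not self.closed:
            self._storage.update(self._pending)
            self.closed = True

    def __enter__(self) -> "ToyHandle":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        if exc_type is not None:
            self._pending.clear()  # drop buffered writes on error
        self.close()
```

With this shape, `with ToyHandle(storage, "w") as h: ...` commits on a clean exit and discards the buffered writes if the block raises.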
> Unfortunately TileDB only supports this on a per-array basis
Very aware -- I was focused on single-object consistency. We can sort out multi-array/group consistency at a later date, and layer it with an API revision.
@thetorpedodog @johnkerl @aaronwolen @eddelbuettel @gspowley @Shelnutt2 @bkmartinjr @atolopko-czi
Hi all, as discussed, initial implementation made me nervous about how having open(mode,timestamp)+close on each SOMA object allows the developer to conjure states with tricky edge cases -- e.g. a collection and its members open in different modes and/or different timestamps, with these states changing as the program executes. I'd like to outline a simpler approach (either as a proposal or a strawman, depending on how it's received :sweat_smile:).
I understood the cases we really want to cover with timestamps are:
I believe the reading case is served adequately by supplying timestamp as a property of the session-wide context (as in #644 #681). One would just initialize the desired read timestamp (or timestamp interval) in SOMATileDBContext and provide that to the SOMAExperiment constructor; all objects accessed through the Experiment would then naturally inherit this context, and you can't easily create a situation where different objects are set to read from different timestamps.
For writing, a similar mechanism could apply, with the caveat that if you create an object before adding it to a collection, then you need to supply the desired context when initializing/creating the object, because there's no parent to inherit it from at that point. (See this series of ops in cell_census_builder for an example where the TileDB_Ctx is expressly fed in to each SparseNDArray initializer, and only afterwards added to measurement.) Note the same caveat would exist using timestamps with open() for the same reason.
Thus I don't think open(timestamp)/close are needed to achieve the desired consistency properties, and the model is simpler without them. There's still the other use case for open/close, to keep handles/metadata open for a series of operations. But that would be an optional performance enhancement only.
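The context-based approach in this comment could look roughly like the sketch below. All names (`ToyContext`, `ToyNDArray`, `ToyExperiment`) are hypothetical and only loosely modeled on the idea; this is not the actual `SOMATileDBContext` or `SOMAExperiment` interface. Rejecting members whose context differs from the experiment's is an extra assumption added for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ToyContext:
    """Hypothetical immutable, session-wide context (assumption)."""

    timestamp_ms: Optional[int] = None  # None means "no fixed timestamp"


class ToyNDArray:
    def __init__(self, uri: str, context: ToyContext) -> None:
        # A standalone object has no parent yet, so the context is explicit.
        self.uri = uri
        self.context = context


class ToyExperiment:
    def __init__(self, uri: str, context: ToyContext) -> None:
        self.uri = uri
        self.context = context
        self._member_uris: Dict[str, str] = {}

    def add(self, name: str, member: ToyNDArray) -> None:
        # Assumption of this sketch: refuse members created under a
        # different context, so timestamps cannot silently diverge.
        if member.context != self.context:
            raise ValueError("member was created with a different context")
        self._member_uris[name] = member.uri

    def __getitem__(self, name: str) -> ToyNDArray:
        # Objects reached through the experiment inherit its context.
        return ToyNDArray(self._member_uris[name], self.context)
```

Because the context is frozen and fixed at construction, there is no per-object open call through which members could drift to different timestamps.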
Thanks for the summary. A couple of minor thoughts:
`atomic` seems more than required. What we really need is the ability to create or update the contents of a SOMACollection/SOMAExperiment, such that use case 1 is true (i.e., the collection and collection contents are consistent for all readers). I don't think the above points substantially affect your proposal. Assuming that is true, I'm good with the simplified proposal.
@mlin Unless I'm missing something, I don't believe we stabilized on whether there would be an explicit `exp.open()` for reads _and_ writes, or an implicit open for read at constructor time.
There are performance benefits to be had for the latter, and I believe it would be preferable to get explicit `exp.open()` (as a required syntax) into the API for 1.0 rather than introducing it later.
@johnkerl https://github.com/single-cell-data/TileDB-SOMA/pull/730 is what open/close would look like purely for the performance-enhancing reuse of handle across multiple read or write ops (excluding anything to do with timestamps, as if we were using my simplified idea to handle those)
Note, the C++ `SOMAReader` has its own copy of the array handle; it does not share the Python array handle. The `SOMAReader` opens the array at constructor time and holds the array open. So, we need to keep one `SOMAReader` object instantiated per array to hold that array open for reads.
While holding the array open, we need to reset the state of the `SOMAReader` object before each new query.
An example is shown in this test: https://github.com/single-cell-data/TileDB-SOMA/blob/main/libtiledbsoma/test/test_soma_reader.py#L257-L284
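As a rough illustration of the "one reader per array, reset before each query" pattern described above, here is a self-contained sketch. `ToyReader` and `ReaderCache` are hypothetical stand-ins; the method names `reset` and `submit` are assumptions of this sketch and are not claimed to match the real `SOMAReader` interface.

```python
from typing import Dict, List, Optional


class ToyReader:
    """Hypothetical reader that opens its array once, at construction."""

    opens = 0  # counts simulated array opens across all readers

    def __init__(self, uri: str) -> None:
        self.uri = uri
        self.columns: Optional[List[str]] = None
        ToyReader.opens += 1  # stands in for the expensive open

    def reset(self) -> None:
        # Clear per-query state while keeping the array open.
        self.columns = None

    def submit(self, columns: List[str]) -> None:
        if self.columns is not None:
            raise RuntimeError("reset() must be called before a new query")
        self.columns = list(columns)


class ReaderCache:
    """Keeps one reader per array URI so each array is opened only once."""

    def __init__(self) -> None:
        self._readers: Dict[str, ToyReader] = {}

    def query(self, uri: str, columns: List[str]) -> ToyReader:
        reader = self._readers.get(uri)
        if reader is None:
            reader = ToyReader(uri)
            self._readers[uri] = reader
        reader.reset()  # fresh query state, same open array
        reader.submit(columns)
        return reader
```

Repeated calls to `query` for the same URI reuse the cached reader, so the simulated open happens once per array rather than once per query.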
This has been implemented, so we are going to close this Issue. A new one has been opened for documenting the design.
Make a summary proposal for SOMA