Stateful/stateless API proposal

johnkerl commented 1 year ago

Make a summary proposal for SOMA

thetorpedodog commented 1 year ago

This got a bit derailed by other things but here is a proposal:

Stateful open/close semantics for SOMA API

Rationale

Adding some measure of “state” to the SOMA API can allow us to solve two related problems:

1. We want to be able to reuse the resource handles when performing numerous operations. For instance, with a TileDB-backed SOMA collection, reusing the same Array handle when working with a remote collection saves reopening the array, and in turn one or more round trips for every single operation.
2. We want some better guarantees of consistency within the backing data store. A user might open a SOMA collection and then perform analysis operations on it that run for an hour. If another user writes data to the store 15 minutes after the first user opens the collection, the first user still wants to view the data as it appeared at the start of their session, and not suddenly see the data change midway through their work. In the TileDB case, this is supported by timestamp support and the ability to open an array as it appeared at a specific timestamp.

Proposal

When a SOMA object is opened, it records a token that represents the state of the data at the time it was opened. For example, for TileDB this would be the timestamp at which the object was opened. An implementation which stored SOMA data in a version control system might use the current revision as the token. When “child” SOMA objects are accessed through that parent object (e.g. if a collection is opened, and a child dataframe is accessed), the child is then opened with that same token.

The opened SOMA object may also maintain a long-lived handle to its “actual” backing storage, in whatever form that takes. For instance, a TileDB-based backend could keep an array opened, or a filesystem-based backend could keep a file opened.
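
To make the inheritance rule concrete, here is a minimal sketch. The names `SketchObject`, `SketchCollection`, and `open_collection` are hypothetical stand-ins invented for illustration, not the SOMA API, and the token is assumed to be a millisecond timestamp as in the TileDB case.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class SketchObject:
    """Hypothetical stand-in for a SOMA object; not the real SOMA API."""

    uri: str
    token: int  # opaque state token; assumed here to be a timestamp in ms


@dataclass
class SketchCollection(SketchObject):
    children_uris: dict[str, str] = field(default_factory=dict)

    def child(self, name: str) -> SketchObject:
        # The child is opened with the parent's token, never a fresh one.
        return SketchObject(uri=self.children_uris[name], token=self.token)


def open_collection(
    uri: str, children_uris: dict[str, str], token: int | None = None
) -> SketchCollection:
    # Record the token once, at open time; default to the current wall time.
    if token is None:
        token = int(time.time() * 1000)
    return SketchCollection(uri=uri, token=token, children_uris=children_uris)


coll = open_collection("mem://exp", {"obs": "mem://exp/obs"}, token=1000)
obs = coll.child("obs")
print(obs.token == coll.token)
```

A backend that uses revisions rather than timestamps could store a revision identifier in `token` instead; nothing in the sketch depends on the token being a time.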

Consistency

Using this token may not necessarily guarantee a fully-consistent view of the data, but is intended to provide improved consistency suitable for most use cases. (This is similar to the way TileDB VCF currently works—it uses the same timestamp across multiple arrays to ensure a consistent view as far as the storage backend guarantees it.) It is not intended to guard against adversarial users or pathological cases, and cannot provide any guarantees stronger than those provided by the storage engine.

In these examples, the timestamp listed at the start of the action represents wall time at which an action happened, whereas “with/at a timestamp of X” represents the logical timestamp within the storage engine.

✔ Cooperative readers/writers
1. 10:15–10:20: User A writes data to the dataset, with a timestamp of 10:16.
2. 10:30: User B opens the dataset, at a timestamp of 10:28.
3. 10:40: User A writes more data, with a timestamp of 10:35.
4. 11:00: User B reads more data from the same dataset in the same session, still opened as of 10:28.
In this case, user B will see consistent data in their dataset, as if the write at 10:40 never occurred.
❌ Highly concurrent operation
1. 11:30: User A starts writing data to the dataset, with a timestamp of 11:31.
2. 11:35: User B opens the dataset, at a timestamp of 11:33.
3. 11:40: User A concludes writing data to the dataset.
4. 11:45: User B reads more data from the same dataset, in the same session, still opened as of 11:33.
User B is not guaranteed to see consistent data in their dataset, since at 11:35 User A was still writing data timestamped 11:31 (earlier than User B's open timestamp of 11:33) and did not finish until 11:40.
❌ Adversarial/broken writers
1. 12:00–12:05: User A writes data to the dataset, with a timestamp of 12:00.
2. 12:10: User B opens the dataset, at a timestamp of 12:08.
3. 12:15: User C writes data to the dataset, with a timestamp of 12:01.
4. 12:16: User B reads more data from the same dataset, in the same session, still opened as of 12:08.
User B is not guaranteed to see consistent data in their dataset, since User C inserted data that appears to have come from before User B's open timestamp.
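
The three scenarios above can be replayed with a toy model. This is only a sketch under simplifying assumptions: each write becomes visible all at once at the wall time it finishes (a real store may expose partial writes earlier), and the names `Write`, `visible`, and `is_stable` are invented for illustration.

```python
from dataclasses import dataclass


def hm(hours: int, minutes: int) -> int:
    """Clock time as minutes since midnight."""
    return hours * 60 + minutes


@dataclass(frozen=True)
class Write:
    value: str
    logical_ts: int  # timestamp the writer attached to the data
    wall_done: int   # wall time at which the write finished landing


def visible(store, open_ts, wall_now):
    """Values seen at wall_now by a reader opened as of logical time open_ts."""
    return sorted(
        w.value
        for w in store
        if w.logical_ts <= open_ts and w.wall_done <= wall_now
    )


def is_stable(store, open_ts, first_read, later_read):
    """True if two reads in the same session return the same data."""
    return visible(store, open_ts, first_read) == visible(store, open_ts, later_read)


cooperative = [
    Write("a1", logical_ts=hm(10, 16), wall_done=hm(10, 20)),
    Write("a2", logical_ts=hm(10, 35), wall_done=hm(10, 40)),
]
concurrent = [Write("a1", logical_ts=hm(11, 31), wall_done=hm(11, 40))]
adversarial = [
    Write("a1", logical_ts=hm(12, 0), wall_done=hm(12, 5)),
    Write("c1", logical_ts=hm(12, 1), wall_done=hm(12, 15)),
]

print(is_stable(cooperative, hm(10, 28), hm(10, 30), hm(11, 0)))  # True
print(is_stable(concurrent, hm(11, 33), hm(11, 35), hm(11, 45)))  # False
print(is_stable(adversarial, hm(12, 8), hm(12, 10), hm(12, 16)))  # False
```

In the two failing cases the cause is the same: data with a logical timestamp at or before the reader's open timestamp lands after the reader's first read.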

An implementation should provide the strongest consistency guarantees that it is able to. The above examples are a guideline for what a user should be able to expect, intended to provide reasonable and useful levels of consistency without imposing onerous requirements of the implementation itself. However, since this is overall a best-effort system, an implementation which offers zero guarantees and completely ignores these tokens is valid (though not particularly useful).

Opening semantics

When a SOMA object is opened, the current timestamp (or some other token representing the current state of the data) is recorded. “Opening” a SOMA collection need not be an active process; an implementation can lazily create the handles to the stored data itself when data is requested, but doing so should be semantically equivalent to having opened the underlying object in the state it was in as of the original open call.

In many programming environments (C/C++, Python, etc.), when a file handle is created it is implicitly opened. Similarly, when a SOMA object is created, it is implicitly considered opened. When accessing a child object through a parent object (e.g. accessing a DataFrame stored in a Collection), the timestamp should be passed through to the child object, so that all the objects in the collection are opened with the same consistent state.
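
As a rough illustration of lazy opening, the sketch below defers creating the backing handle until first use but always opens it with the token recorded at construction. `LazyArray` and the `opener` callable are hypothetical names; a real backend would open its own array or file type here.

```python
class LazyArray:
    """Hypothetical lazily-opened object; not the real SOMA API."""

    def __init__(self, uri, token, opener):
        self.uri = uri
        self.token = token      # recorded at "open" time, never refreshed
        self._opener = opener   # callable (uri, token) -> backing handle
        self._handle = None

    @property
    def handle(self):
        # Create the backing handle on first use, as of the original token.
        if self._handle is None:
            self._handle = self._opener(self.uri, self.token)
        return self._handle


calls = []


def fake_opener(uri, token):
    # Stand-in for opening backing storage at a given point in time.
    calls.append((uri, token))
    return {"uri": uri, "as_of": token}


arr = LazyArray("mem://exp/obs", token=1000, opener=fake_opener)
print(len(calls))           # nothing has been opened yet
print(arr.handle["as_of"])  # opened on demand, as of the recorded token
```

Because the token is fixed when the object is created, it does not matter how much later the handle is actually materialized.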

atolopko-czi commented 1 year ago

When a SOMA object is opened, the current timestamp (or some other token representing the current state of the data) is recorded.

Is this the TileDB-provided timestamp that is used to perform TileDB-supported time traveling? Or is it a client-determined value? I'm assuming that since a given Array's timestamp may not be useful across multiple Arrays, that a client-determined value is needed. And I think that's what you're getting at with the "wall time" vs "logical timestamp" distinction in the consistency guarantee examples, above. Either way, it would be helpful for clarity if the proposal nailed that down.

atolopko-czi commented 1 year ago

so that all the objects in the collection are opened with the same consistent state.

That actually seems worth listing under the rationale section, right? Having consistent reads across all grouped TileDB arrays seems important.

I'm not clear on whether TileDB provides a single timestamp that can be used across an entire group of arrays, or if that is array-specific. If the latter, would that imply that the API needs to determine each TileDB array's current timestamp at open() time so that it can be used if/when a given array is accessed?

bkmartinjr commented 1 year ago

This looks reasonable from a reader perspective - ie., read-time consistency via a token/handle/timestamp.

We also need the same thing from the writer viewpoint, to handle the case where a writer wants to perform a multi-write operation, which succeeds all-or-nothing. This requires a slight extension to this proposal:

"Open" has a mode - read or write
an explicit "close" operation, allowing the underlying implementation to commit the results
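
One way to picture this extension is a session object with a mode and an explicit close that commits buffered writes. This is a toy in-memory sketch: `WriteSession` is a hypothetical name, and whether a real backend can commit all-or-nothing depends on the storage engine.

```python
class WriteSession:
    """Hypothetical open/close session over an in-memory list."""

    def __init__(self, store, mode="r"):
        if mode not in ("r", "w"):
            raise ValueError("mode must be 'r' or 'w'")
        self._store = store
        self.mode = mode
        self._pending = []
        self.closed = False

    def write(self, value):
        if self.closed:
            raise ValueError("session is closed")
        if self.mode != "w":
            raise ValueError("session is not open for writing")
        self._pending.append(value)  # buffered until close()

    def close(self):
        if self.closed:
            return
        if self.mode == "w":
            self._store.extend(self._pending)  # commit everything at once
        self._pending = []
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self._pending = []  # an error inside the block discards the batch
        self.close()
        return False  # never swallow the exception


store = []
with WriteSession(store, "w") as session:
    session.write("x")
    session.write("y")
print(store)
```

Using the session as a context manager ties the commit to leaving the block, so a forgotten `close()` cannot silently drop a batch.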

thetorpedodog commented 1 year ago

Is this the TileDB-provided timestamp that is used to perform TileDB-supported time traveling? Or is it a client-determined value? I'm assuming that since a given Array's timestamp may not be useful across multiple Arrays, that a client-determined value is needed. And I think that's what you're getting at with the "wall time" vs "logical timestamp" distinction in the consistency guarantee examples, above. Either way, it would be helpful for clarity if the proposal nailed that down.

Yes, the “logical timestamp” is a client-provided value. For writing, it determines the point in time at which the write is recorded as having been performed. A single write operation (which can mutate many values on an array) either fully succeeds or fully fails. For reading, the timestamp is client-determined; every write that happens “before” then is included. The reader in this case would choose the current wall time and does not need to look at the array to do so. In the examples I was trying to communicate that different users might have clocks that are slightly off from “real” wall time but that may not have been quite as clear as I had hoped.

tl;dr: I believe the specification/implementation that I’m proposing does address both of your concerns, just that my explanation wasn’t quite clear on that. It is possible that there is something I missed here so do let me know if you have more questions.

perform a multi-write operation, which succeeds all-or-nothing

Unfortunately TileDB only supports this on a per-array basis and I don’t see a way to accomplish this in general across multiple arrays (or that we could specify this behavior in general across other SOMA implementations).

"Open" has a mode - read or write

an explicit "close" operation, allowing the underlying implementation to commit the results

Oops! This is something I was thinking about when I started drafting this, but it fell out of my brain as I got into the weeds of other stuff. All of this is roughly what I was thinking.

bkmartinjr commented 1 year ago

Unfortunately TileDB only supports this on a per-array basis

Very aware -- I was focused on single-object consistency. We can sort out multi-array/group consistency at a later date, and layer it with an API revision.

mlin commented 1 year ago

@thetorpedodog @johnkerl @aaronwolen @eddelbuettel @gspowley @Shelnutt2 @bkmartinjr @atolopko-czi

Hi all, as discussed, initial implementation made me nervous about how having open(mode,timestamp)+close on each SOMA object allows the developer to conjure states with tricky edge cases -- e.g. a collection and its members open in different modes and/or different timestamps, with these states changing as the program executes. I'd like to outline a simpler approach (either as a proposal or a strawman, depending on how it's received :sweat_smile:).

I understood the cases we really want to cover with timestamps are:

1. Consistent reading: open/initialize the top-level SOMAExperiment with a given timestamp, and all accessed elements (sub-collections and arrays) automatically inherit that timestamp for reading
2. Happy-path-atomic writing: for some logical batch of writes, give them all the same timestamp -- so that there exists no timestamp a reader could later specify that would give them an incomplete view of that batch. ("Happy-path" meaning the assumption that no errors occur during writing, in the absence of a true group-commit feature.)

I believe the reading case is served adequately by supplying timestamp as a property of the session-wide context (as in #644 #681). One would just initialize the desired read timestamp (or timestamp interval) in SOMATileDBContext and provide that to the SOMAExperiment constructor; all objects accessed through the Experiment would then naturally inherit this context, and you can't easily create a situation where different objects are set to read from different timestamps.

For writing, a similar mechanism could apply, with the caveat that if you create an object before adding it to a collection, then you need to supply the desired context when initializing/creating the object, because there's no parent to inherit it from at that point. (See this series of ops in cell_census_builder for an example where the TileDB_Ctx is expressly fed in to each SparseNDArray initializer, and the array is only afterwards added to measurement.) Note the same caveat would exist using timestamps with open() for the same reason.
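
A small sketch of the context-based idea, covering both the inherit-on-access path and the create-then-add caveat. All names here (`SketchContext`, `SketchArray`, `SketchCollection`) are hypothetical and do not reflect the actual SOMATileDBContext API; rejecting a member with a mismatched context is an assumption made for illustration, not something the thread specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchContext:
    """Hypothetical session-wide context carrying the timestamp."""

    timestamp_ms: int


class SketchArray:
    def __init__(self, uri, context):
        self.uri = uri
        self.context = context  # must be supplied: no parent to inherit from


class SketchCollection:
    def __init__(self, uri, context):
        self.uri = uri
        self.context = context
        self._members = {}

    def create_array(self, name):
        # Created through the parent, so the context is inherited.
        arr = SketchArray(f"{self.uri}/{name}", self.context)
        self._members[name] = arr
        return arr

    def add(self, name, member):
        # Assumed policy: refuse members created under another context.
        if member.context != self.context:
            raise ValueError("member was created with a different context")
        self._members[name] = member

    def __getitem__(self, name):
        return self._members[name]


ctx = SketchContext(timestamp_ms=1000)
exp = SketchCollection("mem://exp", ctx)
inherited = exp.create_array("obs")
standalone = SketchArray("mem://elsewhere/X", ctx)  # context passed explicitly
exp.add("X", standalone)
print(exp["X"].context == exp["obs"].context)
```

Since the context is immutable and set once per session, there is no call sequence here that leaves two members of the same collection reading or writing at different timestamps.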

Thus I don't think open(timestamp)/close are needed to achieve the desired consistency properties, and the model is simpler without them. There's still the other use case for open/close, to keep handles/metadata open for a series of operations. But that would be an optional performance enhancement only.

bkmartinjr commented 1 year ago

Thanks for the summary. A couple of minor thoughts:

Use case 1 - we have the same requirement for other SOMACollections, not just SOMAExperiment. For example, the cell census needs to be able to ensure a SOMAExperiment and some other random DataFrames are consistently read. So I would prefer that we generalize it to "SOMAExperiment or other multi-level SOMACollection".
I think we can weaken case 2 above, slightly - the word atomic seems more than required. What we really need is ability to create or update the contents of a SOMACollection/SOMAExperiment, such that use case 1 is true (ie, the collection & collection contents are consistent for all readers).

I don't think the above points substantially affect your proposal. Assuming that is true, I'm good with the simplified proposal.

johnkerl commented 1 year ago

@mlin Unless I'm missing something I don't believe we stabilized on whether there would be an explicit exp.open() for reads _and_ writes, or an implicit open for read at constructor time.

There are performance benefits to be had for the latter, and I believe it would be preferable to get explicit exp.open() (as a required syntax) into the API for 1.0 rather than introducing it later.

mlin commented 1 year ago

@johnkerl https://github.com/single-cell-data/TileDB-SOMA/pull/730 is what open/close would look like purely for the performance-enhancing reuse of handle across multiple read or write ops (excluding anything to do with timestamps, as if we were using my simplified idea to handle those)

gspowley commented 1 year ago

Note, the C++ SOMAReader has its own copy of the array handle; it does not share the Python array handle. The SOMAReader opens the array at constructor time and holds the array open. So, we need to keep one SOMAReader object instantiated per array to hold that array open for reads.

While holding the array open, we need to reset the state of the SOMAReader object before each new query.

An example is shown in this test: https://github.com/single-cell-data/TileDB-SOMA/blob/main/libtiledbsoma/test/test_soma_reader.py#L257-L284

Shelnutt2 commented 1 year ago

This has been implemented, so we are going to close this Issue. A new one has been opened for documenting the design.
