Open johnkerl opened 1 year ago
Here's some draft copy for the PIT consistency features. I'm not sure exactly where this should live; it's higher-level guidance touching on multiple individual fields/methods (so doesn't belong in a docstring), but it's also TileDB-specific (so doesn't belong in the SOMA/somacore prose either). Ideas welcome @thetorpedodog @johnkerl @bkmartinjr @ebezzi
TileDB-SOMA uses TileDB’s Time Traveling features to open SOMA objects at a certain timestamp, ensuring a consistent point-in-time view of the data even if it’s been updated since or concurrently. The default timestamp is the current time when a `SOMAObject` is opened; and when opening a `SOMACollection` specifically, all its members are also opened at the same timestamp as the top-level collection.
A publisher may specify a certain timestamp at which a given SOMA object URI should be opened; usually this is the time when they finished some batch of write operations, if they anticipate later updating the data in-place. Readers can supply this timestamp to TileDB-SOMA either through the `tiledb_timestamp` argument to the `open()` method (LINK) or in the `timestamp` field of a `SOMATileDBContext` (LINK) object passed to `open()`. The timestamp may be specified either as an integer number of milliseconds since the Unix epoch, or as a language-native date/time object.
Writers can also set these timestamp arguments in the same way when opening TileDB-SOMA objects in write mode, overriding the default write timestamps. This can be useful for testing the point-in-time consistency features, but production writers should typically leave the timestamp arguments unset to let the defaults apply.
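For illustration, the two ways of supplying a read timestamp described above might look roughly like the sketch below. The helper name `open_at_timestamp` is hypothetical, and the exact keyword arguments are assumed from the draft text (`tiledb_timestamp` on `open()`, `timestamp` on `SOMATileDBContext`), so please check them against the current API docs before copying this anywhere. The import is deferred so the module loads even where `tiledbsoma` is not installed.

```python
def open_at_timestamp(uri: str, timestamp_ms: int, use_context: bool = False):
    """Open a SOMA object read-only as of `timestamp_ms` (Unix milliseconds).

    Hypothetical helper; the keyword names are assumptions taken from the
    draft copy, not a statement of the definitive API.
    """
    import tiledbsoma  # deferred import: only needed when actually opening

    if use_context:
        # Option 2: carry the timestamp in a context object passed to open().
        context = tiledbsoma.SOMATileDBContext(timestamp=timestamp_ms)
        return tiledbsoma.open(uri, mode="r", context=context)

    # Option 1: pass the timestamp directly to open().
    return tiledbsoma.open(uri, mode="r", tiledb_timestamp=timestamp_ms)
```

Either path should give the same point-in-time view; the context route is mainly convenient when one context is shared across several `open()` calls.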
@mlin I think the text is awesome!
Re where to put it ... in software docs there is typically API + narrative -- the former autogenned from docstrings and the latter hand-written. This could go in the latter. I'm still working on #1041 & don't have a place for this just yet but I think that's where it should go ...
I like the overall gist a lot. There are a few minor adjustments I think might help:
`open` either reflect its state at the same timestamp, or take effect at the same timestamp. (What I’m trying to say here can definitely be explained better but basically amounts to just expanding on what is already there.)
Altogether, this would make the overall structure of this little part be, roughly:
TileDB-SOMA uses TileDB’s Time Traveling features to open SOMA objects at a certain timestamp. This provides a consistent point-in-time view of the data even if it’s been updated subsequently (or concurrently). The timestamp defaults to the current time when opening a `SOMAObject`; and when opening a `SOMACollection`, all its members are opened at the same timestamp as the initially-opened collection, including all nested collections and objects.
The timestamp can be overridden by setting either the `tiledb_timestamp` argument to the `open()` method (LINK), or the `timestamp` field of a `SOMATileDBContext` (LINK) object passed to `open()`. The value is usually a long integer number of milliseconds since the Unix epoch, but a language-native date/time object may also be supplied and is converted automatically.
Readers may override the timestamp with a value that a dataset publisher has informed them should be used with a given TileDB-SOMA object URI. That's typically the time when the publisher finished some internally-consistent batch of write operations, but anticipates later updating the data in-place. By setting the known timestamp, readers will be unaffected by any subsequent writes.
Writers can also override the default timestamp when opening TileDB-SOMA objects in write mode, conferring the specified timestamp to all data and metadata written to that object and all its collection members, if applicable. This can be useful for testing these point-in-time consistency features, but production writers should typically leave the timestamp fields unset to let the defaults apply.
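To make the intended semantics concrete, here is a toy in-memory model of the behaviour the draft describes. It is not TileDB and makes no claims about how TileDB stores anything; `ToyVersionedObject` and `read_collection` are hypothetical names invented for this sketch. Each write is stamped, a read at timestamp T sees only writes stamped at or before T, and a "collection" read applies the same T to every member.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyVersionedObject:
    """Hypothetical stand-in for an object that supports point-in-time reads."""

    _writes: List[Tuple[int, str, str]] = field(default_factory=list)

    def write(self, timestamp_ms: int, key: str, value: str) -> None:
        # Record the write together with its timestamp; nothing is overwritten.
        self._writes.append((timestamp_ms, key, value))

    def read(self, timestamp_ms: int) -> Dict[str, str]:
        # Replay writes in timestamp order, ignoring any stamped after the read time.
        view: Dict[str, str] = {}
        for ts, key, value in sorted(self._writes, key=lambda w: w[0]):
            if ts <= timestamp_ms:
                view[key] = value
        return view


def read_collection(
    members: Dict[str, ToyVersionedObject], timestamp_ms: int
) -> Dict[str, Dict[str, str]]:
    # Every member is read at the same timestamp as the collection itself.
    return {name: obj.read(timestamp_ms) for name, obj in members.items()}


obs = ToyVersionedObject()
obs.write(1000, "n_cells", "10")   # publisher's first batch
obs.write(2000, "n_cells", "12")   # later in-place update

published = obs.read(1500)         # reader pinned to the published timestamp
latest = obs.read(2500)            # reader using a later timestamp
```

Here `published` still reports the first batch even though a later write exists, which is the property the reading paragraph is promising.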
This is looking really good; I have just a couple other minor notes:
For ease of use, I think we should promote the datetime version as the preferred format for timestamps, and note that it is stored as Unix millis internally (which you can also provide as a raw integer).
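If it helps the discussion, converting between the two representations is short in Python. This is a sketch using only the standard library; the helper names `to_unix_millis` and `from_unix_millis` are hypothetical, and rejecting naive datetimes is a choice made for this example rather than a description of what TileDB-SOMA does.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_millis(when: datetime) -> int:
    """Convert a time-zone-aware datetime to integer milliseconds since the Unix epoch."""
    if when.tzinfo is None:
        # A naive datetime has no defined offset, so refuse rather than guess.
        raise ValueError("naive datetime is ambiguous; attach a time zone first")
    # Integer timedelta division avoids floating-point rounding.
    return (when - _EPOCH) // timedelta(milliseconds=1)


def from_unix_millis(millis: int) -> datetime:
    """Convert integer milliseconds since the Unix epoch to a UTC datetime."""
    return _EPOCH + timedelta(milliseconds=millis)
```

Requiring an explicit time zone keeps the integer value unambiguous regardless of where the code runs.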
In the reading paragraph, maybe the mechanism first and then how it can be used, similar to how you’ve done in the writing paragraph:
Readers may override the timestamp when opening an object, providing a view of the object as of that point. For instance, if a publisher has completed an internally-consistent batch of write operations, they can inform users of that dataset of this timestamp. This allows readers to open the dataset as of that timestamp, and see that consistent version of the data without being affected by subsequent changes.
You can probably improve on my copy; it’s the structure that I’m thinking about.
@thetorpedodog I like your reading paragraph.
For datetime vs long, maybe we meet in the middle and just mention the two options neutrally. For Cell Census, I'm pretty sure we'd publish a `timestamp` long alongside `uri` and `s3_region` in the locator JSON, and our `open_soma` wrapper method would just pass through that JSON field. I find that completely straightforward while sending it through a datetime would at least raise a question in my mind about time zones (even if RTFM would quickly reassure me).
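The pass-through could be as small as the sketch below. The field names `uri`, `s3_region`, and `timestamp` come from the comment above; the helper name `open_args_from_locator`, the example URI, and the shape of the returned dictionary are assumptions made for illustration, and `s3_region` handling is deliberately left out.

```python
import json
from typing import Any, Dict


def open_args_from_locator(locator_json: str) -> Dict[str, Any]:
    """Build keyword arguments for an open call from a locator JSON string.

    Hypothetical helper: it only forwards the published fields and does no
    datetime conversion, so no time-zone question arises.
    """
    locator = json.loads(locator_json)
    args: Dict[str, Any] = {"uri": locator["uri"], "mode": "r"}
    timestamp = locator.get("timestamp")
    if timestamp is not None:
        # Pass the published Unix-milliseconds value through unchanged.
        args["tiledb_timestamp"] = int(timestamp)
    return args


# Example locator with a made-up URI and timestamp.
example = '{"uri": "s3://example-bucket/soma", "s3_region": "us-west-2", "timestamp": 1672531200000}'
open_args = open_args_from_locator(example)
```

A wrapper could then hand these through with something like `tiledbsoma.open(**open_args)`, assuming the keyword names match the released API. When the locator has no `timestamp` field, the argument is omitted and the default (current time) applies.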
@johnkerl @thetorpedodog Just following up on this old ticket -- is there a natural place in the tiledbsoma docs for a section like this?
I think so, either in the hand-written section at the top, or, work it into a class/method docstring if you prefer
Or make a new section above/below "Tutorials" -- cc @ebezzi https://tiledbsoma.readthedocs.io/en/latest/tutorials.html
Follow-on work from #540