single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License
90 stars 25 forks

[Feature request] User-facing documentation for point-in-time consistency #924

Open johnkerl opened 1 year ago

johnkerl commented 1 year ago

Follow-on work from #540

mlin commented 1 year ago

Here's some draft copy for the PIT consistency features. I'm not sure exactly where this should live; it's higher-level guidance touching on multiple individual fields/methods (so doesn't belong in a docstring), but it's also TileDB-specific (so doesn't belong in the SOMA/somacore prose either). Ideas welcome @thetorpedodog @johnkerl @bkmartinjr @ebezzi


TileDB-SOMA uses TileDB’s Time Traveling features to open SOMA objects at a certain timestamp, ensuring a consistent point-in-time view of the data even if it’s been updated since or concurrently. The default timestamp is the current time when a SOMAObject is opened; and when opening a SOMACollection specifically, all its members are also opened at the same timestamp as the top-level collection.

A publisher may specify a timestamp at which a given SOMA object URI should be opened, usually the time when they finished some batch of write operations and after which they anticipate updating the data in place. Readers can supply this timestamp to TileDB-SOMA either through the tiledb_timestamp argument to the open() method (LINK) or in the timestamp field of a SOMATileDBContext (LINK) object passed to open(). The timestamp may be specified either as an integer number of milliseconds since the Unix epoch or as a language-native date/time object.

Writers can also set these timestamp arguments in the same way when opening TileDB-SOMA objects in write mode, overriding the default write timestamps. This can be useful for testing the point-in-time consistency features, but production writers should typically leave the timestamp arguments unset to let the defaults apply.

johnkerl commented 1 year ago

@mlin I think the text is awesome!

Re where to put it ... in software docs there is typically API + narrative -- the former autogenned from docstrings and the latter hand-written. This could go in the latter. I'm still working on #1041 & don't have a place for this just yet but I think that's where it should go ...

thetorpedodog commented 1 year ago

I like the overall gist a lot. There are a few minor adjustments I think might help:

Altogether, this would make the overall structure of this little part be, roughly:

  1. What a timestamp is and how to set it.
  2. What a reader can do with timestamps, and then why they might want to.
  3. What a writer can do with timestamps, and then why.

mlin commented 1 year ago

TileDB-SOMA uses TileDB’s Time Traveling features to open SOMA objects at a certain timestamp. This provides a consistent point-in-time view of the data even if it’s been updated subsequently (or concurrently). The timestamp defaults to the current time when opening a SOMAObject; and when opening a SOMACollection, all its members are opened at the same timestamp as the initially-opened collection, including all nested collections and objects.

The timestamp can be overridden by setting either the tiledb_timestamp argument to the open() method (LINK), or the timestamp field of a SOMATileDBContext (LINK) object passed to open(). The value is usually a long integer giving the number of milliseconds since the Unix epoch, but a language-native date/time object may also be converted automatically.
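
To make the two value forms concrete, here is a minimal sketch of the datetime-to-milliseconds conversion, followed by a hedged example of passing the result to open(). The helper names to_epoch_ms and open_at are hypothetical, rejecting naive datetimes is a choice made for this sketch rather than documented TileDB-SOMA behavior, and the exact keyword signature of tiledbsoma.open is assumed from the description above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(when: datetime) -> int:
    """Convert a timezone-aware datetime to integer milliseconds since the Unix epoch.

    Naive datetimes are rejected (an assumption of this sketch) so that the
    time zone is never guessed.
    """
    if when.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (when - EPOCH) // timedelta(milliseconds=1)


def open_at(uri: str, when: datetime):
    """Hypothetical helper: open a SOMA object for reading as of `when`."""
    # Imported lazily so the conversion above can be used without tiledbsoma installed.
    import tiledbsoma

    # Assumes open() accepts tiledb_timestamp as a keyword argument, as described above.
    return tiledbsoma.open(uri, mode="r", tiledb_timestamp=to_epoch_ms(when))


one_second_after_epoch = to_epoch_ms(datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc))
```

Doing the conversion explicitly in UTC-aware terms sidesteps any question about which time zone a date/time value is interpreted in.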

Readers may override the timestamp with a value that a dataset publisher has informed them should be used with a given TileDB-SOMA object URI. That's typically the time when the publisher finished some internally consistent batch of write operations and after which they anticipate updating the data in place. By setting the known timestamp, readers will be unaffected by any subsequent writes.

Writers can also override the default timestamp when opening TileDB-SOMA objects in write mode, conferring the specified timestamp to all data and metadata written to that object and all its collection members, if applicable. This can be useful for testing these point-in-time consistency features, but production writers should typically leave the timestamp fields unset to let the defaults apply.
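
The reader and writer behavior described above can be illustrated with a toy model. This is not TileDB or TileDB-SOMA code: TimestampedStore is a hypothetical in-memory stand-in that keeps every write tagged with its timestamp and, on read, ignores writes newer than the requested timestamp. All keys and values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TimestampedStore:
    """Toy append-only store: every write is kept, tagged with its timestamp."""

    writes: list = field(default_factory=list)  # entries are (timestamp_ms, key, value)

    def write(self, timestamp_ms: int, key: str, value) -> None:
        self.writes.append((timestamp_ms, key, value))

    def read(self, timestamp_ms: int) -> dict:
        """Return the view as of timestamp_ms; later writes are invisible."""
        view = {}
        # Apply writes oldest-first so that newer values win.
        for ts, key, value in sorted(self.writes, key=lambda w: w[0]):
            if ts <= timestamp_ms:
                view[key] = value
        return view


store = TimestampedStore()

# The publisher finishes an internally consistent batch and announces its timestamp.
store.write(1000, "n_obs", 100)
store.write(1000, "release", "v1")
published_at = 1000

# Later, the publisher updates the data in place.
store.write(2000, "n_obs", 250)
store.write(2000, "release", "v2")

pinned = store.read(published_at)  # reader pinned to the announced timestamp
latest = store.read(3000)          # reader using a later "current time"
```

In this model the pinned reader still sees the first batch in full, while a reader opening at a later time sees the update.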

thetorpedodog commented 1 year ago

This is looking really good; I have just a couple other minor notes:

mlin commented 1 year ago

@thetorpedodog I like your reading paragraph.

For datetime vs long, maybe we meet in the middle and just mention the two options neutrally. For Cell Census, I'm pretty sure we'd publish a timestamp long alongside uri and s3_region in the locator JSON, and our open_soma wrapper method would just pass through that JSON field. I find that completely straightforward while sending it through a datetime would at least raise a question in my mind about time zones (even if RTFM would quickly reassure me).
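
A rough sketch of that pass-through, assuming a locator shaped like the one described here. The field names uri, s3_region, and timestamp come from this comment; the example values, the helper name open_kwargs_from_locator, and the idea of building keyword arguments for open() are hypothetical, and handling of s3_region is left out because how it is applied is not specified in this thread.

```python
import json

# Hypothetical locator JSON; every value here is made up for illustration.
locator_json = """
{
  "uri": "s3://example-bucket/soma/example-release",
  "s3_region": "us-west-2",
  "timestamp": 1684108800000
}
"""


def open_kwargs_from_locator(text: str) -> dict:
    """Build keyword arguments for an open() call from a locator JSON string."""
    locator = json.loads(text)
    kwargs = {"uri": locator["uri"], "mode": "r"}
    ts = locator.get("timestamp")
    if ts is not None:
        # Pass the integer through untouched; reject anything that is not an integer.
        if isinstance(ts, bool) or not isinstance(ts, int):
            raise TypeError("timestamp must be integer milliseconds since the Unix epoch")
        kwargs["tiledb_timestamp"] = ts
    return kwargs


kwargs = open_kwargs_from_locator(locator_json)
```

Because the timestamp travels as a plain integer from the JSON field to the keyword argument, no time zone interpretation happens anywhere along the way.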

mlin commented 1 year ago

@johnkerl @thetorpedodog Just following up on this old ticket -- is there a natural place in the tiledbsoma docs for a section like this?

johnkerl commented 1 year ago

I think so, either in the hand-written section at the top, or, work it into a class/method docstring if you prefer

johnkerl commented 1 year ago

Or make a new section above/below "Tutorials" -- cc @ebezzi https://tiledbsoma.readthedocs.io/en/latest/tutorials.html