Conventions for keeping track of creation- and modification-dates

JobLeonard commented 6 years ago

An unfortunate aspect of the HDF5 format is that opening a file makes the operating system treat it as modified, even if nothing changed and even if it is opened in read-only mode. On Windows this results in a changed modification date, and on Linux and OSX there is no distinction between modification and creation dates to begin with.

It would be useful to be able to keep track of real changes to loom files. This could be used to trigger automatic updates in various workflows, for example. Different levels of granularity about what was changed would also be useful here. A shared convention would help too, so that modification-detecting scripts from different groups work with each other's files without much trouble. If the loompy library itself kept track of and updated these modification-tracking attributes, things would be even easier, since people would not have to think about it.

To give a concrete example: to serve data from a loom file to a website, a loom-viewer server must extract the data from the loom file and convert it to JSON. This is a relatively slow process, and on top of that h5py does not like it when an HDF5 file is opened by multiple processes (even in read-only mode). So to mitigate this issue, whenever JSON data is generated it is also cached as a zipped static file. The next time someone requests that data, the static file is served instead of repeating the whole process.
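
As a rough illustration of the caching step described above, the sketch below serves a gzipped JSON file if one exists and otherwise builds it once. The names `cached_json` and `build` are hypothetical and not part of loom-viewer, whose actual cache layout may differ.

```python
import gzip
import json
import os
import tempfile


def cached_json(cache_path, build):
    """Return JSON text from a gzipped cache, building it on a miss.

    `build` is a hypothetical callable that does the slow extraction
    (for example, reading from the loom file) and returns a
    JSON-serialisable object.
    """
    if os.path.exists(cache_path):
        # Cache hit: skip the slow extraction entirely.
        with gzip.open(cache_path, "rt", encoding="utf-8") as f:
            return f.read()
    text = json.dumps(build())
    with gzip.open(cache_path, "wt", encoding="utf-8") as f:
        f.write(text)
    return text


# Usage: the second call is served from the cache, so the build runs once.
calls = []


def slow_build():
    calls.append(1)
    return {"cells": 3, "genes": 2}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.json.gz")
    first = cached_json(path, slow_build)
    second = cached_json(path, slow_build)
```

Note that nothing in this sketch notices when the underlying loom file changes, so a stale cache would keep being served.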

The problems start if the data in the loom file is modified (for example, when a column attribute is added to the loom file). At this point, existing JSON files that are outdated have to be replaced. It is currently not possible to detect automatically when to do this; it has to be done manually by whoever is modifying the loom file.

One way around this would be to have a global attribute, or multiple attributes, that are used to keep track of real file changes. Being able to distinguish different kinds of modifications would be nice too. For the loom-viewer the following level of precision is enough:

file metadata,
attributes (global, row and column),
the data matrix as a whole.

But perhaps other people have a use for more fine-grained checks (detecting which rows were changed, for example).

I would like to hear the thoughts of others on this, and come up with a shared proposal for how to handle this.

slinnarsson commented 6 years ago

We could have an HDF5 attribute on each HDF5 dataset that gives its last modification timestamp. This would apply to every Loom row and column attribute, every layer and every graph (all of which are HDF5 datasets).

The problem with this approach (or any similar idea) is how it will be enforced. We could enforce it in loompy, but we cannot enforce it on the file format level. In other words, every implementation of Loom would need to accurately write the timestamps every time anything is modified.

But maybe that's ok. The worst-case scenario is that updates are not recorded, which is what we have today. Loom-viewer would need a mechanism to manually flush the cache.

JobLeonard commented 6 years ago

The loom-viewer already has such a mechanism: the tiling/expansion commands accept one of two flags, --clean (or -C) to remove the selected caches, and --truncate (or -t) to overwrite existing caches, so there is always a fall-back.

Actually, do we need timestamps? My main need is keeping track of modifications; something as simple as an integer counter that increases by one every time a file gets modified would be sufficient for that too (if we use an unsigned 64-bit integer and record 1000 modifications per second, it would still take roughly 600 million years to overflow).
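
The overflow estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the helper name `years_to_overflow` is made up for illustration, and a year is taken as 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # about 3.16e7 seconds
MODIFICATIONS_PER_SECOND = 1000


def years_to_overflow(bits):
    """Years until a counter with `bits` value bits wraps at the given rate."""
    return 2 ** bits / MODIFICATIONS_PER_SECOND / SECONDS_PER_YEAR


unsigned_years = years_to_overflow(64)  # unsigned 64-bit counter
signed_years = years_to_overflow(63)    # signed 64-bit counter (63 value bits)
print(f"unsigned: {unsigned_years:.3g} years, signed: {signed_years:.3g} years")
```

An unsigned counter lasts about 585 million years at that rate, and a signed one about half that, so either is ample in practice.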

Keeping track of when it was modified is a different problem, although I suppose there could be some use for that too: if a data-set becomes corrupted or is tampered with, time-stamps could be useful to track down when that might have happened (although there is nothing stopping anyone from tampering with the time-stamp either).

If we do go with time-stamps, I propose to use the ISO 8601 standard format (so something like 20180116T184331Z), to ensure consistency between libraries. It also happens to make modification checks as simple as a greater-than string comparison, as long as both timestamps use exactly the same format.
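
A small sketch of the string-comparison idea, using only the standard library. The helper `compact_timestamp` is hypothetical; it simply formats whole-second UTC timestamps in the compact style shown above. The last line illustrates a caveat: comparing strings with and without fractional seconds can give the wrong order.

```python
from datetime import datetime, timezone


def compact_timestamp(moment):
    """Format an aware datetime as compact ISO 8601 in UTC, whole seconds."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


earlier = compact_timestamp(datetime(2018, 1, 16, 18, 43, 31, tzinfo=timezone.utc))
later = compact_timestamp(datetime(2018, 1, 24, 10, 4, 36, tzinfo=timezone.utc))

# Same format on both sides: lexicographic order equals chronological order.
is_newer = later > earlier

# Caveat: mixing precisions breaks the ordering, because "." sorts before "Z".
# The first string is 0.901 s later in time, yet compares as smaller.
mixed_is_misordered = "20180124T100436.901000Z" < "20180124T100436Z"
```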

slinnarsson commented 6 years ago

I implemented modification timestamps as follows:

In the loom file itself

The HDF5 attribute last_modified is set to an ISO8601 timestamp in the UTC timezone in the compact format (e.g. 20180124T100436.901000Z).

The last_modified HDF5 attribute is set on:

/ (the root of the file)
/matrix
/layers/{name}
/row_edges
/row_edges/{name}
/col_edges
/col_edges/{name}
/row_attrs
/row_attrs/{name}
/col_attrs
/col_attrs/{name}

The modification timestamp at any level indicates the most recent modification time for any item below it in the HDF5 hierarchy.
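
To make the "most recent modification below this level" rule concrete, here is a minimal sketch that models the hierarchy with a plain dict keyed by HDF5 path. It is not loompy's implementation; `touch` and `stamps` are hypothetical names, and real code would write HDF5 attributes rather than dict entries.

```python
def touch(stamps, path, timestamp):
    """Record `timestamp` on `path` and on every ancestor up to the root.

    `stamps` is a plain dict standing in for the last_modified attributes.
    """
    node = path
    while True:
        # Never move a timestamp backwards.
        stamps[node] = max(stamps.get(node, ""), timestamp)
        if node == "/":
            break
        # "/col_attrs/Age" -> "/col_attrs" -> "/"
        node = node.rsplit("/", 1)[0] or "/"


stamps = {}
touch(stamps, "/col_attrs/Age", "20180124T100436.901000Z")
touch(stamps, "/matrix", "20180125T090000.000000Z")
```

After these two calls, /col_attrs carries the timestamp of the Age change, while the root carries the later timestamp of the /matrix change.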

In loompy

ds.last_modified(): Modification timestamp for whole file. Will timestamp the file if it doesn't have a timestamp already.

ds.layers.last_modified(): Timestamp for layers

ds.layers.last_modified(name): Timestamp for specific layer

ds.col_attrs.last_modified(): Timestamp for column attributes

ds.col_attrs.last_modified(name): Timestamp for specific column attribute

And so on for row attrs and graphs.

Finally, you can get a changeset relative to a given timestamp, like so:

ds.get_changes_since(timestamp): returns a dictionary of layers, attributes and graphs that have been modified since the given timestamp. For example:

import loompy

with loompy.connect("/Users/sten/build_20171205_bak/L5_All.loom") as ds:
    # List everything modified after the given compact ISO 8601 timestamp
    print(ds.get_changes_since("20180124T100436.901000Z"))

This prints:

{'row_graphs': [], 'col_graphs': [], 'row_attrs': ['Accession', 'Gene', '_LogCV', '_LogMean', '_Selected', '_Total', '_Valid'], 'col_attrs': ['Age', 'Bucket', 'CellID', 'Class', 'ClassProbability_Astrocyte', 'ClassProbability_Astrocyte,Immune', 'ClassProbability_Astrocyte,Neurons', 'ClassProbability_Astrocyte,Oligos', 'ClassProbability_Astrocyte,Vascular', 'ClassProbability_Bergmann-glia', 'ClassProbability_Blood', 'ClassProbability_Blood,Vascular', 'ClassProbability_Enteric-glia', 'ClassProbability_Enteric-glia,Cycling', 'ClassProbability_Ependymal', 'ClassProbability_Ex-Neurons', 'ClassProbability_Ex-Vascular', 'ClassProbability_Immune', 'ClassProbability_Immune,Neurons', 'ClassProbability_Immune,Oligos', 'ClassProbability_Neurons', 'ClassProbability_Neurons,Cycling', 'ClassProbability_Neurons,Oligos', 'ClassProbability_Neurons,Satellite-glia', 'ClassProbability_Neurons,Vascular', 'ClassProbability_OEC', 'ClassProbability_Oligos', 'ClassProbability_Oligos,Cycling', 'ClassProbability_Oligos,Vascular', 'ClassProbability_Satellite-glia', 'ClassProbability_Satellite-glia,Cycling', 'ClassProbability_Satellite-glia,Schwann', 'ClassProbability_Schwann', 'ClassProbability_Ttr', 'ClassProbability_Vascular', 'ClusterName', 'Clusters', 'Comment', 'Description', 'Developmental_compartment', 'LeafOrder', 'Location_based_on', 'MitoRiboRatio', 'Neurotransmitter', 'OriginalClusters', 'Outliers', 'Probable_location', 'Region', 'SampleID', 'Sex', 'Subclass', 'TaxonomyRank1', 'TaxonomyRank2', 'TaxonomyRank3', 'TaxonomyRank4', 'TaxonomySymbol', 'Taxonomy_group', 'Tissue', '_NGenes', '_Total', '_Valid', '_X', '_Y'], 'layers': ['']}
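
A consumer such as the loom-viewer could turn a dictionary of this shape into a decision about which caches to rebuild. The sketch below is one possible approach; the function `stale_cache_groups` and the group names are hypothetical, and only the dictionary keys come from the output above.

```python
def stale_cache_groups(changes):
    """Map a changeset dict to the coarse cache groups that need rebuilding.

    The group names ("attributes", "matrix", "graphs") are made up for
    this example; the keys match the changeset shown in the thread.
    """
    stale = set()
    if changes.get("row_attrs") or changes.get("col_attrs"):
        stale.add("attributes")
    if changes.get("layers"):
        # The main matrix shows up as the layer named ''.
        stale.add("matrix")
    if changes.get("row_graphs") or changes.get("col_graphs"):
        stale.add("graphs")
    return stale


example = {
    "row_graphs": [], "col_graphs": [],
    "row_attrs": ["Gene"], "col_attrs": ["Age", "CellID"],
    "layers": [""],
}
groups = stale_cache_groups(example)
```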

JobLeonard commented 6 years ago

Fantastic! I will continue working on making the loom-viewer auto-update caches after a few more pressing bugs are fixed!

slinnarsson commented 6 years ago

Also, I made a function timestamp() in loompy which generates the timestamp in the correct format. Best to use this if you plan on generating dates for comparison with last_modified().
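
For code that cannot call loompy directly, a timestamp in the same compact format can be produced with the standard library. This is a sketch that assumes the format is exactly the one shown earlier (20180124T100436.901000Z); `compact_utc_timestamp` is a hypothetical stand-in, not loompy's own function.

```python
from datetime import datetime, timezone

COMPACT_FORMAT = "%Y%m%dT%H%M%S.%fZ"


def compact_utc_timestamp(moment=None):
    """Return a compact ISO 8601 UTC timestamp with microseconds.

    Pass an aware datetime, or nothing to use the current time.
    """
    if moment is None:
        moment = datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime(COMPACT_FORMAT)


fixed = compact_utc_timestamp(
    datetime(2018, 1, 24, 10, 4, 36, 901000, tzinfo=timezone.utc)
)
now = compact_utc_timestamp()
```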

JobLeonard commented 6 years ago

Great, that means I can strip out the old code relying on the filesystem.

JobLeonard commented 6 years ago

I noticed that when a timestamp is missing, the current time is returned instead. From the docstring:

Return a compact ISO8601 timestamp (UTC timezone) indicating when the file was last modified

Note: if the layer does not contain a timestamp, and the mode is 'r+', a new timestamp will be set and returned. Otherwise, the current time in UTC will be returned.

I propose to modify this to instead return 19700101T000000Z (in other words, the start of Unix Time) if timestamps are missing and the file is opened in read-only mode. Two reasons for this:

Due to how last_modified is self-initialising, the only context in which the timestamp will remain missing is when the loom file is consistently opened in read-only mode. This implies that the content of the file stays unmodified, removing the need to update the cache.
Returning the current UTC time does not allow us to distinguish missing timestamps from very recent ones (calling timestamp again would return a later timestamp, after all). By comparison, the start of Unix Time is a clear marker for missing time data, allowing us to take special action if desired (we can always decide to fall back to a current timestamp if needed).
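
A sketch of how a cache check could treat the proposed marker. The helper `cache_is_stale` is hypothetical, and it assumes that a missing timestamp really does mean the file was only ever opened read-only.

```python
MISSING = "19700101T000000Z"  # the proposed "no timestamp recorded" marker


def cache_is_stale(file_stamp, cache_stamp):
    """Decide whether a cache built at `cache_stamp` must be regenerated.

    Both arguments are compact ISO 8601 strings; `cache_stamp` is None
    when no cache exists yet.
    """
    if cache_stamp is None:
        return True
    if file_stamp == MISSING:
        # No timestamp was ever written, so the file is assumed to have
        # been opened read-only only, and an existing cache is kept.
        return False
    return file_stamp > cache_stamp
```

With the current behaviour (returning the current time), the last comparison would report every cache as stale, since "now" is always later than the time the cache was built.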

JobLeonard commented 6 years ago

Reopening to ask if @mojaveazure and @falexwolf are on board with this convention (including the Unix Time fallback), or whether they see edge cases where this would be problematic.

The main reason for asking is that if, for whatever reason, a loom file is worked on through both loompy and one of the other packages, modifications made by the other libraries will not be tracked unless they follow the same convention, meaning we cannot auto-update the cache for the loom-viewer in that case. This would lead to confused and frustrated biologists wondering why the offline viewer does not reflect the changes they made.

linnarsson-lab / loompy

Conventions for keeping track of creation- and modification-dates #26
