Cache attributes

axtimwalde commented 3 years ago

Filesystem abstraction and meta data caching

bogovicj commented 1 year ago

There are currently two nearly identical methods in N5KeyValueReader

public JsonElement getAttributes(String)
protected JsonElement readAttributes(String)
- we should get rid of this one

cmhulbert commented 1 year ago

There are currently two nearly identical methods in N5KeyValueReader

public JsonElement getAttributes(String)

protected JsonElement readAttributes(String)

we should get rid of this one

I agree, I removed that one.

cmhulbert commented 1 year ago

Working with the KeyValueAccess API, I wanted to suggest some changes. There are quite a few methods that are fairly specific to file systems, and don't abstract well. For example:

    public boolean isDirectory(String normalPath);
    public boolean isFile(String normalPath);
    public String[] listDirectories(final String normalPath) throws IOException;
    public void createDirectories(final String normalPath) throws IOException;

I think it would be more general to remove isFile and to rename the directory-related methods to isGroup, listGroups, and createGroups. These would still be directories for a file-system-based N5, but could be non-directory-backed keys for other access backends (S3, H5, etc.).

cmhulbert commented 1 year ago

Per a chat with @axtimwalde and @bogovicj, there are reasons to keep the directory-related methods: they emulate some of the file-system primitives that are necessary to abstract the backend storage as a KeyValueAccess.
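
To illustrate what "emulating file-system primitives" can mean on a backend with a flat key space, here is a minimal in-memory sketch. It is not the N5 KeyValueAccess interface: the class name, the `write` helper, and the convention that keys ending in "/" mark directories are all assumptions made for this example. Only the four method names come from the comment above.

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch, not the N5 KeyValueAccess API: a flat, sorted key space
// (as on an object store) with directory primitives emulated on top of it.
final class FlatKeyValueSketch {

    // Assumed convention: keys ending in "/" mark directories, all other keys are files.
    private final TreeMap<String, byte[]> store = new TreeMap<>();

    private static String dirKey(final String normalPath) {
        return normalPath.isEmpty() || normalPath.endsWith("/") ? normalPath : normalPath + "/";
    }

    boolean isDirectory(final String normalPath) {
        // The root always exists; any other directory needs a marker key.
        return normalPath.isEmpty() || store.containsKey(dirKey(normalPath));
    }

    boolean isFile(final String normalPath) {
        return !normalPath.endsWith("/") && store.containsKey(normalPath);
    }

    void createDirectories(final String normalPath) {
        // Create a marker for every ancestor, similar to "mkdir -p".
        final StringBuilder prefix = new StringBuilder();
        for (final String part : normalPath.split("/")) {
            if (part.isEmpty()) continue;
            prefix.append(part).append('/');
            store.put(prefix.toString(), new byte[0]);
        }
    }

    // Hypothetical helper: store a value and make sure its parent directories exist.
    void write(final String normalPath, final byte[] data) {
        final int lastSlash = normalPath.lastIndexOf('/');
        if (lastSlash > 0) createDirectories(normalPath.substring(0, lastSlash));
        store.put(normalPath, data);
    }

    String[] listDirectories(final String normalPath) {
        final String prefix = dirKey(normalPath);
        final Set<String> children = new TreeSet<>();
        // Keys sharing the prefix are contiguous in sorted order, so scan until it no longer matches.
        for (final String key : store.tailMap(prefix, false).keySet()) {
            if (!key.startsWith(prefix)) break;
            final String rest = key.substring(prefix.length());
            final int slash = rest.indexOf('/');
            // Keep the first path component of every key that lies inside a child directory.
            if (slash > 0) children.add(rest.substring(0, slash));
        }
        return children.toArray(new String[0]);
    }
}
```

With this layout, `isDirectory` and `listDirectories` work on a backend that has no native notion of directories, which is the kind of emulation the directory-named methods allow.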

bogovicj commented 1 year ago

Should this instead be:

if (root == null || !root.isJsonObject()) {
    return null;
}

?

cmhulbert commented 1 year ago

Good catch, I'll fix that!

bogovicj commented 1 year ago

Found an issue with removeAttribute - it seems only to work for the root directory right now. If I add this to the test it fails:

writer.createGroup("foo");
writer.setAttribute("foo", "a", 100);
writer.removeAttribute("foo", "a");
assertNull(writer.getAttribute("foo", "a", Integer.class));

I expect it to be a simple fix. Fixed by 36ba7ae below.

bogovicj commented 1 year ago

assorted thoughts:

- datasetExists currently throws IOException because it calls getDatasetAttributes, which also throws it. Why not have datasetExists catch the exception and return false in that case?
  - the key-value reader implementation probably doesn't need to check whether the dataset attributes exist, and can just let the call fail if they don't (from @cmhulbert)
  - similarly, datasetExists can call getDatasetAttributes directly, without checking for group existence first.
- Consider making N5KeyValueReader and N5KeyValueWriter abstract, with an abstract checkVersion method.
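
A minimal sketch of the first suggestion, assuming a stand-in reader whose attributes live in a map. The class, the `put` helper, and the use of a null value to simulate an unreadable entry are hypothetical. Only the method names datasetExists and getDatasetAttributes come from the discussion.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a reader; not the N5 API.
final class DatasetExistsSketch {

    // Maps dataset path to serialized attributes. A null value simulates a backend read failure.
    private final Map<String, String> attributes = new HashMap<>();

    void put(final String pathName, final String json) {
        attributes.put(pathName, json);
    }

    String getDatasetAttributes(final String pathName) throws IOException {
        if (!attributes.containsKey(pathName)) return null; // nothing stored at this path
        final String json = attributes.get(pathName);
        if (json == null) throw new IOException("cannot read attributes of " + pathName);
        return json;
    }

    // No "throws IOException" and no separate group-existence check:
    // a failed read is reported as "does not exist".
    boolean datasetExists(final String pathName) {
        try {
            return getDatasetAttributes(pathName) != null;
        } catch (final IOException e) {
            return false;
        }
    }
}
```

The trade-off is that a caller can no longer tell a missing dataset from a backend failure, so this only fits if that distinction does not matter to callers of datasetExists.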

TODOs

[ ] need unsafe constructors
[ ] need writer flag to not create the container if it doesn't exist
[ ] relevant changes in N5Factory upstream

bogovicj commented 1 year ago

In discussion with

Separate the Gson and KeyValueReader elements. Make an interface called CachedGsonN5KeyValueReader, or similar, that N5KeyValueReader implements.

Edit: we decided against this. For one, 550fe32 replaces the GsonN5Reader/Writer interfaces with GsonUtils.

bogovicj commented 1 year ago

After talking with @cmhulbert: we may want to use N5URL.normalizePath in the N5Readers / Writers, and keyValueAccess.normalize (with leading slashes removed) for paths within the container that are used by the cache.

Edit: see https://github.com/saalfeldlab/n5/tree/cache-normalization

bogovicj commented 1 year ago

Currently we're seeing this behavior here

N5Writer n5 = ...
n5.setAttribute(groupName, "key", 0.1);
n5.getAttribute(groupName, "key", Integer.class); // returns 0 (i.e. the fraction is dropped rather than null being returned)

The issue seems to be deep in Gson https://github.com/google/gson/blob/29c93895bbcaed02178abc9e3d47b73878aaca73/gson/src/main/java/com/google/gson/internal/LazilyParsedNumber.java#L42
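
The behavior can be reproduced without Gson. The sketch below assumes that the linked Gson method falls back to a `BigDecimal` conversion when the text is not a plain integer. That is my reading of the linked class, not something stated in this thread. `lenientInt` mimics that fallback and `strictInt` is a hypothetical stricter alternative. Both helper names are made up for this example.

```java
import java.math.BigDecimal;

// Illustrative helpers only; neither is part of Gson or N5.
final class NumberCoercionSketch {

    // Lenient conversion: a non-integer string is parsed as a decimal and its fraction is dropped.
    static int lenientInt(final String value) {
        try {
            return Integer.parseInt(value);
        } catch (final NumberFormatException e) {
            return new BigDecimal(value).intValue();
        }
    }

    // Strict conversion: returns null unless the value is exactly representable as an int.
    static Integer strictInt(final String value) {
        try {
            return new BigDecimal(value).intValueExact();
        } catch (final ArithmeticException | NumberFormatException e) {
            return null;
        }
    }
}
```

Here `lenientInt("0.1")` yields 0, matching the observation above, while `strictInt("0.1")` yields null and `strictInt("100")` yields 100.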

bogovicj commented 1 year ago

@cmhulbert, I'll want to ask you about this when you have time: https://github.com/saalfeldlab/n5/pull/77/commits/7c37cf9f6b6f05df143eb00ae58f97dc04e3dd25

bogovicj commented 1 year ago

Re the discussion on zulip,

We could consider adding a protected constructor to N5KeyValueReader that skips the version check. This could let us avoid sneaking in the static initializeContainer in the N5KeyValueWriter constructor before its superclass (N5KeyValueReader) constructor is called. See here:

https://github.com/saalfeldlab/n5/tree/kv-reader-cons
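
A rough sketch of the constructor layout described above, using stand-in classes rather than the real N5KeyValueReader / N5KeyValueWriter. The map-backed container, the "version" key, and the placeholder version string are assumptions for illustration only.

```java
import java.util.Map;

// Hypothetical classes illustrating the constructor layout; not the N5 sources.
class ReaderSketch {

    protected final Map<String, String> container;

    // Public entry point: always checks the version.
    ReaderSketch(final Map<String, String> container) {
        this(container, true);
    }

    // Subclasses may pass checkVersion = false and run the check later themselves.
    protected ReaderSketch(final Map<String, String> container, final boolean checkVersion) {
        this.container = container;
        if (checkVersion) checkVersion();
    }

    protected final void checkVersion() {
        if (container.get("version") == null)
            throw new IllegalStateException("no version attribute: not an initialized container");
    }
}

class WriterSketch extends ReaderSketch {

    WriterSketch(final Map<String, String> container) {
        super(container, false);                          // the container may not exist yet
        container.putIfAbsent("version", "0.0.0-sketch"); // initialize it as an ordinary statement
        checkVersion();                                   // now the check can run safely
    }
}
```

Because the writer initializes the container after `super(...)` returns, no static helper has to be evaluated inside the `super(...)` argument list.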

bogovicj commented 1 year ago

@axtimwalde , branch of interest: https://github.com/saalfeldlab/n5/tree/wip/KeyValueInterface

That branch also includes use of a protected constructor for the N5KeyValueReader and no longer needs a static initializeContainer method (details above and on zulip)

@cmhulbert, looking at the cache again, I think some (hopefully small) edits are needed for writing. Specifically, what happens now on group / dataset creation (I think) is:

Writer performs the required writes through the key-value access
Writer triggers a cache update
Cache update triggers a reading of the backend
- not necessary if the writer can update the cache directly
- possibly unsafe if the writing failed, but better to detect that
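
The difference between the two flows can be sketched with a toy cache. Everything here is hypothetical (the class, method names, and the read counter). It only contrasts "write, then reload from the backend" with "write, then update the cache directly".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical write-through cache sketch; names are illustrative, not the N5 cache classes.
final class WriteThroughCacheSketch {

    private final Map<String, String> backend = new HashMap<>(); // stands in for the key-value access
    private final Map<String, String> cache = new HashMap<>();
    int backendReads = 0; // counts read round trips to the backend

    private String readBackend(final String key) {
        backendReads++;
        return backend.get(key);
    }

    // Flow described above: write, then refresh the cache by reading the backend again.
    void writeThenReload(final String key, final String value) {
        backend.put(key, value);
        cache.put(key, readBackend(key));
    }

    // Alternative: write, then store the already-known value in the cache.
    void writeThrough(final String key, final String value) {
        backend.put(key, value); // if this threw, the cache update below would never run
        cache.put(key, value);
    }

    // Reads are served from the cache and only fall back to the backend on a miss.
    String get(final String key) {
        return cache.computeIfAbsent(key, this::readBackend);
    }
}
```

In this sketch the direct update saves one backend read per write. It relies on a failed write surfacing as an exception before the cache is touched, which is the failure-detection concern raised above.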

bogovicj commented 1 year ago

Toward making the cache work more nicely with writers: https://github.com/saalfeldlab/n5/commit/239b3d910d7a5fec3d1dcf0f377feb25c8e1c341
Use an interface (N5JsonCachableContainer) instead of functions / consumers for the cache: https://github.com/saalfeldlab/n5/commit/069ceda0a745c1f343499ae3904a7982c66f3155
Remove the temporary classes: https://github.com/saalfeldlab/n5/commit/df285ee43b60bc890fd1937a7c8e4a4666ffca4e

bogovicj commented 1 year ago

I ran this benchmark comparing the current master https://github.com/saalfeldlab/n5/commit/a1fcd2f6b3be7e1e10b08aaeba3912ffb5707d8c to the latest development commit https://github.com/saalfeldlab/n5/commit/d14fe554178b180cea4aeba50013c245c0b39b8b . In summary: they're comparable.

current master

```
1 threads.
1 : raw : 0.118000s
1 : bzip2 : 26.535000s
1 : gzip : 3.202000s
1 : lz4 : 0.211000s
1 : xz : 19.963000s
1 : tif : 0.084000s
1 : hdf5 raw : 0.116000s
1 : hdf5 gzip : 1.875000s
2 threads.
2 : raw : 0.045000s
2 : bzip2 : 15.311000s
2 : gzip : 1.798000s
2 : lz4 : 0.109000s
2 : xz : 10.856000s
2 : tif : 0.039000s
2 : hdf5 raw : 0.035000s
2 : hdf5 gzip : 1.875000s
4 threads.
4 : raw : 0.025000s
4 : bzip2 : 9.451000s
4 : gzip : 1.016000s
4 : lz4 : 0.063000s
4 : xz : 6.578000s
4 : tif : 0.019000s
4 : hdf5 raw : 0.037000s
4 : hdf5 gzip : 1.933000s
8 threads.
8 : raw : 0.032000s
8 : bzip2 : 7.119000s
8 : gzip : 0.667000s
8 : lz4 : 0.041000s
8 : xz : 4.864000s
8 : tif : 0.014000s
8 : hdf5 raw : 0.036000s
8 : hdf5 gzip : 1.950000s
16 threads.
16 : raw : 0.015000s
16 : bzip2 : 6.383000s
16 : gzip : 0.474000s
16 : lz4 : 0.026000s
16 : xz : 4.278000s
16 : tif : 0.011000s
16 : hdf5 raw : 0.041000s
16 : hdf5 gzip : 2.173000s
```

latest development

```
1 threads.
1 : raw : 0.117000s
1 : bzip2 : 26.146000s
1 : gzip : 3.463000s
1 : lz4 : 0.204000s
1 : xz : 19.513000s
1 : tif : 0.140000s
1 : hdf5 raw : 0.113000s
1 : hdf5 gzip : 1.762000s
2 threads.
2 : raw : 0.069000s
2 : bzip2 : 14.401000s
2 : gzip : 1.771000s
2 : lz4 : 0.109000s
2 : xz : 10.648000s
2 : tif : 0.042000s
2 : hdf5 raw : 0.036000s
2 : hdf5 gzip : 1.876000s
4 threads.
4 : raw : 0.027000s
4 : bzip2 : 9.021000s
4 : gzip : 1.004000s
4 : lz4 : 0.062000s
4 : xz : 6.474000s
4 : tif : 0.022000s
4 : hdf5 raw : 0.038000s
4 : hdf5 gzip : 1.898000s
8 threads.
8 : raw : 0.024000s
8 : bzip2 : 7.307000s
8 : gzip : 0.662000s
8 : lz4 : 0.041000s
8 : xz : 4.729000s
8 : tif : 0.016000s
8 : hdf5 raw : 0.040000s
8 : hdf5 gzip : 2.000000s
16 threads.
16 : raw : 0.025000s
16 : bzip2 : 6.529000s
16 : gzip : 0.480000s
16 : lz4 : 0.028000s
16 : xz : 4.268000s
16 : tif : 0.011000s
16 : hdf5 raw : 0.044000s
16 : hdf5 gzip : 2.146000s
```

saalfeldlab / n5

Cache attributes #77
