saalfeldlab / n5

Not HDF5
BSD 2-Clause "Simplified" License

Cache attributes #77

Closed by axtimwalde 1 year ago

axtimwalde commented 3 years ago

Filesystem abstraction and metadata caching

bogovicj commented 1 year ago

There are currently two nearly identical methods in N5KeyValueReader

cmhulbert commented 1 year ago

> There are currently two nearly identical methods in N5KeyValueReader

I agree, I removed that one.

cmhulbert commented 1 year ago

Working with the KeyValueAccess API, I wanted to suggest some changes. There are quite a few methods that are fairly specific to file systems, and don't abstract well. For example:

  public boolean isDirectory(String normalPath);
  public boolean isFile(String normalPath);
  public String[] listDirectories(final String normalPath) throws IOException;
  public void createDirectories(final String normalPath) throws IOException;

I think it would be more general to remove isFile and to rename the directory-related methods to isGroup, listGroups, and createGroups. These will still be directories in the case of a filesystem-based N5, but may be keys that are not backed by directories for other access backends (S3, H5, etc.).

Per a chat with @axtimwalde and @bogovicj, there are reasons to keep the directory-related methods, namely to emulate some of the filesystem primitives that are necessary to abstract the backend storage as a KeyValueAccess.
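
To make the "emulate filesystem primitives" point concrete, here is a minimal sketch of how a flat key-value backend could answer directory-style questions using key prefixes. `PrefixKeyValueStore` and its `write` method are hypothetical names made up for this illustration; this is not the n5 `KeyValueAccess` implementation, and the trailing-slash marker keys are an assumption borrowed from how object stores are commonly used.

```java
import java.util.TreeMap;
import java.util.TreeSet;

/**
 * Hypothetical in-memory key-value store that emulates directory
 * primitives with key prefixes. Illustration only, not n5 code.
 */
class PrefixKeyValueStore {

    // Sorted map from full key (e.g. "a/b/attributes.json") to content.
    private final TreeMap<String, String> objects = new TreeMap<>();

    void write(final String key, final String content) {
        objects.put(key, content);
    }

    /** A path is a "file" if an object is stored under exactly that key. */
    boolean isFile(final String normalPath) {
        return objects.containsKey(normalPath);
    }

    /** A path is a "directory" if at least one key starts with path + "/". */
    boolean isDirectory(final String normalPath) {
        final String prefix = normalPath.isEmpty() ? "" : normalPath + "/";
        final String next = objects.ceilingKey(prefix);
        return next != null && next.startsWith(prefix);
    }

    /** Lists the immediate children of a path that themselves contain keys. */
    String[] listDirectories(final String normalPath) {
        final String prefix = normalPath.isEmpty() ? "" : normalPath + "/";
        final TreeSet<String> children = new TreeSet<>();
        for (final String key : objects.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) break; // sorted keys: prefix range ended
            final String rest = key.substring(prefix.length());
            final int slash = rest.indexOf('/');
            if (slash > 0) children.add(rest.substring(0, slash));
        }
        return children.toArray(new String[0]);
    }

    /** Writes an empty marker key ("a/", "a/b/", ...) for every path level. */
    void createDirectories(final String normalPath) {
        final StringBuilder sb = new StringBuilder();
        for (final String part : normalPath.split("/")) {
            if (part.isEmpty()) continue;
            sb.append(part).append('/');
            objects.putIfAbsent(sb.toString(), "");
        }
    }
}
```

With this shape, a reader or writer written against the directory-style methods would not need to know whether the backend is a real filesystem or a flat key space.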

bogovicj commented 1 year ago

Should this instead be:

if (root == null || !root.isJsonObject()) {
    return null;
}

?

cmhulbert commented 1 year ago

Good catch, I'll fix that!

bogovicj commented 1 year ago

Found an issue with removeAttribute - it seems only to work for the root directory right now. If I add this to the test it fails:

writer.createGroup("foo");
writer.setAttribute("foo", "a", 100);
writer.removeAttribute("foo", "a");
assertNull(writer.getAttribute("foo", "a", Integer.class));

I expect it to be a simple fix. Fixed by 36ba7ae below.

bogovicj commented 1 year ago

assorted thoughts:

TODOs

bogovicj commented 1 year ago

In discussion with

Separate the Gson and KeyValueReader elements: make an interface called CachedGsonN5KeyValueReader, or similar, that N5KeyValueReader implements.

Edit: we decided against this. For one, 550fe32 replaces the GsonN5Reader/Writer interfaces with GsonUtils.

bogovicj commented 1 year ago

In talking with @cmhulbert, it may be that we want to use N5URL.normalizePath in the N5Readers / Writers, and keyValueAccess.normalize (and remove leading slashes) for paths within the container that are used by the cache.

Edit: see https://github.com/saalfeldlab/n5/tree/cache-normalization
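
As a rough illustration of why the cache wants one canonical spelling per path, the sketch below normalizes a container-relative path and strips leading (and trailing) slashes. `CacheKeys.cacheKey` is a hypothetical helper written for this example; it does not reproduce the behavior of `N5URL.normalizePath` or `keyValueAccess.normalize`, which may differ in details.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper, not part of n5: canonical cache keys for group paths. */
class CacheKeys {

    /**
     * Collapses empty segments, "." and "..", and drops leading and
     * trailing slashes, so "/a/b", "a/b/" and "a/./b" share one key.
     */
    static String cacheKey(final String path) {
        final Deque<String> parts = new ArrayDeque<>();
        for (final String part : path.split("/")) {
            if (part.isEmpty() || part.equals(".")) continue; // skip "//" and "."
            if (part.equals("..")) {
                parts.pollLast(); // step up one level; no-op at the root
                continue;
            }
            parts.addLast(part);
        }
        return String.join("/", parts);
    }
}
```

Without a step like this, the same group reached as `/a/b` and `a/b/` would occupy two cache entries that can drift apart after a write.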

bogovicj commented 1 year ago

Currently we're seeing this behavior here

N5Writer n5 = ...
n5.setAttribute(groupName, "key", 0.1);
n5.getAttribute(groupName, "key", Integer.class); // returns 0 (i.e. rounds rather than returning null)

The issue seems to be deep in Gson https://github.com/google/gson/blob/29c93895bbcaed02178abc9e3d47b73878aaca73/gson/src/main/java/com/google/gson/internal/LazilyParsedNumber.java#L42
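
The effect can be reproduced without Gson. The sketch below assumes that the linked conversion ends up in `BigDecimal.intValue()` for a value such as `"0.1"`, which silently discards the fractional part; that reading of the Gson source is my assumption. `StrictNumbers`, `lenientInt`, and `strictInt` are hypothetical names for this illustration and not part of Gson or n5.

```java
import java.math.BigDecimal;

/** Hypothetical helpers contrasting lenient and strict int conversion. */
class StrictNumbers {

    /** Lenient conversion: the fractional part is silently dropped. */
    static int lenientInt(final String value) {
        return new BigDecimal(value).intValue();
    }

    /**
     * Strict conversion: returns the int only if the value is exactly
     * representable as an int, otherwise null.
     */
    static Integer strictInt(final String value) {
        try {
            // intValueExact throws if there is a nonzero fraction or overflow
            return new BigDecimal(value).intValueExact();
        } catch (final ArithmeticException | NumberFormatException e) {
            return null;
        }
    }
}
```

A check along the lines of `strictInt` before handing the value to the caller would be one way to return null for `0.1` requested as `Integer.class`, at the cost of an extra parse.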

bogovicj commented 1 year ago

@cmhulbert will want to ask you about this when you have time https://github.com/saalfeldlab/n5/pull/77/commits/7c37cf9f6b6f05df143eb00ae58f97dc04e3dd25

bogovicj commented 1 year ago

Re the discussion on zulip,

We could consider adding a protected constructor to N5KeyValueReader that skips the version check. This could let us avoid sneaking in the static initializeContainer in the N5KeyValueWriter constructor before its superclass (N5KeyValueReader) constructor is called. See here:

https://github.com/saalfeldlab/n5/tree/kv-reader-cons
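
For reference, the pattern could look roughly like the sketch below. `SketchReader` and `SketchWriter` are hypothetical stand-ins, a plain `Map` plays the role of the container, and the `"version"` key and `"1.0.0"` value are placeholders; none of this is taken from the branch above.

```java
import java.util.Map;

/** Hypothetical stand-in for a reader; not the n5 class. */
class SketchReader {

    protected final Map<String, String> container;

    /** Regular constructor: always performs the version check. */
    SketchReader(final Map<String, String> container) {
        this(container, true);
    }

    /** Protected constructor: lets subclasses skip the version check. */
    protected SketchReader(final Map<String, String> container, final boolean checkVersion) {
        this.container = container;
        if (checkVersion && !container.containsKey("version"))
            throw new IllegalStateException("container has no version attribute");
    }
}

/** Hypothetical stand-in for a writer; not the n5 class. */
class SketchWriter extends SketchReader {

    SketchWriter(final Map<String, String> container) {
        // Skip the check: a freshly created container has no version yet.
        super(container, false);
        // Initialization can now happen after super(), as an ordinary statement.
        container.putIfAbsent("version", "1.0.0");
    }
}
```

The writer no longer needs a static call hidden inside the arguments to `super(...)`, while code that constructs a reader directly still gets the check.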

bogovicj commented 1 year ago

@axtimwalde , branch of interest: https://github.com/saalfeldlab/n5/tree/wip/KeyValueInterface

@cmhulbert, looking at the cache again, I think some (hopefully small) edits are needed for writing. Specifically, what happens now on group / dataset creation (I think) is:

bogovicj commented 1 year ago

I ran this benchmark comparing the current master https://github.com/saalfeldlab/n5/commit/a1fcd2f6b3be7e1e10b08aaeba3912ffb5707d8c to the latest development commit https://github.com/saalfeldlab/n5/commit/d14fe554178b180cea4aeba50013c245c0b39b8b. In summary: they're comparable.

current master

```
1 threads.
1 : raw : 0.118000s
1 : bzip2 : 26.535000s
1 : gzip : 3.202000s
1 : lz4 : 0.211000s
1 : xz : 19.963000s
1 : tif : 0.084000s
1 : hdf5 raw : 0.116000s
1 : hdf5 gzip : 1.875000s
2 threads.
2 : raw : 0.045000s
2 : bzip2 : 15.311000s
2 : gzip : 1.798000s
2 : lz4 : 0.109000s
2 : xz : 10.856000s
2 : tif : 0.039000s
2 : hdf5 raw : 0.035000s
2 : hdf5 gzip : 1.875000s
4 threads.
4 : raw : 0.025000s
4 : bzip2 : 9.451000s
4 : gzip : 1.016000s
4 : lz4 : 0.063000s
4 : xz : 6.578000s
4 : tif : 0.019000s
4 : hdf5 raw : 0.037000s
4 : hdf5 gzip : 1.933000s
8 threads.
8 : raw : 0.032000s
8 : bzip2 : 7.119000s
8 : gzip : 0.667000s
8 : lz4 : 0.041000s
8 : xz : 4.864000s
8 : tif : 0.014000s
8 : hdf5 raw : 0.036000s
8 : hdf5 gzip : 1.950000s
16 threads.
16 : raw : 0.015000s
16 : bzip2 : 6.383000s
16 : gzip : 0.474000s
16 : lz4 : 0.026000s
16 : xz : 4.278000s
16 : tif : 0.011000s
16 : hdf5 raw : 0.041000s
16 : hdf5 gzip : 2.173000s
```

latest development

```
1 threads.
1 : raw : 0.117000s
1 : bzip2 : 26.146000s
1 : gzip : 3.463000s
1 : lz4 : 0.204000s
1 : xz : 19.513000s
1 : tif : 0.140000s
1 : hdf5 raw : 0.113000s
1 : hdf5 gzip : 1.762000s
2 threads.
2 : raw : 0.069000s
2 : bzip2 : 14.401000s
2 : gzip : 1.771000s
2 : lz4 : 0.109000s
2 : xz : 10.648000s
2 : tif : 0.042000s
2 : hdf5 raw : 0.036000s
2 : hdf5 gzip : 1.876000s
4 threads.
4 : raw : 0.027000s
4 : bzip2 : 9.021000s
4 : gzip : 1.004000s
4 : lz4 : 0.062000s
4 : xz : 6.474000s
4 : tif : 0.022000s
4 : hdf5 raw : 0.038000s
4 : hdf5 gzip : 1.898000s
8 threads.
8 : raw : 0.024000s
8 : bzip2 : 7.307000s
8 : gzip : 0.662000s
8 : lz4 : 0.041000s
8 : xz : 4.729000s
8 : tif : 0.016000s
8 : hdf5 raw : 0.040000s
8 : hdf5 gzip : 2.000000s
16 threads.
16 : raw : 0.025000s
16 : bzip2 : 6.529000s
16 : gzip : 0.480000s
16 : lz4 : 0.028000s
16 : xz : 4.268000s
16 : tif : 0.011000s
16 : hdf5 raw : 0.044000s
16 : hdf5 gzip : 2.146000s
```