renaming chunks ? - GitHub Issues

d70-t commented 2 years ago

Hey @thewtex, thanks for starting the shardedstore!

One thing I've been wondering about with sharding is whether there should be a translation layer (or configurable naming layer) for chunk keys somewhere close to the array implementation, and in particular on top of any other store (transformers). However, I don't know if I've been able to communicate that well enough.

If I get it right, the current shardedstore has two different ways of specifying sharding: one is by defining paths (i.e. mounting a store into a sharded store) and one is by defining array shards, and I haven't yet figured out how the latter works. I'm wondering if these two things could be unified by translating the chunk keys/paths in a way that array chunks become subfolders. Then sharding could happen solely through the subfolder mounting approach?

If that translation happened very close to (or even in) the array implementation, an additional benefit might be that the array could eventually schedule writes on a per-shard / per-subfolder basis, without knowing exactly how the sharding will happen below.
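To make the idea concrete, here is a minimal sketch of such a key translation. The function name `chunk_key_to_path` and its parameters are hypothetical and are not part of shardedstore or Zarr; it only assumes that chunk keys are index strings joined by a separator such as `.`.

```python
def chunk_key_to_path(array_path, chunk_key, shard_dims, separator="."):
    """Turn the first `shard_dims` chunk indices into subfolders.

    Hypothetical helper for illustration only.
    """
    indices = chunk_key.split(separator)
    if not 0 <= shard_dims <= len(indices):
        raise ValueError("shard_dims must be between 0 and the number of dimensions")
    # Leading indices become one folder level each.
    parts = [array_path, *indices[:shard_dims]]
    # Remaining indices stay together as the chunk file name.
    rest = separator.join(indices[shard_dims:])
    if rest:
        parts.append(rest)
    return "/".join(parts)


translated = chunk_key_to_path("simulation/coarse/foo", "1.4.2", shard_dims=1)
print(translated)
```

With a translation like this, every chunk whose first index is 1 ends up under `simulation/coarse/foo/1/`, so that prefix could serve as a mount point.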

thewtex commented 2 years ago

Hey @d70-t !

This may be implemented in a similar way to what you are thinking? It may be helpful to run the example in the README and look at the local output of files and folders.

d70-t commented 2 years ago

I just ran the example in the README and had a look at config:

'version': '0.3.0',
 'name': 'ShardedStore',
 'module': 'shardedstore',
 'config': {'args': [{'name': 'DirectoryStore',
    'module': 'zarr.storage',
    'config': {'args': ['/Users/tobi/Documents/code/shardedstore/base.zarr'],
     'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
    'version': '2.11.3'}],
  'kwargs': {'dimension_separator': None,
   'shards': {'people': {'name': 'DirectoryStore',
     'module': 'zarr.storage',
     'config': {'args': ['/Users/tobi/Documents/code/shardedstore/shard1.zarr'],
      'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
     'version': '2.11.3'},
    'species': {'name': 'DirectoryStore',
     'module': 'zarr.storage',
     'config': {'args': ['/Users/tobi/Documents/code/shardedstore/shard2.zarr'],
      'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
     'version': '2.11.3'}},
   'array_shard_dims': {'simulation/coarse/foo': 1, 'simulation/fine/foo': 1},
   'array_shards': {'simulation/coarse/foo': {'0': {'name': 'DirectoryStore',
      'module': 'zarr.storage',
      'config': {'args': ['/Users/tobi/Documents/code/shardedstore/array_shards1/0.zarr'],
       'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
      'version': '2.11.3'},
     '1': {'name': 'DirectoryStore',
      'module': 'zarr.storage',
      'config': {'args': ['/Users/tobi/Documents/code/shardedstore/array_shards1/1.zarr'],
       'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
      'version': '2.11.3'}},
...

Maybe what I'm wondering is: why are array_shards and array_shard_dims needed, and why can't array_shards just be part of shards? E.g.:

'shards': {
    'simulation/coarse/foo/0': ...,
    'simulation/coarse/foo/1': ...,
}
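As a rough sketch of what that flattening could look like, the snippet below merges a nested array_shards mapping into a flat shards mapping keyed by mount path. The function `flatten_array_shards` is hypothetical, and the store configs are replaced by plain path strings for brevity.

```python
def flatten_array_shards(shards, array_shards):
    """Merge nested array_shards into one flat {mount path: store config} dict.

    Hypothetical helper for illustration only.
    """
    flat = dict(shards)
    for array_path, by_index in array_shards.items():
        for shard_index, store_config in by_index.items():
            mount = f"{array_path}/{shard_index}"
            if mount in flat:
                raise ValueError(f"duplicate mount point: {mount}")
            flat[mount] = store_config
    return flat


# Placeholder strings stand in for the serialized store configs.
shards = {"people": "shard1.zarr", "species": "shard2.zarr"}
array_shards = {
    "simulation/coarse/foo": {
        "0": "array_shards1/0.zarr",
        "1": "array_shards1/1.zarr",
    }
}
flat = flatten_array_shards(shards, array_shards)
for mount, config in flat.items():
    print(mount, "->", config)
```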

thewtex commented 2 years ago

The example's resulting stores on the filesystem may provide more direct insights.

A few necessary differences between shards and array_shards:

There is one store for a shard. For array_shards, there are multiple.
The array_shard_dims defines the depth of sharding in the array dimensions.
Before being stored in an array shard store, compatibility checks and a transformation occur on the .zarray to ensure it is correct / compatible.

Note that the shard stores could be any existing Zarr store (there is a default behavior for serializing the configuration, which may need tweaks in some cases). It works with v2 stores and should work with v3 stores, but I have not tested the latter. So: IPLD store, AWS store, GCS store, Zip store, another ShardedStore, etc. This simplicity (no changes to the Zarr spec, the Zarr array implementation, or the implementations of all the Zarr stores) is one of the features here.

thewtex commented 2 years ago

a translation layer (or configurable naming layer)

Note that this is array_shard_directory_store, which can be replaced with a function that uses different stores or a different naming convention.

d70-t commented 2 years ago

A few necessary differences between shards and array_shards:

There is one store for a shard. For array_shards, there are multiple.

I don't really get this requirement. I see that it's convenient to have a generating function (e.g. array_shard_directory_store) to create multiple of those. But that generating function isn't part of the config anymore. And if each of the array_shards had its own mountpoint, then all of them could be placed in shards. See below for more on that.
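A minimal sketch of that "everything is a mount point" idea, using plain dicts as stand-ins for stores. The class `MountRouter` is hypothetical and does not reflect how ShardedStore is actually implemented; it only shows longest-prefix routing of keys to backing stores.

```python
class MountRouter:
    """Route keys to backing stores by longest matching mount prefix.

    Hypothetical illustration; plain dicts stand in for Zarr stores.
    """

    def __init__(self, base, mounts):
        self.base = base
        # Check longer prefixes first so nested mounts win over their parents.
        self.mounts = sorted(mounts.items(), key=lambda kv: len(kv[0]), reverse=True)

    def resolve(self, key):
        for prefix, store in self.mounts:
            if key.startswith(prefix + "/"):
                return store, key[len(prefix) + 1:]
        return self.base, key

    def __setitem__(self, key, value):
        store, sub_key = self.resolve(key)
        store[sub_key] = value

    def __getitem__(self, key):
        store, sub_key = self.resolve(key)
        return store[sub_key]


base, shard0, shard1 = {}, {}, {}
router = MountRouter(base, {
    "simulation/coarse/foo/0": shard0,
    "simulation/coarse/foo/1": shard1,
})
router["simulation/coarse/foo/.zarray"] = b"{}"   # no mount matches: goes to base
router["simulation/coarse/foo/1/4.2"] = b"chunk"  # goes to shard1 as "4.2"
```

In this sketch the array metadata stays in the base store and only chunk keys below a mount point reach the shard stores.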

The array_shard_dims defines the depth of sharding in the array dimensions.

Yes, something like this is necessary, but I'd argue that this should probably become part of a translation layer, which would be only loosely coupled to the sharding itself?

Before being stored in an array shard store, compatibility checks and a transformation occur on the .zarray to ensure it is correct / compatible.

Is it necessary to put .zarray etc. in the shards?

I've prepared a small, unpolished gist showing what I have in mind. Chances are that I haven't yet understood all the goals of shardedstore, so maybe my thinking is just not aligned with them.

In the linked gist, the RenumberShardsStore groups nearby chunks into "subfolders" which could be used later on as mountpoints for a ShardedStore (to be used instead of the zarr.storage.MemoryStore). In that setting, ShardedStore would only have to care about directing mountpoints to each backing store, and wouldn't have to think about reshaping or aggregating chunks. My hope is that this separation of concerns would make it easier to experiment with different kinds of data packing, independently of how chunks are grouped into shards. And if RenumberShardsStore were moved further up the stack into the array implementation, then the array would know which chunks should be written in batches, without knowing anything about how they are actually stored.
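The gist itself is not reproduced here, so the following is only a sketch of the kind of renumbering described, not the gist's code. The function `renumber_chunk_key` and the `chunks_per_shard` parameter are assumptions: each chunk index is split into a shard index (which subfolder) and a local index (position inside that subfolder).

```python
def renumber_chunk_key(chunk_key, chunks_per_shard, separator="."):
    """Map a chunk key like "5.2" to "<shard indices>/<local indices>".

    Hypothetical helper for illustration only.
    """
    indices = [int(i) for i in chunk_key.split(separator)]
    if len(indices) != len(chunks_per_shard):
        raise ValueError("chunk key and chunks_per_shard must have the same rank")
    # Integer division picks the shard, the remainder is the position inside it.
    shard = [i // n for i, n in zip(indices, chunks_per_shard)]
    local = [i % n for i, n in zip(indices, chunks_per_shard)]
    shard_part = separator.join(str(i) for i in shard)
    local_part = separator.join(str(i) for i in local)
    return f"{shard_part}/{local_part}"


# With 2 x 2 chunks per shard, chunk (5, 2) lands in shard (2, 1) at local (1, 0).
renumbered = renumber_chunk_key("5.2", chunks_per_shard=(2, 2))
print(renumbered)
```

Neighbouring chunks such as "4.2", "4.3", "5.2" and "5.3" all map into the subfolder "2.1", which is what would let that subfolder act as a mount point.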

EDIT: I just updated the gist to actually use the ShardedStore.

thewtex / shardedstore

renaming chunks ? #14