d70-t opened this issue 2 years ago
Hey @d70-t !
This may already be implemented in a way similar to what you are thinking. It may help to run the example in the README and look at the local output of files and folders.
I just ran the example in the README and had a look at `config`:

```python
'version': '0.3.0',
'name': 'ShardedStore',
'module': 'shardedstore',
'config': {'args': [{'name': 'DirectoryStore',
'module': 'zarr.storage',
'config': {'args': ['/Users/tobi/Documents/code/shardedstore/base.zarr'],
'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
'version': '2.11.3'}],
'kwargs': {'dimension_separator': None,
'shards': {'people': {'name': 'DirectoryStore',
'module': 'zarr.storage',
'config': {'args': ['/Users/tobi/Documents/code/shardedstore/shard1.zarr'],
'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
'version': '2.11.3'},
'species': {'name': 'DirectoryStore',
'module': 'zarr.storage',
'config': {'args': ['/Users/tobi/Documents/code/shardedstore/shard2.zarr'],
'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
'version': '2.11.3'}},
'array_shard_dims': {'simulation/coarse/foo': 1, 'simulation/fine/foo': 1},
'array_shards': {'simulation/coarse/foo': {'0': {'name': 'DirectoryStore',
'module': 'zarr.storage',
'config': {'args': ['/Users/tobi/Documents/code/shardedstore/array_shards1/0.zarr'],
'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
'version': '2.11.3'},
'1': {'name': 'DirectoryStore',
'module': 'zarr.storage',
'config': {'args': ['/Users/tobi/Documents/code/shardedstore/array_shards1/1.zarr'],
'kwargs': {'normalize_keys': False, 'dimension_separator': None}},
'version': '2.11.3'}},
...
```
Maybe what I'm wondering is: why are `array_shards` and `array_shard_dims` needed, and why can't `array_shards` just be part of `shards`? E.g.:

```python
'shards': {
    'simulation/coarse/foo/0': ...,
    'simulation/coarse/foo/1': ...,
}
```
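To make the mounting-only idea concrete, here is a toy sketch of a store that dispatches purely on key prefixes (names and dispatch logic are mine, not the package's actual implementation; plain dicts stand in for real Zarr stores):

```python
from collections.abc import MutableMapping

class PrefixShardedStore(MutableMapping):
    """Toy sketch: route each key to the sub-store whose mount-point
    prefix matches it (longest prefix wins); fall back to a base store."""

    def __init__(self, base, shards):
        self.base = base      # fallback store for unmatched keys
        self.shards = shards  # {mount_prefix: store}

    def _route(self, key):
        best = None
        for prefix in self.shards:
            if key.startswith(prefix + '/'):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            return self.base, key
        return self.shards[best], key[len(best) + 1:]

    def __getitem__(self, key):
        store, sub = self._route(key)
        return store[sub]

    def __setitem__(self, key, value):
        store, sub = self._route(key)
        store[sub] = value

    def __delitem__(self, key):
        store, sub = self._route(key)
        del store[sub]

    def __iter__(self):
        yield from self.base
        for prefix, store in self.shards.items():
            for k in store:
                yield f"{prefix}/{k}"

    def __len__(self):
        return len(self.base) + sum(len(s) for s in self.shards.values())

# Toy usage: each array chunk group is just another mount point.
base, s0, s1 = {}, {}, {}
store = PrefixShardedStore(base, {
    'simulation/coarse/foo/0': s0,
    'simulation/coarse/foo/1': s1,
})
store['simulation/coarse/foo/0/0.0'] = b'chunk'  # routed to s0 as '0.0'
```

With this, array-shard mounts and plain shard mounts would be handled uniformly, at the cost of pushing the chunk-to-subfolder grouping elsewhere.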
The example's resulting stores on the filesystem may provide more direct insights.

A few necessary differences between `shards` and `array_shards`:

- There is one store for a `shard`. For `array_shards`, there are multiple.
- The `array_shard_dims` defines the depth of sharding in the array dimensions.
- Before being stored in an array shard store, compatibility checks and a transformation occur on the `.zarray` to ensure it is correct / compatible.

Note that the shard stores could be any existing Zarr store (there is a default behavior for serializing the configuration, which may need tweaks in some cases). It works with v2 stores; it should work with v3 stores, but I have not tested this. So: IPLD store, AWS store, GCS store, Zip store, another ShardedStore, etc. This simplicity -- there are no changes to the Zarr spec, the Zarr array implementation, or the implementations of all the Zarr stores -- is one of the features here.
> a translation layer (or configurable naming layer)

Note that this is `array_shard_directory_store`, which can be replaced with a function that uses different stores or a different naming convention.
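Purely as an illustration of what such a replacement could look like (the index-to-store signature here is my assumption about the hook, not the package's documented API), a factory could hand out in-memory stores instead of directory stores:

```python
def dict_shard_factory(registry):
    """Hypothetical replacement for array_shard_directory_store:
    hands out in-memory dict stores keyed by shard index instead of
    creating one DirectoryStore per shard (signature is assumed)."""
    def make_store(index):
        # Reuse the store for a known index, create a fresh one otherwise.
        return registry.setdefault(index, {})
    return make_store

registry = {}
factory = dict_shard_factory(registry)
shard0 = factory(0)
shard0['0.0'] = b'chunk'
assert factory(0) is shard0  # same index -> same backing store
```

The same pattern could return S3-backed stores, Zip stores, or apply a different naming convention, which is the flexibility the comment above points at.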
> A few necessary differences between `shards` and `array_shards`:
>
> - There is one store for a `shard`. For `array_shards`, there are multiple.
I don't really get this requirement. I see that it's convenient to have a generating function (e.g. `array_shard_directory_store`) to create multiple of those. But that generating function isn't part of the `config` anymore. And if each of the `array_shards` had its own mount point, then all of them could be placed in `shards`. See below for more on that.
> - The `array_shard_dims` defines the depth of sharding in the array dimensions.
Yes, something like this is necessary, but I'd argue that this should probably become part of a translation layer, which would be only loosely coupled to the sharding itself?
> - Before being stored in an array shard store, compatibility checks and a transformation occur on the `.zarray` to ensure it is correct / compatible.
Is it necessary to put `.zarray` etc. in the shards?
I've prepared a little unpolished gist about what I have in mind. Chances are that I didn't yet get all the goals of `shardedstore`, so maybe my thinking is just not aligned with them.

In the linked gist, the `RenumberShardsStore` groups nearby chunks into "subfolders" which could later be used as mount points for a `ShardedStore` (to be used instead of the `zarr.storage.MemoryStore`). In that setting, `ShardedStore` would only have to direct mount points to each backing store, and wouldn't have to think about reshaping or aggregating chunks. My hope is that this separation of concerns would make it easier to experiment with different kinds of data packing, independently of how chunks are grouped into shards. And if `RenumberShardsStore` were moved further up the stack into the array implementation, then the array would know which chunks should be written in batches, without knowing anything about how they are actually stored.
EDIT: I just updated the gist to actually use the `ShardedStore`.
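The gist itself isn't reproduced here, but the key-translation idea can be sketched as follows (the names and the grouping rule below are mine, not the gist's actual code): a wrapper rewrites a chunk key like `foo/12.3` into `foo/shard_1/12.3`, so that every `group_size` chunks along the first dimension share a subfolder that could later serve as a mount point.

```python
class RenumberingStore:
    """Hypothetical sketch of a key-translating wrapper: chunk keys
    'array/i.j' are rewritten so that chunks with nearby first-dimension
    indices land in a common 'shard_N' subfolder."""

    def __init__(self, inner, group_size=10):
        self.inner = inner          # any dict-like backing store
        self.group_size = group_size

    def _translate(self, key):
        path, _, chunk = key.rpartition('/')
        first = chunk.partition('.')[0]
        if not first.isdigit():
            return key              # metadata keys like .zarray pass through
        shard = int(first) // self.group_size
        prefix = f"{path}/" if path else ""
        return f"{prefix}shard_{shard}/{chunk}"

    def __setitem__(self, key, value):
        self.inner[self._translate(key)] = value

    def __getitem__(self, key):
        return self.inner[self._translate(key)]

# Toy usage with a plain dict as the backing store.
backing = {}
store = RenumberingStore(backing)
store['foo/12.3'] = b'chunk'    # stored under 'foo/shard_1/12.3'
store['foo/.zarray'] = b'{}'    # metadata key left untouched
```

Since the translation is independent of where the subfolders are actually stored, the `shard_N` prefixes could then be mounted onto arbitrary backing stores by a `ShardedStore`.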
Hey @thewtex, thanks for starting the `shardedstore`!

One thing I've been wondering about with sharding is whether there should be a translation layer (or configurable naming layer) for chunk keys somewhere close to the array implementation, and in particular on top of any other store (transformers). However, I don't know if I've been able to communicate that well enough.

If I get it right, in the current `shardedstore` there are two different ways of specifying sharding: one is by defining paths (i.e. mounting a store into a sharded store) and one is by defining array shards, which I haven't yet fully figured out. I'm wondering if these two could be unified by translating the chunk keys/paths so that array chunks become subfolders. Then sharding could happen solely via the subfolder mounting approach?

If that translation happened very close to (or even inside) the array implementation, an additional benefit might be that the array could eventually schedule writes on a per-shard / per-subfolder basis, without knowing how exactly the sharding happens below.