pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Enable xarray `.chunk()` style input for target chunks #93

Closed: jbusecke closed this 2 years ago

jbusecke commented 3 years ago

I have started implementing this functionality (many thanks to @pmav99 for the help over in #83)

I think the main functionality works, but I have a few outstanding questions I would like to discuss.

The current implementation will rechunk to the full length of a dimension if the dimension is not provided in target_chunks:

If ds has two dimensions: x/y with lengths 10/20 and chunks 5/2

Then what do we want to happen to dimension y if we provide target_chunks={'x': 5}? @pmav99's implementation currently defaults to the full length of y as the chunk size for y.

I would say we should stay as close as possible to the behavior of xarray's .chunk(), which I believe just leaves the chunks of unspecified dimensions alone. Thoughts?
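
For reference, here is a minimal sketch of the xarray behavior I am referring to (not code from this PR; it assumes a dask-backed dataset and an arbitrary variable name "a"):

```python
import numpy as np
import xarray as xr

# dims x/y with lengths 10/20, chunked 5/2, as in the example above
ds = xr.Dataset({"a": (("x", "y"), np.zeros((10, 20)))}).chunk({"x": 5, "y": 2})

# only 'x' is specified; xarray rechunks 'x' and leaves the 'y' chunks untouched
rechunked = ds.chunk({"x": 2})
print(rechunked["a"].data.chunks)
# expected: x chunks become (2, 2, 2, 2, 2); y chunks stay all 2
```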

Also, should this only work for Datasets, or also for DataArrays?

jbusecke commented 3 years ago

I am having two issues here, and would appreciate some input.

1. I have installed pre-commit, but the formatting does not seem to be identical. Not really sure how to solve this.
2. All of my tests are failing with the same error: NameError: name 'fsspec' is not defined. I don't get this at all, since the other tests using fsspec seem to succeed. Am I missing something here?

codecov[bot] commented 3 years ago

Codecov Report

Merging #93 (67efacb) into master (73ef80b) will decrease coverage by 0.25%. The diff coverage is 100.00%.

Current head 67efacb differs from pull request most recent head 6d449f3. Consider uploading reports for the commit 6d449f3 to get more accurate results.

@@            Coverage Diff             @@
##           master      #93      +/-   ##
==========================================
- Coverage   96.69%   96.44%   -0.26%     
==========================================
  Files          12       12              
  Lines         515      535      +20     
  Branches       93      101       +8     
==========================================
+ Hits          498      516      +18     
- Misses         11       13       +2     
  Partials        6        6              
Impacted Files                    Coverage Δ
rechunker/api.py                  100.00% <100.00%> (ø)
rechunker/executors/pywren.py     94.28% <0.00%> (-5.72%)


rabernat commented 3 years ago

1. I have installed pre-commit but the formatting does not seem to be identical. Not really sure how to solve this.

I'm not sure either. I learned about pre-commit from you! It would be great if someone could sort this out for rechunker.

2. All of my tests are failing with the same error NameError: name 'fsspec' is not defined

The error I see is

 >       assert dst.a.data.chunksize == target_chunks_expected
  E       NameError: name 'dst' is not defined

jbusecke commented 3 years ago

Ok, I figured out the linting. The problem was that this pre-commit-config.yml has a pretty old black version. When I was saving files, things were formatted based on a different, more recent version, so I just needed to make sure not to save at all after pre-commit run --all-files. To avoid this mismatch, one could update the pinned black version.

I got the tests running, but there is a design question that remains:

What do we want to happen when a dimension of the dataset is not specified in target_chunks? For instance, if we have a 2D dataset (with dims x and y), what should happen when we call rechunk(ds, ..., {'x': 3})? Should that default to the full length of y (current implementation) or preserve the existing chunks along y? The latter is a bit trickier to implement, since the existing chunks could be different for different variables.
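
To make the two options concrete, here is a rough sketch using the xarray-style input this PR adds (the store paths and max_mem value are just placeholders):

```python
import numpy as np
import xarray as xr
from rechunker import rechunk

# 2D dataset with dims x (length 10) and y (length 20), currently chunked 5/2
ds = xr.Dataset({"a": (("x", "y"), np.zeros((10, 20)))}).chunk({"x": 5, "y": 2})

plan = rechunk(
    ds,
    target_chunks={"x": 3},   # 'y' deliberately left out
    max_mem="100MB",
    target_store="target.zarr",
    temp_store="temp.zarr",
)

# Option A (current implementation): 'y' becomes a single chunk of length 20.
# Option B (match xarray's .chunk):  'y' keeps its existing chunks of 2.
```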

rabernat commented 3 years ago

What do we want to happen when a dimension of the dataset is not specified in the target_chunks

I would say that the chunking should not change for that dimension. To indicate a contiguous dimension, you should explicitly pass e.g. {'y': -1}.
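
In other words (a sketch of the proposed semantics, mirroring the dask/xarray convention):

```python
target_chunks = {"x": 3, "y": -1}  # -1: rechunk 'y' into a single full-length chunk
target_chunks = {"x": 3}           # 'y' omitted: leave its existing chunking alone
```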

jbusecke commented 3 years ago

Ok, I think that would then be 100% in line with xarray, so that seems like a good way to go. Let me see how I can achieve that...

jbusecke commented 3 years ago

Yay, I think I got it to work. This will now preserve the chunk sizes (possibly different for different variables) of the existing dataset if not otherwise specified.
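
A small sketch of the behaviour I mean (the variable names are just for illustration):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("x", "y"), np.zeros((10, 20))),
        "b": (("x", "y"), np.zeros((10, 20))),
    }
)
ds["a"] = ds["a"].chunk({"x": 5, "y": 2})
ds["b"] = ds["b"].chunk({"x": 5, "y": 10})

# With target_chunks={"x": 3}, 'a' should keep its y chunks of 2 and
# 'b' should keep its y chunks of 10; only 'x' is rechunked to 3.
```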

jbusecke commented 2 years ago

Super sorry for dropping the ball on this. The original logic proposed by @pmav99 above was not quite giving me the desired output, but with a minimal change (I reordered the logic so that if dim in target_chunks.keys() is checked first), the original tests pass. I also added some specific tests for get_dim_chunk.
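
Roughly, the reordered per-dimension logic now looks like this (a simplified sketch, not the exact code in this PR):

```python
def get_dim_chunk(da, dim, target_chunks):
    # An explicit entry in target_chunks wins -- this check now comes first.
    if dim in target_chunks.keys():
        return target_chunks[dim]
    # Otherwise keep the variable's existing chunk size along that dimension,
    if da.chunks is not None:
        return da.chunks[da.get_axis_num(dim)][0]
    # or fall back to the full dimension length for unchunked input.
    return da.sizes[dim]
```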

Could you check if this looks good to you @pmav99? If so I will add docstrings and an entry to the docs.

jbusecke commented 2 years ago

The only thing that I am not really fond of is the length of test_rechunk_dataset_dimchunks (60 lines and probably counting). Ideally the dataset creation should be done in a function or in a pytest fixture. Nevertheless, at the moment, both session-scoped fixtures and plain functions are used for this purpose in this particular module, so it is not easy to make a choice. Therefore I would suggest keeping this as it is, merging the code, and perhaps refactoring the test suite in the future.

That makes sense, but since this was basically just adopted from the existing tests, I think your suggestion to refactor at a later time would be better (rather than mixing new features with changes to existing tests).

rabernat commented 2 years ago

I concur with the comments about the need to refactor the tests. This is an example of maintenance work that would really benefit rechunker but that the original devs have not managed to find time for. It would be great if @jbusecke and @pmav99 were willing to get more involved and take this on eventually.

I am willing to merge this as is, but I will open an issue about the testing refactor.

jbusecke commented 2 years ago

I see your point @rabernat, but the problem for me is that I do not know a lot of the internals here, so it will take me some time to figure this out.

In the meantime, I have a more important design question regarding this feature. I was just looking through the tutorial in the docs, and I am wondering whether we actually want to chunk dimension arrays. I believe the current code will chunk e.g. a 1D dimension array (x) the same way as an ND array with multiple dimensions (x, y, z). I am not sure whether that is desirable or not.
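
To illustrate what I mean (this is the outcome I assume, not something I have verified in detail):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"data": (("x", "y", "z"), np.zeros((10, 4, 4)))},
    coords={"x": np.arange(10)},
)

# With target_chunks={"x": 3}:
#   'data' -> chunks of 3 along x (clearly what we want)
#   'x'    -> the 1D dimension coordinate would also be written in chunks of 3
#             (is that desirable for such a small array?)
```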

rabernat commented 2 years ago

I am wondering if we actually want to chunk dimension arrays?

The original version of rechunker only operated on Zarr arrays / groups, with no option for Xarray inputs; that was added later. So we really don't distinguish between dimension and non-dimension coordinates (or between coordinates and data variables at all). This is what we say about rechunking Zarr groups:

For a group of arrays, a dict is required. The keys correspond to array names. The values are target_chunks arguments for the array. For example, {'foo': (20, 10), 'bar': {'x': 3, 'y': 5}, 'baz': None}. All arrays you want to rechunk must be explicitly named. Arrays that are not present in the target_chunks dict will be ignored.
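
For concreteness, that group-level spelling would be used roughly like this (a sketch; the store names and max_mem value are placeholders):

```python
import zarr
from rechunker import rechunk

group = zarr.open_group("source.zarr")  # contains arrays 'foo', 'bar', 'baz'

plan = rechunk(
    group,
    target_chunks={"foo": (20, 10), "bar": {"x": 3, "y": 5}, "baz": None},
    max_mem="100MB",
    target_store="target.zarr",
    temp_store="temp.zarr",
)
plan.execute()
```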

We don't really specify the behavior for Xarray datasets clearly, which means it is yet to be defined.

Looking more closely at the code, I am starting to think we need to split the current high-level API, which basically only exposes the rechunk function, into several different functions:

...and deprecate rechunk. That would allow us to have different options and defaults that are appropriate to each scenario. It would also remove a ton of ugly if / else blocks from the code.

Obviously we don't want to do that in this PR. In the meantime, I would say that no special treatment of dimension coordinates is needed here.

jbusecke commented 2 years ago

Sorry for the long pause on this. I am back from vacation and eager to use this feature. I have trimmed the test parameterizations in the test specific to the new chunking method to a minimum, and added some comments reflecting @rabernat's comments above. Do you think this is sufficient on the code level? I'll go ahead and add a little example for the docs now.

jbusecke commented 2 years ago

Ok, I added an example but encountered a bunch of issues with the tutorial that make it quite ugly as is. I wonder if we should merge the code changes here as is and move the docs to another PR (happy to work on that) that updates the docs and addresses #103, #104, and #105?

jbusecke commented 2 years ago

Oh, I just noticed that I accidentally included the new tutorial (with a bunch of failures) in this PR. Any opinion on how best to solve this?

rabernat commented 2 years ago

Oops, sorry for not catching that. Could you make a new PR to fix that (and any other issues you found)?

jbusecke commented 2 years ago

Yep, will do. Just waiting for #106 to be merged.