pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

[WIP] Incremental rechunking #28

Open davidbrochart opened 4 years ago

davidbrochart commented 4 years ago

This is a limited implementation of incremental rechunking. There is still a lot to do, but I'd like to get early feedback on the approach. Closes #8

codecov[bot] commented 4 years ago

Codecov Report

Merging #28 into master will increase coverage by 0.75%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #28      +/-   ##
==========================================
+ Coverage   88.94%   89.70%   +0.75%     
==========================================
  Files           2        2              
  Lines         190      204      +14     
  Branches       44       50       +6     
==========================================
+ Hits          169      183      +14     
  Misses         11       11              
  Partials       10       10              
Impacted Files      Coverage Δ
rechunker/api.py    93.52% <100.00%> (+0.72%) ↑

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update fdddf0f...4506e97.

rabernat commented 4 years ago

Thanks a lot for this PR @davidbrochart! I really appreciate your contribution. I will try to give a thorough review in the next few days.

rabernat commented 4 years ago

I would wait on further action on this until #30 is merged. That is a pretty significant refactor to the internal structure of the package.

davidbrochart commented 4 years ago

Yes, I agree.

rabernat commented 4 years ago

@davidbrochart, now that #30 is done, we might want to revisit this.

Perhaps @shoyer has some ideas about how to best incorporate incremental rechunking / appending into the new code structure.

Again it seems like xarray's lazy indexing adaptors could come in very handy.

davidbrochart commented 4 years ago

@rabernat do you mean that rechunker would depend on xarray, or that we would pull xarray's lazy indexing logic into rechunker's code?

rsemlal-murmuration commented 5 months ago

Was there any progress on this since then?

rabernat commented 5 months ago

Hi @rsemlal-murmuration - turns out that incremental rechunking is pretty tricky (lots of edge cases)! There hasn't been any work on this recently in rechunker.

However, at Earthmover, we are currently exploring many different approaches to this problem.

rsemlal-murmuration commented 5 months ago

Understood! Thanks for the quick reply!

We are looking into this as well at the moment. The workaround we are considering is to use rechunker to write the new data slice to an intermediate location, then append it from there to the existing dataset with xarray's Dataset.to_zarr(mode="a", append_dim=...). It is obviously not the most efficient approach, since the slice is written twice.

We would be interested to hear about other approaches or workarounds out there.
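
For anyone weighing the append step of this workaround, one thing that affects its cost is whether the existing dataset ends on a chunk boundary along the append dimension. If it does not, the last, partially filled chunk has to be read and rewritten together with the new data. The sketch below is a minimal illustration in plain Python, assuming a regular chunk grid along a single append dimension. The function name `chunks_touched_by_append` and its parameters are hypothetical and not part of rechunker, xarray, or zarr.

```python
def chunks_touched_by_append(existing_len, append_len, chunk_size):
    """Report which target chunks an append writes along one dimension.

    Hypothetical helper for illustration only. All arguments are element
    counts along the append dimension of a regularly chunked array.

    Returns (first_chunk, last_chunk, rewrites_partial), where the chunk
    indices are inclusive and rewrites_partial is True when the existing
    data ends in the middle of a chunk.
    """
    if existing_len < 0 or append_len <= 0 or chunk_size <= 0:
        raise ValueError(
            "existing_len must be >= 0; append_len and chunk_size must be > 0"
        )

    # Index of the chunk that receives the first appended element.
    first_chunk = existing_len // chunk_size
    # Index of the chunk that receives the last appended element.
    last_chunk = (existing_len + append_len - 1) // chunk_size
    # If the existing length is not a multiple of the chunk size, the first
    # touched chunk already holds data and must be read, merged and rewritten.
    rewrites_partial = existing_len % chunk_size != 0

    return first_chunk, last_chunk, rewrites_partial


print(chunks_touched_by_append(100, 30, 50))   # aligned append
print(chunks_touched_by_append(120, 100, 50))  # append into a partial chunk
```

With an existing length of 120, a chunk size of 50, and 100 new elements, the call returns (2, 4, True): chunks 2 through 4 are written, and chunk 2 is a rewrite of a partially filled chunk. When the existing length is a multiple of the chunk size, as in the first call, only new chunks are created.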