pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Assertion Error `assert temp_store_or_group is not None` trying to use AWS S3 as intermediate store #32

Status: Open · opened by abarciauskas-bgse 4 years ago

abarciauskas-bgse commented 4 years ago

👋 First off, I'm very grateful for and excited about this package, and I have used it to rechunk a large array (MUR SST analysed_sst).

Now I am trying to rechunk the entire mursst group, which includes lat, lon, and time coordinates plus three additional variables (though I'm focusing on analysed_sst for now).

I'm hoping for some advice on troubleshooting the following error:

import s3fs
import zarr
from rechunker import rechunk

# Open the consolidated source group from S3
s3 = s3fs.S3FileSystem(client_kwargs=dict(region_name='us-west-2'))
s3_store = s3fs.S3Map(root='mur-sst/zarr', s3=s3, check=False)
ds_zarr = zarr.open_consolidated(s3_store, mode='r')

# Target and intermediate stores, also on S3
s3_rechunk_store = s3fs.S3Map(root='example-bucket/group.zarr', create=True, s3=s3)
s3_tmp_store = s3fs.S3Map(root='example-bucket/tmp-group.zarr', create=True, s3=s3)

# A value of None leaves that array's chunks unchanged
target_chunks = {
    'analysed_sst': {'time': 6443, 'lat': 100, 'lon': 100},
    'lat': None,
    'lon': None,
    'time': None
}
max_mem = '2GB'

array_plan = rechunk(ds_zarr, target_chunks, max_mem, s3_rechunk_store, s3_tmp_store)
array_plan

When I run this I get the following assertion error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-41-10abc73093a1> in <module>
      7 max_mem = '2GB'
      8 
----> 9 array_plan = rechunk(ds_zarr, target_chunks, max_mem, s3_rechunk_store, s3_tmp_store)
     10 array_plan

/opt/conda/envs/zarr_rechunker/lib/python3.8/site-packages/rechunker/api.py in rechunk(source, target_chunks, max_mem, target_store, temp_store)
    204 
    205         for array_name, array_target_chunks in target_chunks.items():
--> 206             delayed = _rechunk_array(
    207                 source[array_name],
    208                 array_target_chunks,

/opt/conda/envs/zarr_rechunker/lib/python3.8/site-packages/rechunker/api.py in _rechunk_array(source_array, target_chunks, max_mem, target_store_or_group, temp_store_or_group, name, source_storage_options, temp_storage_options, target_storage_options)
    318     else:
    319         # do intermediate store
--> 320         assert temp_store_or_group is not None
    321         int_array = _zarr_empty(
    322             shape, temp_store_or_group, int_chunks, dtype, name=name

AssertionError: 

This is odd because if I try to use local storage (e.g. replace s3_tmp_store with a local directory 'tmp.zarr'), the plan is created and I can start executing it, but I know the plan will eventually fail because there is not enough local disk storage.

Any advice is welcome. Here is the notebook I am using: Rechunking_Zarr.ipynb

TomAugspurger commented 4 years ago

I haven't checked closely, but I think right now rechunker will refuse to overwrite existing data. Can you try removing the objects at example-bucket/tmp-group.zarr (and example-bucket/group.zarr if there are any) before doing the rechunk?

As a user, what do you think would be the best behavior here? Probably overwriting any objects/files that are already present?
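
A minimal sketch of that cleanup with s3fs, assuming the paths from the report and credentials that allow deletion, might look like this:

import s3fs

s3 = s3fs.S3FileSystem(client_kwargs=dict(region_name='us-west-2'))

# Recursively delete everything under each prefix; s3fs raises
# FileNotFoundError when there is nothing at the path
for prefix in ['example-bucket/tmp-group.zarr', 'example-bucket/group.zarr']:
    try:
        s3.rm(prefix, recursive=True)
    except FileNotFoundError:
        pass  # nothing to remove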

rabernat commented 4 years ago

@abarciauskas-bgse -- thanks so much for reporting this bug!

I'm confused why this is happening. Can you try just running

assert s3_tmp_store is not None

on its own? I don't understand why that assertion statement is getting triggered here when you are passing a valid store. I don't think that @TomAugspurger's suggestion related to overwriting is relevant here, because that would raise a different error.

TomAugspurger commented 4 years ago

Sorry, my comment can be ignored. I wasn't looking closely enough.

abarciauskas-bgse commented 4 years ago

@rabernat @TomAugspurger thanks for your responses. I did some more troubleshooting, and it looks like the issue is that if the S3 path for the temp store doesn't already exist, temp_store_or_group ends up as None and the assertion fails. So the workaround is just to create the S3 path first, which works for me! But this differs from the behavior with a local temp store, where the path is created for you.

Happy to close this issue or leave it open until this scenario is handled (which I can look into if that helps; it just might take me more time than someone more familiar with the codebase).
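
A minimal sketch of the workaround, assuming the bucket already exists (the .keep placeholder name is just illustrative; S3 has no real directories, so a prefix only "exists" once some key is under it):

import s3fs

s3 = s3fs.S3FileSystem(client_kwargs=dict(region_name='us-west-2'))

# Write a placeholder object so the prefix exists before building the mapper
s3.touch('example-bucket/tmp-group.zarr/.keep')

s3_tmp_store = s3fs.S3Map(root='example-bucket/tmp-group.zarr', s3=s3)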

rabernat commented 4 years ago

I did some more troubleshooting, and it looks like the issue is that if the S3 path for the temp store doesn't already exist, temp_store_or_group ends up as None and the assertion fails. So the workaround is just to create the S3 path first, which works for me!

This is very good to know. I did all my debugging with GCS, where the behavior is very different: if the path doesn't exist, it gets created automatically.

I think we need to do one or both of the following:

  1. Open an issue in s3fs to document this
  2. Mention this workaround in the rechunker documentation

rabernat commented 4 years ago

I'll just ping @martindurant to see if he has any thoughts on the different behavior in s3fs vs gcsfs.

martindurant commented 4 years ago

Coming in late: do I understand correctly that something different happens when you create a mapper on S3 versus GCS? I would imagine this is an fs.exists implementation issue, which is tricky to do for a directory on object stores. Since quite a lot changed with the async rewrite, it would be worth checking whether this is still the case on gcsfs master and with https://github.com/dask/s3fs/pull/336 (not yet merged).
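
An illustrative probe of the suspected difference (bucket names are placeholders; credentials for both backends are assumed):

import s3fs
import gcsfs

# A prefix with no objects under it: object stores have no real directories,
# so whether it "exists" depends on how each backend emulates them
s3 = s3fs.S3FileSystem()
print(s3.exists('example-bucket/empty-prefix.zarr'))

gcs = gcsfs.GCSFileSystem()
print(gcs.exists('example-bucket/empty-prefix.zarr'))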

rabernat commented 4 years ago

do I understand that something different is happening when you create a mapper on s3 versus gcsfs?

Yes, we think this is the case, but no one has sat down to test it carefully.

rabernat commented 4 years ago

So workaround is just to create the S3 path

@abarciauskas-bgse: can you share the code you use to create the S3 path?

abarciauskas-bgse commented 4 years ago

@rabernat

s3_tmp_store = s3fs.S3Map(root='ds-projects/eodc/mursst/tmp.zarr', create=True, s3=s3)

This is used within the notebook: https://github.com/abarciauskas-bgse/mur_sst-to-zarr/blob/master/images/rechunker/Rechunking_Zarr.ipynb
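
Putting the pieces together, a sketch of the full flow with the workaround applied (paths are from the original report; the .keep key is an illustrative placeholder, and the returned plan's execute() method runs the copy):

import s3fs
import zarr
from rechunker import rechunk

s3 = s3fs.S3FileSystem(client_kwargs=dict(region_name='us-west-2'))
ds_zarr = zarr.open_consolidated(
    s3fs.S3Map(root='mur-sst/zarr', s3=s3, check=False), mode='r')

# Make sure the temp-store prefix exists before handing it to rechunk
s3.touch('example-bucket/tmp-group.zarr/.keep')

s3_rechunk_store = s3fs.S3Map(root='example-bucket/group.zarr', create=True, s3=s3)
s3_tmp_store = s3fs.S3Map(root='example-bucket/tmp-group.zarr', create=True, s3=s3)

target_chunks = {
    'analysed_sst': {'time': 6443, 'lat': 100, 'lon': 100},
    'lat': None, 'lon': None, 'time': None,
}

array_plan = rechunk(ds_zarr, target_chunks, '2GB', s3_rechunk_store, s3_tmp_store)
array_plan.execute()  # runs the rechunking with dask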