I've got a test dataset and a stripped-down workflow for re-chunking. You can see the rendered version at https://gzt5142.github.io/hytest-workbook/L2/Pre/ReChunkingData.html
I believe I've patterned it closely after the sample notebook from NCAR... but the rechunk is silently corrupting the `feature_id` index. I can't figure out where I'm going wrong. Another set of eyes would be useful.
Low-ish priority... plenty of other things to do for a while....
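For reference, a quick sanity check along these lines will surface the (otherwise silent) index corruption -- the store paths here are placeholders, not the actual datasets:

```python
import xarray as xr

# Hypothetical paths; compare the feature_id coordinate before and
# after the rechunk to detect silent corruption.
src = xr.open_zarr("source.zarr")
out = xr.open_zarr("rechunked.zarr")

assert (src["feature_id"].values == out["feature_id"].values).all(), \
    "feature_id index changed during rechunk"
```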
Problem solved. The data corruption I was experiencing came down to not understanding what the return value of `rechunker.rechunk()` actually represents.

Learnings:

- `rechunker.rechunk()` returns, and a new zarr folder/group appears on disk -- but this does not mean the data was fully written. ONLY the metadata is. The copy itself is a lazy dask graph, so you STILL have to run `.execute()` on the object that `rechunker.rechunk()` returns.
- `rechunker.rechunk()` does not consolidate zarr metadata. Have to do that by hand after the dataset is created -- use `zarr.consolidate_metadata('path/to/zarr')`.
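Putting those together, a minimal sketch of the sequence (store paths, variable names, and chunk sizes are illustrative, not the real workflow's):

```python
import xarray as xr
import zarr
from rechunker import rechunk

ds = xr.open_zarr("source.zarr")

# Build the rechunk plan. Chunk sizes and variable names are placeholders.
plan = rechunk(
    ds,
    target_chunks={
        "streamflow": {"time": 672, "feature_id": 30000},
        "time": None,        # leave coordinate arrays alone
        "feature_id": None,
    },
    max_mem="2GB",
    target_store="rechunked.zarr",
    temp_store="scratch.zarr",
)

# rechunk() returning (and the target store appearing on disk) only means
# the metadata was written. The actual copy is lazy:
plan.execute()

# rechunker does not consolidate zarr metadata; do it by hand afterwards:
zarr.consolidate_metadata("rechunked.zarr")
```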
Attempting to scale the re-chunk pre-processing to a full-sized dataset on the ESIP hub. I think I have the skeleton of the notebook sorted out, but I need a place to write the scratch-space data and the final output.
TIL that the workers on kubernetes can't see the unix filesystem, so I need to write to S3 -- roughly as sketched below. Once we get a hytest-tutorial scratch bucket, I can finish off this cloud tutorial.
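What I have in mind (the hytest-tutorial bucket doesn't exist yet, so these URIs are hypothetical):

```python
import fsspec

# The rechunker temp and target stores can be fsspec mappers pointing
# at S3, so the kubernetes workers write there instead of local disk.
temp = fsspec.get_mapper("s3://hytest-tutorial/scratch/tmp.zarr")
target = fsspec.get_mapper("s3://hytest-tutorial/scratch/rechunked.zarr")
```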
I have plans to adapt it to on-prem data once my PIV card is activated and I'm fully credentialed.
This issue is stalled/blocked until I have HPC access and/or S3 scratch space.
I have credentials to access `nhgf-development`... but am not having much luck with writing data. I can read and `ls`... and `mkdir` does not give errors when I make a folder.
But attempts to write data throw an `AccessDenied` error. I don't know whether this is due to errors in the way I am writing to S3, or a permissions thing that I've not got right.
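For the record, the failure pattern looks like this (the probe path is made up; the calls are roughly what the notebook does via s3fs):

```python
import s3fs

fs = s3fs.S3FileSystem()  # credentials picked up from the environment

fs.ls("nhgf-development")                        # works
fs.mkdirs("nhgf-development/workspace/_probe")   # no error

# This is where AccessDenied shows up:
with fs.open("nhgf-development/workspace/_probe/test.txt", "wb") as f:
    f.write(b"write test")
```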
Example Notebook at:
https://jupyter.qhub.esipfed.org/hub/user-redirect/lab/tree/shared/users/gzt5142/hytest/dataset_preprocessing/ReChunkingData_Cloud.ipynb
Problem solved.... I must have just needed to step away for a couple of days.
I can successfully demo a run of `rechunker` that reads anonymously from `noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr` and rechunks into time-series-friendly chunks, written to `s3://nhgf-development/workspace/testing/tutorial/rechunked.zarr`.
I believe my earlier issue was not using the same credentials for the cluster spin-up as I was using for the S3 write. I had been letting that be handled by environment variables -- naming the credentials explicitly made things work.
In retrospect, this makes perfect sense... the kubernetes workers in the cluster need the correct credentials to write to the S3 output location. Duh. :homer:
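The shape of the fix, sketched. Whether the gateway exposes an `environment` cluster option depends on how the hub is configured, so treat this as illustrative rather than exact:

```python
import os
import s3fs
import xarray as xr
from dask_gateway import Gateway

# Read the source anonymously -- it's a public bucket:
src_fs = s3fs.S3FileSystem(anon=True)
ds = xr.open_zarr(
    src_fs.get_mapper("s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr")
)

# Name the write credentials explicitly rather than trusting whatever
# happens to be in the notebook's environment:
creds = {
    "key": os.environ["AWS_ACCESS_KEY_ID"],
    "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
}
out_fs = s3fs.S3FileSystem(**creds)
target = out_fs.get_mapper(
    "s3://nhgf-development/workspace/testing/tutorial/rechunked.zarr"
)

# ...and hand the same credentials to the kubernetes workers when the
# cluster spins up (option name varies with the gateway's config):
gateway = Gateway()
options = gateway.cluster_options()
options.environment = {
    "AWS_ACCESS_KEY_ID": creds["key"],
    "AWS_SECRET_ACCESS_KEY": creds["secret"],
}
cluster = gateway.new_cluster(options)
client = cluster.get_client()
```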
Gene: tutorial/sample workflow for re-chunking datasets...
Model after https://github.com/NCAR/rechunk_retro_nwm_v21/blob/main/notebooks/usage_example_rerechunk_chrtout.ipynb