I've got a test dataset and a stripped-down workflow for re-chunking. You can see the rendered version at https://gzt5142.github.io/hytest-workbook/L2/Pre/ReChunkingData.html
I believe I've patterned it closely after the sample notebook from NCAR... but the rechunk is silently corrupting the `feature_id` index. I can't figure out where I'm going wrong. Another set of eyes would be useful.
Low-ish priority... plenty of other things to do for a while....
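For reference, a quick sanity check along these lines will surface the (otherwise silent) index corruption -- the store paths here are placeholders, not the actual datasets:

```python
import xarray as xr

# Hypothetical paths; compare the feature_id coordinate before and
# after the rechunk to detect silent corruption.
src = xr.open_zarr("source.zarr")
out = xr.open_zarr("rechunked.zarr")

assert (src["feature_id"].values == out["feature_id"].values).all(), \
    "feature_id index changed during rechunk"
```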
Problem solved. The data corruption I was experiencing came down to not understanding what the return value of `rechunker.rechunk()` actually represents.

Learnings:

- `rechunker.rechunk()` returns, and a new zarr folder/group appears on disk -- but this does not mean the data was fully written. ONLY the metadata is. The copy itself is a lazy dask graph, so you STILL have to run `.execute()` on the object that `rechunker.rechunk()` returns.
- `rechunker.rechunk()` does not consolidate zarr metadata. Have to do that by hand after the dataset is created -- use `zarr.consolidate_metadata('path/to/zarr')`.
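Putting those together, a minimal sketch of the sequence (store paths, variable names, and chunk sizes are illustrative, not the real workflow's):

```python
import xarray as xr
import zarr
from rechunker import rechunk

ds = xr.open_zarr("source.zarr")

# Build the rechunk plan. Chunk sizes and variable names are placeholders.
plan = rechunk(
    ds,
    target_chunks={
        "streamflow": {"time": 672, "feature_id": 30000},
        "time": None,        # leave coordinate arrays alone
        "feature_id": None,
    },
    max_mem="2GB",
    target_store="rechunked.zarr",
    temp_store="scratch.zarr",
)

# rechunk() returning (and the target store appearing on disk) only means
# the metadata was written. The actual copy is lazy:
plan.execute()

# rechunker does not consolidate zarr metadata; do it by hand afterwards:
zarr.consolidate_metadata("rechunked.zarr")
```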
Attempting to scale the re-chunk pre-processing to a full-sized dataset on the ESIP hub. I think I have the skeleton of the notebook sorted out, but I need a place to write the scratch-space data and the final output.
TIL that the workers on kubernetes can't see the unix filesystem, so I need to write to S3 -- roughly as sketched below. Once we get a hytest-tutorial scratch bucket, I can finish off this cloud tutorial.
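What I have in mind (the hytest-tutorial bucket doesn't exist yet, so these URIs are hypothetical):

```python
import fsspec

# The rechunker temp and target stores can be fsspec mappers pointing
# at S3, so the kubernetes workers write there instead of local disk.
temp = fsspec.get_mapper("s3://hytest-tutorial/scratch/tmp.zarr")
target = fsspec.get_mapper("s3://hytest-tutorial/scratch/rechunked.zarr")
```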
I have plans to adapt it to on-prem data once my PIV card is activated and I'm fully credentialed.
This issue is stalled/blocked until I have HPC access and/or S3 scratch space.
I have credentials to access `nhgf-development`... but am not having much luck with writing data. I can read and `ls`... and `mkdir` does not give errors when I make a folder.
But attempts to write data throw an `AccessDenied` error. I don't know whether this is due to errors in the way I am writing to S3, or a permissions thing that I've not got right.
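For the record, the failure pattern looks like this (the probe path is made up; the calls are roughly what the notebook does via s3fs):

```python
import s3fs

fs = s3fs.S3FileSystem()  # credentials picked up from the environment

fs.ls("nhgf-development")                        # works
fs.mkdirs("nhgf-development/workspace/_probe")   # no error

# This is where AccessDenied shows up:
with fs.open("nhgf-development/workspace/_probe/test.txt", "wb") as f:
    f.write(b"write test")
```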
Example Notebook at:
https://jupyter.qhub.esipfed.org/hub/user-redirect/lab/tree/shared/users/gzt5142/hytest/dataset_preprocessing/ReChunkingData_Cloud.ipynb
Problem solved.... I must have just needed to step away for a couple of days.
I can successfully demo a run of `rechunker` that reads anonymously from `noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr` and rechunks into time-series-friendly chunks, written to `s3://nhgf-development/workspace/testing/tutorial/rechunked.zarr`.
I believe my earlier issue was not using the same credentials for the cluster spin-up as I was using for the S3 write. I had been letting that be handled by environment variables -- naming the credentials explicitly made things work.
In retrospect, this makes perfect sense... the kubernetes workers in the cluster need the correct credentials to write to the S3 output location. Duh. :homer:
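The shape of the fix, sketched. Whether the gateway exposes an `environment` cluster option depends on how the hub is configured, so treat this as illustrative rather than exact:

```python
import os
import s3fs
import xarray as xr
from dask_gateway import Gateway

# Read the source anonymously -- it's a public bucket:
src_fs = s3fs.S3FileSystem(anon=True)
ds = xr.open_zarr(
    src_fs.get_mapper("s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr")
)

# Name the write credentials explicitly rather than trusting whatever
# happens to be in the notebook's environment:
creds = {
    "key": os.environ["AWS_ACCESS_KEY_ID"],
    "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
}
out_fs = s3fs.S3FileSystem(**creds)
target = out_fs.get_mapper(
    "s3://nhgf-development/workspace/testing/tutorial/rechunked.zarr"
)

# ...and hand the same credentials to the kubernetes workers when the
# cluster spins up (option name varies with the gateway's config):
gateway = Gateway()
options = gateway.cluster_options()
options.environment = {
    "AWS_ACCESS_KEY_ID": creds["key"],
    "AWS_SECRET_ACCESS_KEY": creds["secret"],
}
cluster = gateway.new_cluster(options)
client = cluster.get_client()
```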
Gene: tutorial/sample workflow for re-chunking datasets...
Model after https://github.com/NCAR/rechunk_retro_nwm_v21/blob/main/notebooks/usage_example_rerechunk_chrtout.ipynb