leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

HyCOM Public Zarr #163

Open norlandrhagen opened 6 days ago

norlandrhagen commented 6 days ago

I'm at the Pangeo Showcase talk. Shane Elipot has a massive public ocean model Zarr output on the AWS public data program. I think it's split into 12 separate Zarr stores.

https://github.com/selipot/hycom-oceantrack?tab=readme-ov-file

Wondering if LEAP folks would find this useful? @jbusecke

jbusecke commented 5 days ago

Ohhhh that looks really cool! @dhruvbalwada might be interested in this? I guess this will work, but it might not be as fast as on GCS. Wondering if we should have a badge for the 'cloud' a dataset is hosted on? But either way, this would be dope to link in.

dhruvbalwada commented 5 days ago

Would be great to link to this!

norlandrhagen commented 5 days ago

Great! @dhruvbalwada if you have some background on this dataset, do you have any interest in doing a bit of exploring to see which of these Zarr stores would be useful? Seems like there are Zarr stores per variable as well as Lagrangian vs Eulerian versions.

dhruvbalwada commented 5 days ago

@norlandrhagen I think all of these will potentially be useful. (This dataset is very complementary to the LLC4320 data that was made available through Pangeo and has been used by many.)

Is the discussion here just about providing a link to these datasets? Or is this something that will cost LEAP, so that we have some resource constraints?

jbusecke commented 5 days ago

@dhruvbalwada the former. It will be very beneficial to get an idea of how to present these stores in the catalog in a meaningful way.

dhruvbalwada commented 5 days ago

Happy to help with that, let me know what you would like me to actually do.

norlandrhagen commented 5 days ago

Awesome! Thanks for the expertise @dhruvbalwada.

I think a good start would be to see if you can access / catalog these Zarr stores.

I think the data is here, but I haven't explored it yet.

Also might be some clues here.

The data producer / speaker, Shane Elipot, seems super nice and was eager to have people using his data. I bet you/we could reach out to him with questions.

I think ideally we have a table of Zarr stores we want to add to the catalog + some metadata.

ex:

| dataset_name_variable | zarr store link |
|---|---|
| lagrangian_HYCOM_u_component | s3://../../lagrangian_HYCOM_u_component.zarr |
| lagrangian_HYCOM_v_component | s3://../../lagrangian_HYCOM_v_component.zarr |

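
A table like this could be generated rather than written by hand. The following is only a minimal sketch: `StoreEntry`, `entry_from_url`, and `to_markdown` are hypothetical helpers (nothing like them exists in the catalog code as far as this thread shows), and the store link is the same elided placeholder used in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreEntry:
    name: str  # catalog-facing dataset name
    url: str   # link to the Zarr store


def entry_from_url(url: str) -> StoreEntry:
    """Use the last path component, minus '.zarr', as the dataset name."""
    last = url.rstrip("/").rsplit("/", 1)[-1]
    name = last[:-len(".zarr")] if last.endswith(".zarr") else last
    return StoreEntry(name=name, url=url)


def to_markdown(entries) -> str:
    """Render the entries as a two-column markdown table."""
    lines = ["| dataset_name_variable | zarr store link |", "|---|---|"]
    lines += [f"| {e.name} | {e.url} |" for e in entries]
    return "\n".join(lines)


# Placeholder link copied from the example table; the real path is not filled in.
entries = [entry_from_url("s3://../../lagrangian_HYCOM_u_component.zarr")]
print(to_markdown(entries))
```

Deriving the name from the store path keeps the two columns from drifting apart, but it assumes the store names are descriptive enough to serve as catalog names.
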
jbusecke commented 4 days ago

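Just played around with the data a bit, and wanted to note some points:

This works:

```python
import s3fs
import xarray as xr

# Anonymous access: the bucket is public, so no AWS credentials are needed
fs = s3fs.S3FileSystem(anon=True)
# Wrap the store path in a key-value mapper that xarray's zarr engine can read
mapper = fs.get_mapper("s3://hycom-global-drifters/lagrangian/global_hycom_0m_step_1.zarr")
xr.open_dataset(mapper, engine='zarr')
```

but this doesn't:

```python
xr.open_dataset("s3://hycom-global-drifters/lagrangian/global_hycom_0m_step_1.zarr", engine='zarr')
```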

We might need a way for the catalog to add custom kwargs to the snippet due to this!
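
One way the catalog could carry such kwargs is to store them per entry and render them into the snippet. This is a rough sketch, not the catalog's actual code: `render_snippet` is a hypothetical helper, and it assumes the plain-URL call fails only because anonymous access is not requested, which has not been verified here. It passes `storage_options` through `backend_kwargs`, which xarray's zarr engine forwards to fsspec.

```python
from typing import Any, Dict, Optional


def render_snippet(url: str, storage_options: Optional[Dict[str, Any]] = None) -> str:
    """Return a copy-pasteable xarray snippet for one catalog entry."""
    call = f'ds = xr.open_dataset("{url}", engine="zarr"'
    if storage_options:
        # Only emit the extra argument for entries that need it,
        # so snippets for other stores stay as short as before.
        call += f', backend_kwargs={{"storage_options": {storage_options!r}}}'
    return "\n".join(["import xarray as xr", "", call + ")"])


print(render_snippet(
    "s3://hycom-global-drifters/lagrangian/global_hycom_0m_step_1.zarr",
    storage_options={"anon": True},
))
```

Entries without custom kwargs render exactly the one-line call, so existing catalog entries would be unaffected.
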

* This dataset has a lot of different 'steps'. I have no clue if we could potentially virtually concatenate these?

norlandrhagen commented 4 days ago

This seems like a cool use case! Maybe we should open an issue in VirtualiZarr. It seems possible to merge the virtual Zarr stores.
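
Until that is explored, a plain (non-virtual) lazy concatenation with xarray might serve as a baseline. The sketch below makes several unverified assumptions: the URL template is inferred from the single store path quoted earlier in the thread, the number of steps is unknown, and stacking along a new `step` dimension only makes sense if every store shares the same variables and dimensions. `step_urls` and `open_steps` are hypothetical names, and this is ordinary `xr.concat`, not VirtualiZarr.

```python
def step_urls(template: str, steps) -> list:
    """Fill a URL template containing '{step}' for each requested step number."""
    return [template.format(step=i) for i in steps]


def open_steps(urls):
    """Open each store lazily and stack the results along a new 'step' dimension."""
    # Imported here so step_urls() can be used without these packages installed.
    import s3fs
    import xarray as xr

    fs = s3fs.S3FileSystem(anon=True)  # public bucket, no credentials
    # chunks={} asks for dask-backed arrays (requires dask), keeping the data lazy.
    datasets = [
        xr.open_dataset(fs.get_mapper(url), engine="zarr", chunks={})
        for url in urls
    ]
    return xr.concat(datasets, dim="step")


# Template inferred from one known store; the range of valid steps is a guess.
TEMPLATE = "s3://hycom-global-drifters/lagrangian/global_hycom_0m_step_{step}.zarr"
urls = step_urls(TEMPLATE, range(1, 4))
# ds = open_steps(urls)  # needs network access, s3fs, xarray, and dask
```
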