pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Example pipeline for AMPS output stored at NCAR #8

porterdf opened 4 years ago

porterdf commented 4 years ago
## Source Dataset

Retrieve output from the AMPS archive on NCAR's HPSS (a tape archive quickly nearing its end of life) for public storage on Google Cloud Storage and access via Pangeo and, more generally, xarray.

The Antarctic Mesoscale Prediction System (AMPS) is a real-time weather forecasting tool primarily in support of the NSF's United States Antarctic Program (USAP). It consists of the assimilation of surface and upper-air observations into the Weather Research and Forecasting (WRF) model, forced at its boundaries by the GFS model. There are two outer nested pan-Antarctic domains with several additional higher-resolution domains over areas of interest.

## Transformation / Alignment / Merging

The raw WRF output files are in NetCDF (and GRIB) format and usable by many software packages; however, several common post-processing procedures can make the data both smaller and more usable (e.g. converting to pressure-level data, subsetting fields, and de-staggering winds).

## Output Dataset

In both cases, these raw or post-processed NetCDF files should be converted to a Zarr store for efficient use in cloud-based xarray routines. Ideally this conversion (e.g. using the xarray `to_zarr` method) would occur either on the NCAR HPC or within Pangeo or a similar cloud computing environment.

rabernat commented 4 years ago

This is an interesting example because it requires ssh access to the supercomputer. That sounds very hard to automate. For example, my NCAR login uses two-factor authentication--it's impossible to use from a script.

We would want to consult with CISL about the best way to do this. The page on data transfers has lots of useful information.

One thing that might work is the following:

davidbrochart commented 4 years ago

it requires ssh access to the supercomputer. That sounds very hard to automate

Could pexpect help?
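For context, pexpect automates interactive sessions by pattern-matching prompts. A minimal sketch, assuming a password prompt (a 2FA token would still need to be supplied interactively or via a one-time passcode); the host, prompt strings, and `hsi` invocation are hypothetical, and pexpect is a third-party package, imported lazily so the sketch loads without it:

```python
def hsi_get_command(hpss_path):
    """Build an `hsi get` command for a file on HPSS (hypothetical helper)."""
    return f'hsi "get {hpss_path}"'

def fetch_from_hpss(host, user, hpss_path, password):
    """Drive an ssh session with pexpect to pull a file off HPSS.

    Prompt strings vary by site; these are placeholders.
    """
    import pexpect  # third-party dependency

    child = pexpect.spawn(f"ssh {user}@{host}")
    child.expect("Password:")           # site-specific prompt
    child.sendline(password)
    child.expect(r"\$")                 # wait for a shell prompt
    child.sendline(hsi_get_command(hpss_path))
    child.expect(r"\$", timeout=3600)   # HPSS retrievals can be slow
    child.sendline("exit")
```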

porterdf commented 4 years ago

Yes, I figured this use case was just different enough from the existing staged recipes, and potentially more broadly applicable, that its challenges are worth thinking about.

I am currently doing (most of) what the recipe describes, though much of it incrementally and by hand. Some scripts retrieve data from HPSS in a 'friendly' way, while others transfer it either directly to a Google Cloud bucket or to our group's Linux server.

This example pipeline was submitted at the suggestion of @tjcrone, who's working with Jonny Kingslake and me on some glaciology sub-projects.

rabernat commented 3 years ago

If we can get a bakery running somewhere inside the NCAR network, that should make this possible.

See https://github.com/pangeo-forge/pangeo-forge/issues/41 for GRIB inputs.

rabernat commented 3 years ago

This should be ready to go now, if we can figure out a way to automatically suck files out of HPSS / Glade. @mgrover1 mentioned that he might be able to work with the CISL folks to help figure this out.

Max, what kind of options do you see? Globus would probably be the default way to go. But Globus works pretty differently than our existing setup. I also imagine the authentication is a pain. It would be great if we could use one of the existing implemented fsspec protocols to pull data from NCAR.

mgrover1 commented 3 years ago

It seems like Globus is the recommended way to go... do you have sample file paths I can test with?

rabernat commented 3 years ago

You would need to ask @porterdf, the creator of this example.

However, I have concerns about globus, as noted above. There are many obstacles to implementing globus-based transfer, most practically, the lack of an fsspec implementation for globus. I also find the globus API extremely confusing. Is there any way we could just use http, scp, ftp, anything else from the list?
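For reference, fsspec's registry shows which protocols it knows how to construct (whether each one is usable depends on optional dependencies, e.g. paramiko for sftp); there is no globus entry. A quick way to check:

```python
from fsspec.registry import known_implementations

# Protocols fsspec can construct (installation of the backing
# library is still required to actually use each one)
for proto in ("http", "sftp", "ftp"):
    print(proto, "->", known_implementations[proto]["class"])

print("globus registered?", "globus" in known_implementations)
```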

mgrover1 commented 3 years ago

Okay - I will reach out to him.

There are scp and sftp options, although the CISL documentation explicitly states "They are best suited for transferring small numbers of small files (for example, fewer than 1,000 files totaling less than 200 MB). For larger-scale transfers, we recommend using Globus."

porterdf commented 3 years ago

I've been transferring files over in piecemeal fashion, mostly using the methodology sketched out in our repo's readme: pangeo-AMPS

However, there is still more AMPS output we will be moving to GCP eventually, so I'm happy to contribute any way I can and kick it down the road a bit further. I've created a directory on Glade for some files to be transferred; perhaps they are ripe for testing a recipe?

/glade/scratch/porterdf/AMPS/WRF_24/domain_03

mgrover1 commented 3 years ago

@rabernat would the bakery move all these files at the same time (such that, if they were in small chunks, they would be under the recommended limit)? I am planning on testing this workflow with the CESM2-LE dataset as well... there is also an internal S3-like storage system (called Stratus) that I found out about this week (not sure if there would be a way to "stage" files there)?

rabernat commented 3 years ago

would the bakery move all these files at the same time

The default mode of operation is just to start a couple of simultaneous HTTP transfers.

I think that we should find a way to use globus for these transfers from NCAR. The transfer would have to occur outside of the recipe itself. We will have to think about how to integrate pangeo forge with other "file transfer services".
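The "couple of simultaneous HTTP transfers" pattern is essentially a small thread pool mapping a fetch function over the input URLs. A stdlib sketch, with the fetch function and worker count as placeholders for what a bakery would actually run:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    """Placeholder: download one file and return its bytes."""
    with urlopen(url) as resp:
        return resp.read()

def transfer_all(urls, fetch=fetch, max_workers=2):
    """Run a few transfers concurrently and collect the results by URL."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

A Globus-mediated transfer would not fit this model, which is why it would have to happen outside the recipe itself.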

there is also an internal s3-like storage system (called stratus) which I found out about this week (not sure if there would be a way to "stage" files on there)?

If it's on the public internet, yes, this would be useful.

porterdf commented 3 years ago

I had no problems using Globus to transfer WRF output from NCAR to a local server, but there are additional steps/features/premium options needed to set up a GCP bucket as an endpoint (I hit this wall and reverted to parallel rsync'ing through gsutil).
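For the record, the gsutil fallback mentioned here is its multi-threaded rsync mode; the source directory and bucket name below are placeholders:

```shell
# -m enables parallel (multi-threaded/multi-process) operation;
# rsync -r mirrors the local directory tree into the bucket
gsutil -m rsync -r /path/to/local/AMPS/domain_03 gs://YOUR_BUCKET/amps/domain_03
```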

jkingslake commented 3 years ago

Just a note here to say that I have been looking at ISMIP6 data as another option for a pangeo-forge recipe and it is stored at ghub, who pointed me to Globus to access the data. @porterdf, I suspect I ran into the same issues you did when looking into setting up an endpoint in a GCS bucket. Did you find these instructions and then get lost, like I did?

rabernat commented 3 years ago

I think we need to talk to the Globus folks about the best way to have Globus work with Pangeo Forge.

ricardobarroslourenco commented 2 years ago

@rabernat any updates on pinging someone from Globus on this?

mgrover1 commented 2 years ago

@ricardobarroslourenco I invited some of the Globus folks to the meeting on Monday, and pointed them to #222 over in the Pangeo Forge Recipes repo

rabernat commented 2 years ago

We made some progress on globus in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/222. The key trick is to create a Globus Guest Collection, which allows you to access the data over HTTP. So using this mechanism, we can access globus data today with Pangeo Forge.
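Since a Guest Collection serves its files as plain HTTPS objects, any HTTP client (including fsspec's https protocol) can fetch them. A sketch of building such a URL; the collection domain shown is hypothetical, and the exact domain for a real collection is displayed in the Globus web app:

```python
from urllib.parse import quote

def guest_collection_url(collection_domain, path):
    """Build the HTTPS URL for a file in a Globus Guest Collection.

    `collection_domain` is the collection's HTTPS domain
    (e.g. 'g-abc123.0000.data.globus.org' -- hypothetical);
    `path` is the file's path within the collection.
    """
    return f"https://{collection_domain}/{quote(path.lstrip('/'))}"
```

The resulting URL can then be passed to xarray or fsspec as an ordinary `https://` input.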

However, I have learned from CISL support that NCAR does not have a Globus v5 subscription yet, which means it is not possible to create guest collections on Glade. Anyone reading this who wants to move the issue forward should reach out to NCAR and encourage them to upgrade their subscription.

We can also continue to pursue a Globus client for Pangeo Forge. That would allow us to use the existing recommended way to share NCAR data via Globus. Because that requires a Globus login (rather than vanilla HTTP access), it is more complicated on our end.

jordanplanders commented 1 year ago

@rabernat I was interested in fixing up a recipe for iTRACE output, but came up against an authentication issue. I don't have a nuanced understanding of what data lives where and is accessible by what means, but in the course of hoping that a Globus-NCAR pipeline had sprung up recently, I stumbled across a headline that maybe NCAR upgraded its subscription.

(Also, thanks for all the useful tidbits of information you seem to pepper this corner of the internet with. On multiple occasions, your breadcrumbs have either solved my problem or made it clear that it would require an entirely different approach.)

TomNicholas commented 1 year ago

However, I have learned from CISL support that NCAR does not have a Globus v5 subscription yet, which means it is not possible to create guest collections on Glade.

FYI I checked and NCAR CISL confirmed that they have since transitioned to Globus v5 (in November 2022). (cc @jbusecke)