hytest-org / hytest

https://hytest-org.github.io/hytest/

Storage space for intermediate datasets from tutorials on Qhub #12

Closed alaws-USGS closed 1 year ago

alaws-USGS commented 2 years ago

Where to store intermediate datasets created by users working through tutorials on Qhub?

What follows is the start of the conversation from MS Teams (selected and shortened responses):

@alaws-USGS: Where were we going to have users store intermediate data on the S3 bucket?

@amsnyder: What are the intermediates, though? The files I can think of in the pipeline are the two I mentioned: the output scorecard results and maybe some visualizations, if we wanted to save them somewhere.

@thodson-usgs: Users would need write access for the S3 then? Maybe the server fs is a better scratch space.

@alaws-USGS: Data would be things such as CONUS404 or NWM clipped to an AOI, or reference data (vector and raster) clipped to an AOI, to be used between notebooks.

@thodson-usgs: Yeah, use the server fs and we'll explore other options when needed.

@amsnyder: If it's really big and that's a problem, then we consider doing the processing for the user (for all spatial units at the needed level) and sticking the output on S3. Is this being done on HPC?

@alaws-USGS: QHub.

@amsnyder: Can you try it out and see if you can write to your filespace there?

thodson-usgs commented 2 years ago

I think so long as you write to relative (or home-relative) paths you'll be fine, e.g. '~/project_data/',

or else set the appropriate root workspace as an environment variable, depending on the server.
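A minimal sketch of that pattern (the SCRATCH_ROOT variable name and the ~/project_data fallback are assumptions, not an established QHub convention):

```python
import os
from pathlib import Path

# Use a deployment-specific environment variable if one is set,
# otherwise fall back to a directory under the user's home.
scratch = Path(os.environ.get("SCRATCH_ROOT", "~/project_data")).expanduser()
scratch.mkdir(parents=True, exist_ok=True)

out_path = scratch / "intermediate.zarr"
```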

alaws-USGS commented 2 years ago

@thodson-usgs, that would be a good note to add to the notebook, but I think we need a definitive spot for those working on HyTEST to use.

gzt5142 commented 2 years ago

Specific to tutorial/learning objectives -- I'm doing a quick explainer on re-chunking and would like to have the learner write a (large) re-chunked dataset someplace temporary as part of the learning objectives of the tutorial.

Currently, I am mimicking an NCAR notebook, reading data from s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr. The output dataset will be restricted to about 7k feature_ids, but it is still sizeable. Where is the right place for me to write that data? It won't persist 24 hours, I expect.
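For context, the read-and-subset step is roughly the following (a sketch: it assumes anonymous read access to the public bucket, and the subset_ids list and local output path are placeholders):

```python
import fsspec
import xarray as xr

# Open the public NWM retrospective zarr store anonymously.
fs = fsspec.filesystem("s3", anon=True)
ds = xr.open_zarr(fs.get_mapper("noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr"))

# Restrict to the feature_ids of interest (placeholder values;
# the real list has about 7k entries).
subset_ids = [101, 102, 103]
subset = ds.sel(feature_id=subset_ids)

# Write the (still sizeable) result to temporary local storage.
subset.to_zarr("scratch/chrtout_subset.zarr", mode="w")
```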

amsnyder commented 2 years ago

So the hope is that we have some consolidated space for the user to write these files to. Then we would clear that space on some kind of schedule and tell the user they need to download a copy to another space if they want to maintain it long-term. Is this what we are hoping for?

amsnyder commented 2 years ago

We do not have a good setup for this right now. The S3 space we have requires credentials to write to, and we can't just share the current credentials with every user. We could look into having a publicly writeable space...but perhaps the AWS storage pod that's coming in a few months is our solution? How much of a blocker is it to get the answer to this question now? Can you write to the space where the notebook is for now and we swap out that line of code once we find a location?
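One way to make that later swap a true one-line change is to route every write through a single root URL via fsspec (the names here are illustrative, not an agreed convention):

```python
import fsspec
import xarray as xr

# For now: the local filesystem where the notebook runs.
OUTPUT_ROOT = "./tutorial_output"
# Later, once a shared location exists, swap just this line, e.g.:
# OUTPUT_ROOT = "s3://some-writable-bucket/tutorial_output"

ds = xr.Dataset({"x": ("t", [1.0, 2.0, 3.0])})  # stand-in dataset
ds.to_zarr(fsspec.get_mapper(f"{OUTPUT_ROOT}/demo.zarr"), mode="w")
```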

sfoks commented 2 years ago

Specific to tutorial/learning objectives -- I'm doing a quick explainer on re-chunking and would like to have the learner write a (large) re-chunked dataset someplace temporary as part of the learning objectives of the tutorial.

Currently, I am mimicking an NCAR notebook, reading data from s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr. The output dataset will be restricted to about 7k feature_ids, but it is still sizeable. Where is the right place for me to write that data? It won't persist 24 hours, I expect.

Maybe a separate topic, but would the notebooks on ERA5 rechunking help Gene (@gzt5142)? These haven't been moved over to hytest-org yet, but @rsignell-usgs had some work in here on ERA5 pull and rechunking. It might be helpful for writing the tutorial.

gzt5142 commented 2 years ago

Can you write to the space where the notebook is for now and we swap out that line of code once we find a location?

Yes. Easy.

From a tutorials/learning perspective, it would be ideal (but perhaps not realistic) to mimic what the real-world workflow looks like in terms of the storage mechanism... if the real-world workflow is to write to S3 somewhere, the tutorial should too (though very likely to a different bucket). That can come later... I can write to the local file system in the meantime.

So... I am taking 'locally' to mean writing to the filesystem on QHub. Is that right?

amsnyder commented 2 years ago

Right! For now, just write to wherever you're running the notebook on QHub. @rsignell-usgs and I can discuss more permanent solutions when he is back. Maybe it is our storage pod, or maybe we need a publicly-writeable S3 bucket that we wipe on some schedule.

sfoks commented 2 years ago

As for options for users, this is perhaps a bigger discussion. Maybe we have:

scratch to intermediate [some name here]: cleared of files older than one week

intermediate-longterm [some name here]:

alaws-USGS commented 1 year ago

@amsnyder and @rsignell-usgs, have you had an opportunity to discuss the publicly writeable S3 bucket mentioned above? The WIM F&E work could use the space right now and will definitely need it in a week or two.

@sfoks, I think your divisions of storage solutions are spot on.

amsnyder commented 1 year ago

No, we have not discussed it yet - in my mind this was fairly low priority because it isn't blocking any work from being done (you can just save files to whatever system you are working on). Am I missing any information that would bump up the priority of this?

gzt5142 commented 1 year ago

I just learned that we have access to a temporary workspace available at

s3://nhgf-development/workspace/

I will be updating cloud-focused tutorials and demos to write to that space. There is no purge policy that I know of, so the good-neighbor behavior is to clean up after writing temp data there.
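A sketch of that write-then-clean-up pattern (it assumes S3 credentials for the bucket are already configured in the environment; the subdirectory name is a placeholder):

```python
import fsspec
import xarray as xr

# Assumes AWS credentials for nhgf-development are already configured
# (e.g., via environment variables or an AWS profile).
fs = fsspec.filesystem("s3")
target = "nhgf-development/workspace/rechunk-demo/output.zarr"  # placeholder subdir

ds = xr.Dataset({"streamflow": ("t", [1.0, 2.0, 3.0])})  # stand-in data
ds.to_zarr(fs.get_mapper(target), mode="w")

# Good-neighbor cleanup: there is no purge policy, so remove temp
# data once you are done with it.
fs.rm("nhgf-development/workspace/rechunk-demo", recursive=True)
```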

alaws-USGS commented 1 year ago

@amsnyder, with the work being done on QHub, the intermediate xarray datasets may be too large to write to the ESIP QHub workspace. I will try it and let you know for sure. I will also try the S3 bucket mentioned above.

gzt5142 commented 1 year ago

No, we have not discussed it yet - in my mind this was fairly low priority because it isn't blocking any work from being done (you can just save files to whatever system you are working on). Am I missing any information that would bump up the priority of this?

It's not a high priority from my perspective... although it is causing me to table a task that I'm almost done with. It came up for me this morning as I was working with some dask workers that need to write out data -- and S3 is where the Kubernetes workers can write (they can't see the local file system on ESIP QHub).

thodson-usgs commented 1 year ago

@gzt5142, would you link to the relevant NCAR notebook here? I'm trying to understand the use case better. Or your notebook, whichever you think would be more informative.

rsignell-usgs commented 1 year ago

When we use the Dask workers on Kubernetes, we achieve great data throughput because each container is independently reading and writing to object storage (S3 on AWS). These workers can't even see the filesystem on the JupyterHub! So we need to run rechunking workflows on the cloud using object storage. Individual users should probably create their own directories under s3://nhgf-development/workspace/ so that they don't overwrite each other's stuff!

Also, @gzt5142, it's not just the intermediate files that should be written to S3, but the final rechunked data as well!
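A tiny sketch of that per-user-directory convention (deriving the prefix from the login name is just one possible scheme):

```python
import getpass

# Derive a per-user prefix so concurrent users don't clobber each other.
user = getpass.getuser()
workspace = f"s3://nhgf-development/workspace/{user}/rechunk-demo.zarr"
```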

amsnyder commented 1 year ago

s3://nhgf-development/workspace/ is not publicly writeable. I can share the credentials to write there with @alaws-USGS and @gzt5142 as an intermediate solution, but it is not a solution for our users. Let me know if you would like those credentials for now.

@rsignell-usgs - can the Open Storage Network Pod have a publicly writeable location that our users could write to, and we wipe out on some schedule? Do you have a time estimate of when that comes online? I think it was later this year?

gzt5142 commented 1 year ago

@gzt5142, would you link to the relevant NCAR notebook here? I'm trying to understand the use case better. Or your notebook, whichever you think would be more informative.

https://jupyter.qhub.esipfed.org/hub/user-redirect/lab/tree/shared/users/gzt5142/hytest-workbook/L2/Pre/ReChunkingData_Cloud.ipynb

This is a cloud-focused extension to the re-chunking demo/tutorial at

https://jupyter.qhub.esipfed.org/hub/user-redirect/lab/tree/shared/users/gzt5142/hytest-workbook/L2/Pre/ReChunkingData.ipynb

These are both under development... I've not pushed them to the proper repo yet; they kind of grew out of my experimentation with jupyterbook. I'll relocate them as soon as I get a landing spot sorted.

rsignell-usgs commented 1 year ago

s3://nhgf-development/workspace/ is not publicly writeable. I can share the credentials to write there with @alaws-USGS and @gzt5142 as an intermediate solution, but it is not a solution for our users. Let me know if you would like those credentials for now.

@rsignell-usgs - can the Open Storage Network Pod have a publicly writeable location that our users could write to, and we wipe out on some schedule? Do you have a time estimate of when that comes online? I think it was later this year?

Yes, we could have a writable bucket for our users on the OSN pod that we could wipe on a schedule, but you still need credentials to write to any bucket. So I'm not sure why the same approach wouldn't work for our users with s3://nhgf-development/workspace (or some other CHS bucket).

The $$ for the OSN pod arrived at WHOI on Friday, one day after the quote for the OSN pod expired. So we are getting a new quote. I'll ask what the ETA for regular use of the OSN pod might be.

amsnyder commented 1 year ago

Ok here's what I'm thinking as a long-term solution:

This will allow any USGS employee to use our workflows on the cloud. For any external user that wants to use our workflows on the cloud, we expect them to set up their own user credentials and scratch bucket for writing.

In the meantime, we will tell any user that wants to use our workflows to set up their own bucket and credentials using their project funds.

Does this seem reasonable?

alaws-USGS commented 1 year ago

@amsnyder, sounds like a good plan to me :)

alaws-USGS commented 1 year ago

It's not a high priority from my perspective... although it is causing me to table a task that I'm almost done with. It came up for me this morning as I was working with some dask workers that need to write out data -- and S3 is where the Kubernetes workers can write (they can't see the local file system on ESIP QHub).

For writing to QHub, I was going to see if some of the built-in implementations from fsspec would allow us to interface with it. I'm thinking that either fsspec.implementations.jupyter.JupyterFileSystem or fsspec.implementations.dask.DaskWorkerFileSystem might work (see the fsspec documentation).

thodson-usgs commented 1 year ago

I'm slow; is this a Rechunker discussion?

@amsnyder, what region is the source data in? It should be the same as the OSN pod, right?

Also consider: 1) in general, the demo notebook will clean up the intermediate bucket after rechunker executes; 2) a cron job that wipes old files from the bucket.

2 alone is safe and sufficient, but consider 1 as "best practice" for teaching purposes.
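Option 1 might look roughly like this in the demo notebook (a sketch using the rechunker API; the store paths, chunk sizes, and memory limit are placeholders). Option 2 would be a separate scheduled job that lists objects under the prefix and deletes anything whose LastModified timestamp is older than the cutoff:

```python
import fsspec
import xarray as xr
from rechunker import rechunk

fs = fsspec.filesystem("s3")  # assumes credentials are configured
source = xr.open_zarr(fs.get_mapper("some-bucket/source.zarr"))    # placeholder
temp_store = fs.get_mapper("some-bucket/intermediate/tmp.zarr")    # placeholder
target_store = fs.get_mapper("some-bucket/intermediate/out.zarr")  # placeholder

# Rechunk via an intermediate store, then execute the (dask-backed) plan.
plan = rechunk(source,
               target_chunks={"time": 8760, "feature_id": 1000},  # placeholder sizes
               max_mem="2GB",
               target_store=target_store,
               temp_store=temp_store)
plan.execute()

# Option 1: the notebook cleans up its own intermediate store afterwards.
fs.rm("some-bucket/intermediate/tmp.zarr", recursive=True)
```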

rsignell-usgs commented 1 year ago

@thodson-usgs , I think this is really a "how can users write to object storage" discussion, which is a key part of cloud-native workflows. Rechunking is just one example that needs it. We want people reading from and writing to object storage -- using filesystems on JupyterHub is not the way to go.

I think Amelia's proposal is sound.

For writing to QHub, I was going to see if some of the built-in implementations from fsspec would allow us to interface with it. I'm thinking that either fsspec.implementations.jupyter.JupyterFileSystem or fsspec.implementations.dask.DaskWorkerFileSystem might work (see the fsspec documentation).

I might be misunderstanding, but I don't think we need this. We can write to any bucket (CHS, ESIP, OSN) from anywhere as long as we have the credentials.

amsnyder commented 1 year ago

Good question, Tim - maybe we are getting our wires crossed here. Dataset pre-processing (rechunking), which uses dask, is done by HyTEST, and we already have a space to write that data. A user does not need to write that data. The outputs of the model eval analysis and visualization notebooks don't need dask, so can't we write in QHub if they are running the notebooks on the cloud?

There may be a very small use case for a user who is using dask and needs to write out data while doing a tutorial on how to rechunk data, but that is the only thing I can think of. Or maybe if a user is trying to recreate our whole pipeline for a new dataset (and they are rechunking the data with dask to get it ready). Is this what we are talking about, @gzt5142 and @alaws-USGS?

I think our data is in us-west-2 on AWS. @rsignell-usgs would need to answer any questions about the pod because I don't know too much about it.

And I was also thinking of a cron job to clean the bucket - so glad we are thinking the same thing!

thodson-usgs commented 1 year ago

Right, I thought the "Storage space for intermediate datasets" was primarily for the rechunking demo use case. I think we agree though, that most workflows won't require an "intermediate" bucket.

rsignell-usgs commented 1 year ago

But any data we create (in demos or otherwise) should be written to object storage.

Regarding the pod: the OSN pod will be physically located in western Massachusetts, but it's on a 100GbE+ network, and so is AWS. So access from us-west-2 compute (ESIP QHub, USGS Pangeo/QHub on AWS) is actually very speedy (at least it was in the initial testing I did to see if this was viable).

If it lives for 5 years, the OSN pod will be 5x cheaper for storage (compared to AWS S3), requires no credentials to read, and has no egress costs. And it supports the same S3 API.
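Since the pod speaks the same S3 API, anonymous reads would presumably look like an ordinary s3fs call pointed at the pod's endpoint (the endpoint URL and bucket below are hypothetical, since the pod isn't online yet):

```python
import fsspec
import xarray as xr

# Same S3 API, different endpoint; no credentials needed for reads.
fs = fsspec.filesystem(
    "s3",
    anon=True,
    client_kwargs={"endpoint_url": "https://osn-pod.example.org"},  # hypothetical
)
ds = xr.open_zarr(fs.get_mapper("some-osn-bucket/demo/data.zarr"))  # hypothetical
```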

gzt5142 commented 1 year ago

I think this is really a "how can users write to object storage" discussion, which is a key part of cloud-native workflows. Rechunking is just one example that needs it.

This is how I am thinking of it. The original use case in my mind is to provide a temporary space where tutorials can write data (or read sample datasets) to emulate workflows used in 'production'. Rechunking is one example. I suppose it would be just as useful to run these tutorials on-prem and write to a temporary space on a filesystem -- but that would remove the cloud/hub platform as a workspace.

gzt5142 commented 1 year ago

Success in writing to the OSN object storage in testing with NWM pre-processing workflows/tutorials.

I will refactor all of my big-data writes to target this space to demo the process... I understand that we may not want tutorial readers to replicate this by writing to the same space -- still a TBD decision.

alaws-USGS commented 1 year ago

Based on @gzt5142's success with writing to OSN, and the outcome of previous conversations about using OSN to store the tutorial intermediate datasets, this issue can be closed as resolved.