hytest-org / hytest

https://hytest-org.github.io/hytest/

Evaluation Training Compute Platform #141

Closed by amsnyder 1 year ago

amsnyder commented 1 year ago

Question posed to @rsignell-usgs and me by @sfoks:

`Evaluation` is supposed to have a training class on model benchmarking sometime in the spring (Mar 7, 8, 9 are tentative dates). We are wondering what compute platform to run this training on... It would be a mix of lecture, lab (do it yourself), and discussion. I'm not entirely sure of the class size yet or what attendee limit we should impose. Depending on the class size, I'm guessing we have some different options for where to run this class (HPC: Tallgrass? Denali? esip-qhub?). What do you two think would be best?

amsnyder commented 1 year ago

Rich is probably better equipped to answer this, given that I only have a limited understanding of the process to reserve nodes on the HPCs, but my initial thoughts are

rsignell-usgs commented 1 year ago

I would love to run that on Qhub (now Nebari) on CHS on a WMA dev account. We have a plan to make that happen in the next quarter.

But even if we can't get that deployed, with that much lead time we could work to make sure that the current https://pangeo.chs.usgs.gov conda environments are set up to run the notebooks needed for the course.

I'd much rather run the training on the cloud than on Denali or Tallgrass, as those machines are more appropriate for running models and ML, and we are trying to show the power of cloud for analysis/visualization/collaboration.

gzt5142 commented 1 year ago

For the notebooks I had a hand in writing, the cloud platform is going to be better than the HPC: they were designed and tested against the cluster and gateway configuration on that platform.

Those notebooks read and write substantial data using object storage. Credentials will be a concern for a learning environment: we need to make sure attendees can write sample outputs to a suitable spot.
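To make that pattern concrete, here is a minimal sketch of the kind of object-storage read/write these notebooks do, assuming the usual pangeo stack (fsspec/s3fs, xarray, zarr); the bucket names and paths below are illustrative, not the actual workshop locations:

```python
import fsspec
import xarray as xr

# Credentials are resolved from ~/.aws (or AWS_* environment variables),
# so each attendee needs write access to some scratch prefix.

# Read a shared input dataset (hypothetical path, for illustration only).
ds = xr.open_zarr(fsspec.get_mapper("s3://hytest-workshop/inputs/streamflow.zarr"))

# Write a small per-user output back to a hypothetical scratch bucket.
out = fsspec.get_mapper("s3://hytest-scratch/users/example-user/results.zarr")
ds.to_zarr(out, mode="w")
```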

sfoks commented 1 year ago

Alright, it sounds like cloud is best: either pangeo.chs.usgs.gov or Nebari on a WMA dev account.

Should there be a class size limit for this? We would be focusing on the streamflow benchmarking notebooks Gene @gzt5142 has rewritten. I could see an attendee writing 200-500 KB of data, so not a whole lot being written out per user (ideally).

sfoks commented 1 year ago

@rsignell-usgs @amsnyder any updates on getting Qhub on CHS?

rsignell-usgs commented 1 year ago

We are hoping to hire a contractor to help get this going. Had a good interview with a candidate yesterday.

sfoks commented 1 year ago

For either training environment (pangeo.chs.usgs.gov or Nebari on CHS), would all class participants need to be added to particular groups, or would they just need CHS.usgs accounts? I'm trying to remember the process for this...

sfoks commented 1 year ago

Talked with Sam Congdon and Courtney Neu; they say pangeo.chs.usgs.gov is okay and to give them a list of participants at least 1 week in advance. Thanks everyone!

sfoks commented 1 year ago

The nebari-workshop training platform worked really well for us! Thanks again for setting this up!

rsignell-usgs commented 1 year ago

We used the nebari-workshop training platform yesterday for a 2-hour clinic at the CSDMS meeting here in Boulder. We had 34 users who logged into the platform and successfully ran a variety of notebooks, firing up Dask clusters. I didn't show anyone how to log in until I had given an initial PowerPoint overview and run a demo (to hopefully prevent people dinking around on the platform while I was presenting).
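For reference, the in-notebook pattern for firing up a cluster on a platform like this is roughly the standard dask-gateway one; the worker count below is illustrative, and the available cluster options are whatever the deployment defines:

```python
from dask_gateway import Gateway

# Nebari configures the gateway address for sessions running on the hub,
# so no arguments are needed here.
gateway = Gateway()

options = gateway.cluster_options()    # platform-defined worker sizes, images, etc.
cluster = gateway.new_cluster(options)
cluster.scale(4)                       # illustrative worker count for a short clinic

client = cluster.get_client()          # attach a dask.distributed Client
print(client.dashboard_link)           # watch the computation in the dashboard
```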

After the initial presentation/demo, we followed this approach for onboarding to the training platform:

  1. attendees log in to https://nebari-workshop.esipfed.org/ with their email address and the initial password "12"
  2. they are prompted to change their password
  3. they pick the "normal server"
  4. they open a terminal from the launcher
  5. they type `python /shared/users/start.py`

The `start.py` script creates their folder in the shared area, clones the workshop repo into that folder, and also copies bucket credentials to their `~/.aws` folder.
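For anyone reproducing this setup, here is a minimal sketch of what a script like `start.py` could look like, based only on the description above; the repo URL and credentials path are placeholders, not the real values:

```python
#!/usr/bin/env python
# Hypothetical reconstruction of start.py from the description above;
# the repo URL and credentials path are placeholders, not the real values.
import getpass
import shutil
import subprocess
from pathlib import Path

user = getpass.getuser()

# 1. Create the user's folder in the shared area.
user_dir = Path("/shared/users") / user
user_dir.mkdir(parents=True, exist_ok=True)

# 2. Clone the workshop repo into that folder (URL is a placeholder).
repo_dir = user_dir / "workshop"
if not repo_dir.exists():
    subprocess.run(
        ["git", "clone", "https://github.com/example-org/workshop.git", str(repo_dir)],
        check=True,
    )

# 3. Copy shared bucket credentials to ~/.aws so notebooks can write to S3.
aws_dir = Path.home() / ".aws"
aws_dir.mkdir(exist_ok=True)
shutil.copy("/shared/users/aws-credentials", aws_dir / "credentials")
```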