hytest-org / hytest

https://hytest-org.github.io/hytest/

Deploy AWS parallel cluster in USGS cloud #380

Open amsnyder opened 9 months ago

amsnyder commented 9 months ago

Deploy a basic AWS ParallelCluster instance (no software pre-installed) in the WMA AWS account with 6 types of compute instances available:

amsnyder commented 8 months ago

Working with Brendan Wakefield to build images with Image Factory that will be used to deploy AWS ParallelCluster. There will be 4 images: parallel cluster, parallel cluster + WRF, parallel cluster + WRF-Hydro, and parallel cluster + NHM-PRMS.

ronald50928 commented 8 months ago

AWS ParallelCluster: the CloudFormation deployment is working; will start working on deploying it in the Service Catalog. Image creation is still pending.

AWS ParallelCluster: Packer + GitLab image creation (meeting this week, Friday Dec 1). Image creation with Brendan is to start this Friday (pending scheduling of this call, and with Nebari troubleshooting taking priority).

Presentation of AWS ParallelCluster for the HPC team: moved from last week to this week (topics: AWS ParallelCluster, fine-tuning and pre-training LLMs). This is a casual presentation of what we are doing at HyTEST with AWS ParallelCluster, plus a follow-up on "what AI can do at USGS for us."

ronald50928 commented 6 months ago
  1. Deployment of the 6-cluster plus storage configuration is not consistent. Next steps: work on storage configuration and deployment; AMI image baking is pending a call with Brendan.
  2. Researching different storage approaches: 1:1 or 1:many (shared storage).
ronald50928 commented 6 months ago

Deployment is now consistent. Storage: provide EFS as a starting point, with instructions on how to use the alternatives (Lustre, local, etc.). Review new HPC recipes from AWS:

ronald50928 commented 6 months ago

Added EFS as the storage solution, while keeping it optional so scientists can change the storage piece of the deployment.
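As a rough illustration of "EFS by default, but swappable", here is a minimal Python sketch that builds a `SharedStorage` entry in the AWS ParallelCluster v3 configuration layout. The helper name `shared_storage`, the storage names, the mount paths, and the Lustre capacity are all assumptions made for the example, not values from the actual HyTEST deployment.

```python
import json

# Settings key that accompanies each StorageType in the
# ParallelCluster v3 cluster configuration schema.
SETTINGS_KEY = {
    "Efs": "EfsSettings",
    "FsxLustre": "FsxLustreSettings",
    "Ebs": "EbsSettings",
}


def shared_storage(name, mount_dir, storage_type="Efs", settings=None):
    """Build one SharedStorage entry (hypothetical helper, EFS by default)."""
    if storage_type not in SETTINGS_KEY:
        raise ValueError(f"unsupported StorageType: {storage_type}")
    entry = {"Name": name, "MountDir": mount_dir, "StorageType": storage_type}
    if settings:
        # Only attach the type-specific settings block when one is supplied.
        entry[SETTINGS_KEY[storage_type]] = settings
    return entry


# Default: EFS mounted at /shared (name and path are placeholders).
default_storage = shared_storage("shared-efs", "/shared")

# A scientist swapping in Lustre scratch space instead (capacity is illustrative).
lustre_storage = shared_storage(
    "scratch", "/scratch", "FsxLustre", {"StorageCapacity": 1200}
)

# Printed as JSON only to avoid a third-party YAML dependency.
print(json.dumps({"SharedStorage": [default_storage]}, indent=2))
```

Keeping the storage entry behind one function means the rest of the cluster definition stays unchanged when the storage type is switched.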

Not making progress on Nebari. Will start working on completing the development and build of AWS ParallelCluster in the CHS Service Catalog (75% completed).

In the meantime, manual, properly tagged deployments (using CloudFormation) can be done to enable and support scientists.

ronald50928 commented 5 months ago

Started testing the deployment of pcluster with members of HPC-ARC (Landsat workload with Lopaka Lee).

ronald50928 commented 3 months ago

HPC-ARC, CHS, and the HTC Consulting group are now aware of the results of using pcluster with Todd Hawbacker's story. The story has moved from "we would like to do things in the cloud" to "how can we do this at scale at USGS and serve more people with this product?" I worked on a strategy/path to achieve this and provided the presentations to Janice and Al Pedraza for them to discuss.

ronald50928 commented 3 months ago

Discussed our pcluster configuration with Lopaka Lee. He advised a new queue structure (CPU and GPU) that replicates our current on-premises queues, so the cloud setup is similar to how queues are set up in USGS on-premises environments.
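For reference, a CPU/GPU queue split like the one suggested could look roughly like the sketch below, which builds the `Scheduling` section of a ParallelCluster v3 configuration in Python. The queue names, instance types, node counts, subnet ID, and the `slurm_queue` helper are placeholders assumed for illustration; they are not the on-premises queue definitions or the values used in this deployment.

```python
import json

# Placeholder subnet; a real deployment would use a subnet in the WMA account.
SUBNET_ID = "subnet-0123456789abcdef0"


def slurm_queue(name, instance_type, max_count, subnet_id=SUBNET_ID):
    """Build one SlurmQueues entry with a single compute resource (hypothetical helper)."""
    return {
        "Name": name,
        "ComputeResources": [
            {
                "Name": f"{name}-nodes",
                "InstanceType": instance_type,
                "MinCount": 0,  # scale to zero when the queue is idle
                "MaxCount": max_count,
            }
        ],
        "Networking": {"SubnetIds": [subnet_id]},
    }


scheduling = {
    "Scheduler": "slurm",
    "SlurmQueues": [
        slurm_queue("cpu", "c5.18xlarge", max_count=10),  # instance type is illustrative
        slurm_queue("gpu", "g4dn.xlarge", max_count=2),   # instance type is illustrative
    ],
}

# Printed as JSON only to avoid a third-party YAML dependency.
print(json.dumps({"Scheduling": scheduling}, indent=2))
```

With a layout like this, users would pick a queue the same way they do on-premises, for example `sbatch --partition=gpu job.sh`.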