ioos / Cloud-Sandbox

IOOS' Coastal Modeling Cloud Sandbox provides a framework for developing, modifying and running models in the cloud. It provides repeatable configurations, model code and required libraries, input data and analysis of model outputs. The Sandbox supports not only the development of services and models, but also Cloud HPC to run and validate models.
https://www-sandbox.ioos.us/
BSD 3-Clause "New" or "Revised" License

Conduct a model run review and brainstorm LiveOcean performance improvements #82

Open Michael-Lalime opened 7 months ago

Michael-Lalime commented 7 months ago

Performance was slower than expected during a 1-year run. The expectation came from a 2-day run in which each model day took about 38 minutes to complete; during the longer run this slowed to about 60 minutes per model day. I was running on four hpc6a.48xlarge instances.
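To put the two paces in perspective, here is a rough projection of what they mean over a full year. It is only a sketch: it assumes a 365-day model year, a constant pace for the whole run, and the four nodes mentioned above. The function name `project_run` is made up for illustration.

```python
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24


def project_run(minutes_per_model_day, model_days=365, nodes=4):
    """Return (wall_clock_days, node_hours) for a run at a constant pace.

    Assumes every model day takes the same wall-clock time, which the
    1-year run suggests is not strictly true.
    """
    wall_hours = minutes_per_model_day * model_days / MINUTES_PER_HOUR
    return wall_hours / HOURS_PER_DAY, wall_hours * nodes


expected_days, expected_node_hours = project_run(38)
observed_days, observed_node_hours = project_run(60)

print(f"expected: {expected_days:.1f} days, {expected_node_hours:.0f} node-hours")
# expected: 9.6 days, 925 node-hours
print(f"observed: {observed_days:.1f} days, {observed_node_hours:.0f} node-hours")
# observed: 15.2 days, 1460 node-hours
print(f"slowdown: {60 / 38:.2f}x")
# slowdown: 1.58x
```

At these paces the slowdown adds roughly five and a half days of wall-clock time to a 1-year run.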

We met with AWS during the technical meeting, Friday 4/19/2024, and they had several ideas. Here are the meeting notes:

===============================================================================
April 19, 2024

AWS folks joined us (Rayette Abdulah-Toles, Aaron Bucher, John Kolman, Austin Park, Dev Jodhrun, Matt Dowling)

Discussing the LiveOcean slowdown, 40 min/day to 60 min/day:
- Rayette suggests trying an HPC7a instance.
- There were some questions about moving from the VM instance type to a .metal instance type - Aaron B. says not worth it.
- Aaron asked about I/O, but our impression is that all the data was loaded at the start. Data is on an EFS shared disk. I/O should be stable day to day.
- Do we need a second adaptor for each instance? Nope, that won't help.

AWS reviewed machines and discussed best configurations (naming conventions: ...i is Intel, ...a is AMD):
- 7a does allow a second adaptor - test 96 and 48xlarge types.
- c7i, r7iz (high-frequency processor) - test the highest in each family. r7iz will have the highest clock speed.
- Aaron will look at our region (east 2b) and see what better instances might be available.

Michael screen-shared the benchmark testing spreadsheet and explained what we did.

Zach and Aaron discussed the different configurations and data transfer latencies:
- EBS vs EFS burst credits
- Aaron was not sure why we are seeing an abrupt slowdown when processing the 1-year run. It doesn't make sense that it would suddenly take that much longer (see above - I/O should be stable day to day).
- Michael reviewed the run - it would jump from about 44 to 50+ minutes.
- Memory to bandwidth scaling?

Status of LiveOcean run:
- Try some new benchmarks and do a follow-up with AWS to review (see To Do below).

AWS suggestions (see details above):
- Parallel processing - run a bunch of jobs and see what works fastest.
- Try HPC7a, r7iz, c7i (see details above).
- Change up the storage (EFS may not be optimal in all cases; Lustre could be an option): run 6a but change the storage type to Lustre.
- Focus on testing just the largest instance type for each family.

Next steps:
- 12 years in one-year chunks, per Parker's description
- NODD application
- Status of CORA test data (arrived yet?)
- Plan for east coast test run
- Documentation
- Status of the CO-OPS/OCS space
- Possible OCS ?ADCIRC? run - convo with Saeed and Co.
- Data sources
- Code
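For the storage suggestion, a quick first check before a full model benchmark could be a plain sequential-write test run against each mount. The sketch below is an assumption about how one might do that, not something discussed in the meeting: `write_throughput_mib_s` is a made-up helper, the system temp directory stands in for the real EFS or Lustre mount points, and a single-client sequential write does not reproduce the model's actual I/O pattern.

```python
import os
import tempfile
import time


def write_throughput_mib_s(directory, size_mib=64, chunk_mib=4):
    """Write size_mib of random data to a temp file in directory; return MiB/s.

    The file is fsync'ed before the timer stops so that buffered data
    still in the page cache is not counted as written.
    """
    chunk = os.urandom(chunk_mib * 1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(size_mib // chunk_mib):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return size_mib / elapsed


if __name__ == "__main__":
    # Replace with the directories where each file system is mounted.
    for mount in [tempfile.gettempdir()]:
        rate = write_throughput_mib_s(mount)
        print(f"{mount}: {rate:.1f} MiB/s")
```

Running it several times over a longer period on EFS might also show whether throughput drops once burst credits run out, which was one of the open questions.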

To Do: Follow up with AWS group in 1 week (Tiffany added them to the weekly working session, so they can join whenever!)

===============================================================================
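Since nobody in the meeting could explain the abrupt slowdown, it may help to pin down the exact model day where it starts and line that up with instance and storage metrics. A minimal sketch, assuming per-day wall-clock timings have already been pulled from the run logs into a list; the helper name, the thresholds, and the sample numbers are all illustrative, not measurements from the run.

```python
from statistics import median


def first_slow_day(minutes_per_day, baseline_days=5, factor=1.15):
    """Return the 0-based index of the first model day that is slower than
    factor * median(first baseline_days timings), or None if there is none."""
    if len(minutes_per_day) <= baseline_days:
        return None
    baseline = median(minutes_per_day[:baseline_days])
    for day, minutes in enumerate(minutes_per_day):
        if day >= baseline_days and minutes > factor * baseline:
            return day
    return None


# Illustrative timings only, not measurements from the run.
timings = [44, 43, 45, 44, 44, 45, 44, 52, 55, 58, 60]
print(first_slow_day(timings))  # 7
```

Using the median of the first few days as the baseline keeps one unusually slow startup day from skewing the threshold.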