IOOS' Coastal Modeling Cloud Sandbox provides a framework for developing, modifying and running models in the cloud. It provides repeatable configurations, model code and required libraries, input data and analysis of model outputs. The Sandbox supports not only the development of services and models, but also Cloud HPC to run and validate models.
Performance was slower than expected during a 1-year run. In a 2-day benchmark, each model day took about 38 minutes to complete, but during the longer run this slowed to about 60 minutes per model day. The run used 4 hpc6a.48xlarge instances.
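One way to spot this kind of drift early is to compare each model day's wall time against the benchmark baseline. A minimal sketch (the timings below are made-up placeholders, not our actual log data):

```python
# Sketch: flag model-day wall-time drift in a long run.
# Timings are hypothetical placeholders, not real log data.

def drifted_days(day_minutes, baseline=38.0, tolerance=1.25):
    """Return the model days whose wall time exceeds baseline * tolerance."""
    return [day for day, minutes in day_minutes if minutes > baseline * tolerance]

# Hypothetical per-model-day wall times (minutes) pulled from a run log
timings = [(1, 38.2), (2, 37.9), (100, 44.1), (200, 59.8), (300, 61.3)]

print(drifted_days(timings))  # -> [200, 300]
```

With a 38-minute baseline and a 25% tolerance, days 200 and 300 are flagged while the mild day-100 slowdown is not.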
We met with AWS during the technical meeting, Friday 4/19/2024, and they had several ideas. Here are the meeting notes:
===============================================================================
April 19, 2024
AWS folks joined us (Rayette Abdulah-Toles, Aaron Bucher, John Kolman, Austin Park, Dev Jodhrun, Matt Dowling)
Discussing the LiveOcean slowdown from ~40 min/model day to ~60 min/model day.
Rayette suggests trying an HPC7a instance
There were some questions about moving from the VM instance type to a .metal instance type - Aaron B. says not worth it
Aaron asked about I/O but our impression is that all the data was loaded at the start. Data on EFS shared disk. I/O should be stable day to day.
Do we need a second adaptor for each instance - nope, that won’t help.
AWS reviewed machines and discussed best configurations (naming convention: …i is Intel, …a is AMD):
7a does allow a second adaptor - test the 96xlarge and 48xlarge sizes
c7i, r7iz (high frequency processor) - test highest in each family
r7iz will have highest clock speed
Aaron will look at our region (east 2b) and see what better instances might be available.
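Aaron's region check could be reproduced with an AWS CLI query like the following (requires configured credentials; the instance sizes listed are assumptions based on the largest in each family discussed, not something AWS specified in the meeting):

```shell
# List which of the candidate instance types are offered in us-east-2b.
# Requires AWS CLI v2 with credentials configured.
aws ec2 describe-instance-type-offerings \
    --region us-east-2 \
    --location-type availability-zone \
    --filters "Name=location,Values=us-east-2b" \
              "Name=instance-type,Values=hpc7a.96xlarge,r7iz.32xlarge,c7i.48xlarge" \
    --query "InstanceTypeOfferings[].InstanceType" \
    --output text
```

Any type missing from the output is simply not offered in that availability zone.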
Michael screen-shared the benchmark testing spreadsheet and explained what we did.
Zach and Aaron discussed the different configurations and data transfer latencies:
EBS vs EFS burst credits
Aaron was not sure why we are seeing an abrupt slow down when processing the 1 year run:
Doesn’t make sense it would suddenly take that much longer (see above - I/O should be stable day to day)
Michael reviewed the run - would jump from about 44 to 50+ minutes
Memory to bandwidth scaling?
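The burst-credit hypothesis above would actually explain an abrupt slowdown: EFS bursting filesystems spend credits while throughput exceeds the baseline rate, and once the balance hits zero, throughput drops to baseline all at once rather than degrading gradually. A toy simulation of that mechanism (all rates, sizes, and credit amounts are illustrative placeholders, not our filesystem's real parameters):

```python
# Toy model of EFS-style burst credits: sustained above-baseline demand
# drains the credit balance, then throughput drops abruptly to baseline.
# All numbers are illustrative placeholders, not real EFS parameters.

def simulate(days, baseline_mib_s, burst_mib_s, demand_mib_s, start_credits_mib):
    """Return achieved throughput (MiB/s) for each simulated day."""
    credits = start_credits_mib
    achieved = []
    for _ in range(days):
        if demand_mib_s > baseline_mib_s and credits > 0:
            rate = min(demand_mib_s, burst_mib_s)
            # Net credit spend over one day of bursting above baseline
            credits -= (rate - baseline_mib_s) * 86_400
        else:
            rate = min(demand_mib_s, baseline_mib_s)
        credits = max(credits, 0.0)
        achieved.append(rate)
    return achieved

# Demand above baseline: full speed while credits last, then a cliff.
rates = simulate(days=10, baseline_mib_s=50, burst_mib_s=100,
                 demand_mib_s=80, start_credits_mib=10_000_000)
print(rates)  # -> [80, 80, 80, 80, 50, 50, 50, 50, 50, 50]
```

The step change in the output mirrors the run jumping from ~44 to 50+ minutes per model day, which is why checking the filesystem's BurstCreditBalance metric around the slowdown would be a useful diagnostic.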
Status of Live Ocean run
Try some new benchmarks… and do a follow up with AWS to review (see below To Do)
AWS suggestions (see details above):
Parallel processing - run a batch of jobs and see which configuration works fastest.
Try HPC7a, r7iz, c7i (see details above)
Change up the storage (EFS may not be optimal in all cases, Lustre could be an option):
Run 6a but change storage type to Lustre
Focus on testing just the largest instance type for each family
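For the Lustre suggestion, a benchmark filesystem could be provisioned with FSx for Lustre along these lines (a sketch only; subnet and security-group IDs are placeholders, and 1200 GiB is the minimum scratch capacity):

```shell
# Sketch: provision a scratch FSx for Lustre filesystem to benchmark
# against EFS. Subnet and security-group IDs are placeholders.
aws fsx create-file-system \
    --region us-east-2 \
    --file-system-type LUSTRE \
    --storage-capacity 1200 \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0 \
    --lustre-configuration DeploymentType=SCRATCH_2

# Then mount on each compute node, using the DNS name and mount name
# returned by create-file-system:
# sudo mount -t lustre <fs-dns-name>@tcp:/<mount-name> /fsx
```

Rerunning the hpc6a benchmark with input data staged on the Lustre mount instead of EFS would isolate whether storage is driving the slowdown.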
Next steps
12 years in one year chunks via Parker’s description
NODD application
Status of CORA
has the test data arrived yet?
plan for east coast test run
documentation
Status of the CO-OPS/OCS space
Possible OCS ADCIRC(?) run - convo with Saeed and Co.
Data sources
Code
To Do:
Follow up with AWS group in 1 week (Tiffany added them to the weekly working session, so they can join whenever!)
===============================================================================