ufs-community / ufs-weather-model

UFS Weather Model

AQM EBI Euler convergence failure with CMAQ5.4 for the AQM_NA_9km grids (much longer runtime with the UFSWM) #2386

Open JianpingHuang-NOAA opened 3 months ago

JianpingHuang-NOAA commented 3 months ago

Description

I completed several tests of the AQM updated to CMAQv5.4, using both the aqm_dev workflow and the PRODUCTION/AQMV7-based workflow for the AQM_NA_9km grid, and noticed EBI solver convergence failures that led to a much longer runtime for the UFSWM forecast job. I did not see a similar issue when I tested with CMAQ5.2.1. Reducing DT_ATM from 150 s to 120 s did not help.

To Reproduce:

Please check out the package from my personal GitHub account (Here)

PRODUCTION/AQMV7 based AQMv8.0 package (Cactus)

/lfs/h2/emc/physics/noscrub/jianping.huang/nwdev/packages/aqm.v8.0.0c

or aqm_dev based workflow (Cactus)

/lfs/h2/emc/physics/noscrub/jianping.huang/nwdev/packages/aqm.v7.2.0c

Output

output logs

run_dir and log files can be found on Cactus at

/lfs/h2/emc/physics/noscrub/jianping.huang/data/run_fcst.id_1722610883_2023070100_v720c

ytangnoaa commented 3 months ago

The EBI solver message is only a warning, and it also appeared in the 13-km run and/or with CMAQ 5.2. The failure appears to be caused by:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
nid001252.cactus.wcoss2.ncep.noaa.gov: rank 4259 exited with code 174
forrtl: error (78): process killed (SIGTERM)

You may try increasing the memory allocation to see whether it makes a difference.
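To separate fatal errors like the ones quoted above from solver warnings when scanning a long forecast log, a small filter can help. The following is a minimal sketch, assuming the log lines follow the same format as the excerpt in this comment; the function name `summarize_fatal_lines` and both regular expressions are illustrative and not part of any UFS or AQM tool.

```python
import re

# Patterns modeled on the log excerpt quoted in this thread (assumption:
# other compilers or MPI launchers may format these lines differently).
SEVERE_RE = re.compile(r"forrtl: severe \((\d+)\): (.+)")
RANK_EXIT_RE = re.compile(r"(\S+): rank (\d+) exited with code (\d+)")


def summarize_fatal_lines(lines):
    """Return (severe_errors, rank_exits) found in an iterable of log lines."""
    severe, exits = [], []
    for line in lines:
        m = SEVERE_RE.search(line)
        if m:
            # (error number, message text)
            severe.append((int(m.group(1)), m.group(2).strip()))
        m = RANK_EXIT_RE.search(line)
        if m:
            # (host name, MPI rank, exit code)
            exits.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return severe, exits


sample = [
    "forrtl: severe (174): SIGSEGV, segmentation fault occurred",
    "nid001252.cactus.wcoss2.ncep.noaa.gov: rank 4259 exited with code 174",
    "forrtl: error (78): process killed (SIGTERM)",
]
severe, exits = summarize_fatal_lines(sample)
print(severe)
print(exits)
```

The "process killed (SIGTERM)" line is deliberately not counted as severe here, since it is typically a consequence of another rank failing first.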

JianpingHuang-NOAA commented 3 months ago

@ytangnoaa Thanks for looking into this. The segmentation fault issue appears to be associated with the aqm_dev-based workflow (i.e., aqm.v7.2.0c package). I did not encounter this issue when using the PRODUCTION/AQMv7-based workflow (i.e., aqm.v8.0.0c package).

Please see another test conducted with the AQMv8.0 package for the 9-km domain, located at /lfs/h2/emc/physics/noscrub/jianping.huang/data/run_fcst.146552693.cbqs01_v800c (this is a warm start, Cactus).

ytangnoaa commented 3 months ago

Thank you for this information. Since changing the workflow resolves the segmentation fault, that stop should not be caused by the AQM code. The new workflow also seems to improve the timing issue, which is great.

BrianCurtis-NOAA commented 3 months ago

@JianpingHuang-NOAA Is this issue then not in the ufs-weather-model but in the ufs-srweather-app aqm_dev branch?

JianpingHuang-NOAA commented 3 months ago

@BrianCurtis-NOAA There are two issues that Youhua mentioned: one related to the segmentation fault and the other concerning the EBI Euler convergence failure. I did not see the first issue when I tested the PRODUCTION/AQMv7-based AQMv8.0 package (I am not sure why), but the second one remains.

drnimbusrain commented 3 months ago

@JianpingHuang-NOAA @ytangnoaa

The EBI solver issue can be caused by many different problems with the input variables to the solver. As Youhua mentioned, this EBI solver issue appears in both AQMv7 with CMAQv5.2 (cb6r3ae6) and our updated AQMv8 with CMAQv5.4 (cb6r5ae7). I fear that this has been an issue in AQM all along when coupling with the UFS met/land variables from FV3GFSv16.

To identify the issue, we would need to look closely and diagnose whether any met/land input variables are erroneous. For example, I found worse EBI solver issues (not just a warning, but a full crash) when switching to GFSv17 physics, which stemmed from erroneously large LAI values coming from the Noah-MP LSM at glacier points. I had to put a temporary local fix into my testing version of AQM to get past this.

I wonder if there are strange LAI (or other met/land input) values coming from the GFSv16 physics into UFS-AQM that are causing this issue. I also think the current resistances (like the stomatal/canopy resistances, "RCA") coming from GFSv16 into UFS-AQM are wrong as well (likely contributing to other performance issues, e.g., ozone overprediction), and this could be exacerbating the issue in AQM with the updated CMAQv5.4. Maybe this makes the EBI solver issue worse in AQM with CMAQv5.4, but again, as Youhua pointed out, it's happening in both versions.

So, it's possible that the EBI solver issue (already present in AQMv7/CMAQv5.2) is made worse by the numerous air-sfc-x and chemistry changes in the updated CMAQv5.4.
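The kind of input screening described in this comment (for example, looking for erroneously large LAI values) could start with a simple range check. This is a minimal sketch only: the bounds in `PLAUSIBLE_RANGES` and the function `find_suspect_points` are hypothetical, and the sample values are synthetic placeholders rather than model output.

```python
import math

# Hypothetical plausibility bounds (assumption): real limits depend on the
# land-surface model, the variable definition, and its units.
PLAUSIBLE_RANGES = {
    "LAI": (0.0, 10.0),  # leaf area index, m2/m2; upper bound is assumed
}


def find_suspect_points(name, values, ranges=PLAUSIBLE_RANGES):
    """Return indices whose value is non-finite or outside the assumed range."""
    lo, hi = ranges[name]
    return [
        i
        for i, v in enumerate(values)
        if not math.isfinite(v) or v < lo or v > hi
    ]


# Synthetic example values, including one huge value and one NaN.
lai = [0.5, 3.2, 1.0e20, 0.0, float("nan")]
suspect = find_suspect_points("LAI", lai)
print(suspect)
```

Listing the flagged indices alongside the land-use type at each point would show whether the outliers cluster at particular surface types, such as the glacier points mentioned above.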

ytangnoaa commented 3 months ago

@JianpingHuang-NOAA FYI, Wei Li tested the Rosenbrock solver (ROS3) (https://ir.cwi.nl/pub/10743/10743D.pdf), which shows much more consistent timing without the convergence issue and can reduce the overall runtime. ROS3 is supposed to be more accurate, too. Wei Li will show how to activate the ROS3 solver by changing the cmake file.
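One reason a Rosenbrock-type solver can show more consistent timing is that it replaces the nonlinear iteration of an implicit step with a linear solve, so there is no iteration that can fail to converge. The toy sketch below illustrates that idea on a single made-up stiff reaction. It is not CMAQ's EBI algorithm and not the three-stage ROS3 scheme: it uses plain fixed-point iteration for backward Euler and the one-stage linearly implicit Euler method, the simplest Rosenbrock-type scheme. The rate constants `K` and `P` are hypothetical.

```python
import math

K, P = 100.0, 1.0  # hypothetical loss and production rates for a toy species


def f(y):
    """Toy chemistry tendency: dy/dt = P - K*y^2."""
    return P - K * y * y


def jac(y):
    """Derivative of f with respect to y."""
    return -2.0 * K * y


def backward_euler_fixed_point(y0, h, tol=1e-10, max_iter=50):
    """Solve y1 = y0 + h*f(y1) by fixed-point iteration.

    Returns (y1, converged). The iteration can diverge when h is large
    relative to the chemical time scale.
    """
    y = y0
    for _ in range(max_iter):
        y_new = y0 + h * f(y)
        if not math.isfinite(y_new):
            return y_new, False  # iterate blew up
        if abs(y_new - y) < tol:
            return y_new, True
        y = y_new
    return y, False


def linearly_implicit_euler(y0, h):
    """One-stage Rosenbrock-type step: a single linear solve, no iteration."""
    return y0 + h * f(y0) / (1.0 - h * jac(y0))


y_small, ok_small = backward_euler_fixed_point(1.0, 0.001)
y_large, ok_large = backward_euler_fixed_point(1.0, 0.1)
y_ros = linearly_implicit_euler(1.0, 0.1)
print(ok_small, ok_large, y_ros)
```

In this toy case the fixed-point iteration converges for the small step but diverges for the large one, while the linearized step returns a bounded, positive value at the same large step size. Whether the same holds for the full mechanism is exactly what the longer tests discussed in this thread would need to confirm.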

JianpingHuang-NOAA commented 3 months ago

@ytangnoaa @drnimbusrain Thanks for your suggestion and information. It takes about 3 hours for the AQM8.0 package with the latest dev branch of UFSWM and CMAQ5.4 to complete 72-hr simulations for the AQM_NA_9km domain when I use 72 nodes.
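For comparing runs with different forecast lengths or node counts, it can help to normalize the timing quoted above. The snippet below only restates the numbers from this comment (about 3 hours, 72 forecast hours, 72 nodes) as a per-forecast-hour cost; the variable names are illustrative, and the wall-clock figure is approximate.

```python
wall_clock_hours = 3.0   # "about 3 hours" from the comment above (approximate)
forecast_hours = 72      # length of the simulation
nodes = 72               # nodes used for the forecast job

# Wall-clock minutes spent per simulated forecast hour.
minutes_per_forecast_hour = wall_clock_hours * 60.0 / forecast_hours

# Total resource cost of the run in node-hours.
node_hours = wall_clock_hours * nodes

print(f"{minutes_per_forecast_hour:.1f} min per forecast hour")
print(f"{node_hours:.0f} node-hours for the full run")
```

This works out to roughly 2.5 minutes per forecast hour and 216 node-hours for this configuration.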

drnimbusrain commented 3 months ago

@JianpingHuang-NOAA What about the 13-km domain with ROS3 solver?

JianpingHuang-NOAA commented 3 months ago

@drnimbusrain I am still using the EBI solver not Rosenbrock solver

drnimbusrain commented 3 months ago

@JianpingHuang-NOAA We recommend switching to the ROS3 solver, as it seems more stable and faster based on Wei's initial test. We still need to see the prediction results, though.

JianpingHuang-NOAA commented 3 months ago

I thought that the ROS3 solver is more accurate but slower. How fast is the ROS3 solver compared with EBI for the 13-km domain?

drnimbusrain commented 3 months ago

@JianpingHuang-NOAA Please see Youhua's response: "FYI, Wei Li tested the Rosenbrock solver (ROS3) (https://ir.cwi.nl/pub/10743/10743D.pdf), which shows much more consistent timing without the convergence issue and can reduce the overall runtime. ROS3 is supposed to be more accurate, too."

Wei will report his initial test here, and let you know how you can run longer tests to evaluate.

lwcugb commented 3 months ago

Hi Jianping,

It takes about 2 minutes to run one hour of simulation with the ROS3 solver, while the EBI solver seems to take much longer. I did a one-day test using both solvers with a cold start on Hera. The output paths are:

/scratch1/NCEPDEV/stmp4/Wei.K.Li/expt_dirs/aqm_dev_cmaq54_ros3/ # ROS3
/scratch1/NCEPDEV/stmp4/Wei.K.Li/expt_dirs/aqm_dev_cmaq54/ # EBI

I only see very small differences in O3 and PM2.5 between these two versions, although a longer simulation is necessary to confirm this.

It is easy to switch to the ROS3 solver. Only two files need to be modified. See the file changes here: https://github.com/noaa-oar-arl/AQM/compare/feature/cmaq54...lwcugb:AQM:feature/camq54_ros3?expand=1. I created a branch in my GitHub for this update (https://github.com/lwcugb/AQM/tree/feature/camq54_ros3). You can check it out to run a longer simulation.
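When a longer simulation is run, the O3 and PM2.5 differences between the two solver versions could be summarized with a few standard statistics. This is a minimal sketch: `compare_fields` is a hypothetical helper, the inputs are assumed to be flattened concentration values at matching grid points and times, and the numbers below are synthetic placeholders, not results from either run.

```python
import math


def compare_fields(a, b):
    """Return mean bias, RMSE, and max absolute difference of a minus b."""
    if len(a) != len(b) or not a:
        raise ValueError("fields must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return {
        "mean_bias": sum(diffs) / n,
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        "max_abs_diff": max(abs(d) for d in diffs),
    }


# Synthetic placeholder values (not model output), e.g. O3 in ppb.
ros3_o3 = [40.0, 42.0, 38.0]
ebi_o3 = [40.5, 41.5, 38.0]
stats = compare_fields(ros3_o3, ebi_o3)
print(stats)
```

Reporting the maximum absolute difference alongside the mean bias helps catch localized disagreements that a domain average would hide.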

drnimbusrain commented 3 months ago

Great work Wei!! While I am still concerned with input met/land in UFS-AQM potentially causing these EBI solver issues, moving to the ROS3 solver seems like a good path forward!

JianpingHuang-NOAA commented 3 months ago

Thanks.

Jianping
