oceanmodeling / ondemand-storm-workflow


Add support for SCHISM wave model #7

Closed SorooshMani-NOAA closed 1 year ago

SorooshMani-NOAA commented 1 year ago

Duplicating in correct repo:

The SCHISM model is supposed to work with both sflux and parametric wind models. It would be useful to have the ability to set it up from here while we're waiting for SCHISM NEMS coupling to be fully implemented.

and

We need the following minimum to get the model running with WWM:

  • hgrid_WWM.gr3: since there are no quads in our mesh, this is identical to hgrid.gr3
  • wwmbnd.gr3: strictly speaking we should use WW3 for initial and boundary conditions, but I'm cheating here with no wave inputs at the boundary. It works fine because the boundary is far from the domain of interest
  • wwminput.nml: the main WWM input file
  • param.nml: updated for the WWM coupling

Since we need to move to fully coupled WW3 sometime soon, we'll add the minimum required to get this working.
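For the first two files, generation can be sketched in a few lines of Python. This is a minimal sketch (the helper name `make_wwm_inputs` is hypothetical), assuming the standard gr3 layout of a comment line, an `ne np` header, node records, then element connectivity, with a 0.0 node value in wwmbnd.gr3 standing for "no wave input at the boundary", per the cheat above:

```python
import shutil

def make_wwm_inputs(hgrid="hgrid.gr3"):
    # hgrid_WWM.gr3 is identical to hgrid.gr3 when the mesh has no quads
    shutil.copy(hgrid, "hgrid_WWM.gr3")

    # wwmbnd.gr3 reuses the mesh but replaces node depths with a wave
    # boundary flag; 0.0 everywhere means no wave input at the boundary
    with open(hgrid) as fin, open("wwmbnd.gr3", "w") as fout:
        fout.write(fin.readline())              # comment line
        header = fin.readline()
        fout.write(header)
        ne, np_ = (int(v) for v in header.split()[:2])
        for _ in range(np_):
            nid, x, y = fin.readline().split()[:3]
            fout.write(f"{nid} {x} {y} 0.0\n")  # zero flag: no wave b.c.
        for line in fin:                        # element connectivity, verbatim
            fout.write(line)
```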

SorooshMani-NOAA commented 1 year ago

@josephzhang8 is it OK to use SCHISM compiled with USE_WWM for a case where a WWM setup does not exist?

In my ensemble I run one spinup run without any WWM setup, and then run all the ensemble members hotstarted from that single "spinup" run. All of the following members have a WWM setup. I was wondering if I'd need two separately compiled versions of SCHISM: one with USE_WWM enabled and one without it?

In my attempt, my spinup run failed with the following message:

Abort(67767564) on node 143 (rank 143 in comm 0): Fatal error in PMPI_Type_create_indexed_block: Invalid argument, error stack:
PMPI_Type_create_indexed_block(175): MPI_Type_create_indexed_block(count=17, blocklength=-2147483647, array_of_displacements=0x4741860, dtype=0x4c000829, newtype=0x664fc40) failed
PMPI_Type_create_indexed_block(138): Invalid value for blocklen, must be non-negative but is -2147483647

for each core. I thought it might have something to do with not having WWM parameters set in this spinup run.

I'd really appreciate any suggestions you might have for debugging this. Thanks!

SorooshMani-NOAA commented 1 year ago

This was discussed with Joseph in a meeting. The issue is indeed caused by using binaries compiled with USE_WWM for a non-WWM setup. Separate sets of binaries need to be used for cases with and without WWM.
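Given that, the workflow could select the executable per run type. A hypothetical helper sketch (the binary names below are illustrative, modeled on the truncated `pschism_WWM_PAHM_` name in the logs; actual names depend on the local build configuration):

```python
def schism_binary(has_wwm_setup, bindir="/opt/schism/bin"):
    """Pick a SCHISM executable: spinup runs without a WWM setup must use
    a build compiled without USE_WWM, while hotstarted members with WWM
    inputs use the USE_WWM build. Names here are illustrative."""
    name = "pschism_WWM_PAHM_TVD-VL" if has_wwm_setup else "pschism_PAHM_TVD-VL"
    return f"{bindir}/{name}"
```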

SorooshMani-NOAA commented 1 year ago

~After fixing the issue above, running a single deterministic run shows that the runtime is significantly affected by the addition of WWM. Two similar runs for Florence 2018 were started; one (without WWM) was done in about 3 hours, while the other is still going after about 8 hours!~

These runs were run in Docker on AWS EC2.

Updated: the run was stuck on ECS (Docker image compiled using GNU); the same setup resulted in an error on PW, which was compiled using Intel.

SorooshMani-NOAA commented 1 year ago

~From an implementation point of view this ticket is done (still waiting for the run), but from a practicality point of view, is it worth it to run WWM?~

Updated: the run was stuck, not actually consuming any compute time!

SorooshMani-NOAA commented 1 year ago

The latest is that, with the update, non-WWM simulations work fine, but the PW-compiled version results in:

forrtl: severe (19): invalid reference to variable in NAMELIST input, unit 15, file /lustre/hurricanes/florence_2018_8dbc189c-1180-4fdd-96ee-3b3d5e990254/setup/schism.dir/.//param.nml, line 33, position 15
Image              PC                Routine            Line        Source             
pschism_WWM_PAHM_  0000000000AB92A8  for__io_return        Unknown  Unknown
pschism_WWM_PAHM_  0000000000AF5CB7  for_read_seq_nml      Unknown  Unknown
pschism_WWM_PAHM_  0000000000464F55  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000041466F  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  00000000004144A2  Unknown               Unknown  Unknown
libc-2.17.so       00002B323E648555  __libc_start_main     Unknown  Unknown
pschism_WWM_PAHM_  00000000004143A9  Unknown               Unknown  Unknown

and the ECS version is still running after 20 hours!

SorooshMani-NOAA commented 1 year ago

Actually, looking at the output of the ECS run, it hasn't started cycling yet. Something's wrong there as well.

SorooshMani-NOAA commented 1 year ago

The issue seems to be related to nrampwafo in param.nml. I don't find it being used in the source code (it is commented out), so maybe we should remove it from param.nml.
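Stripping such a parameter can be scripted. A sketch (the helper name `drop_nml_param` is hypothetical), assuming the usual one-assignment-per-line `name = value` format of param.nml:

```python
import re

def drop_nml_param(path, name):
    # Remove any line assigning `name` in a Fortran namelist file,
    # e.g. the apparently unused nrampwafo in param.nml
    with open(path) as f:
        lines = f.readlines()
    keep = [ln for ln in lines
            if not re.match(rf"\s*{name}\s*=", ln, re.IGNORECASE)]
    with open(path, "w") as f:
        f.writelines(keep)
```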

SorooshMani-NOAA commented 1 year ago

After removing nrampwafo I get the following error:

   ncVarNam_WndY        = vwnd

   modelType            = *
---------- MODEL PARAMETERS ----------

  22: ABORT: THERE IS AND ERROR IN MSC2 OR MDC2 IN PARAM.IN
  17: ABORT: THERE IS AND ERROR IN MSC2 OR MDC2 IN PARAM.IN
  24: ABORT: THERE IS AND ERROR IN MSC2 OR MDC2 IN PARAM.IN
Abort(0) on node 22 (rank 22 in comm 496): application called MPI_Abort(comm=0x84000002, 0) - process 22
Abort(0) on node 17 (rank 17 in comm 496): application called MPI_Abort(comm=0x84000002, 0) - process 17
Abort(0) on node 24 (rank 24 in comm 496): application called MPI_Abort(comm=0x84000002, 0) - process 24
  21: ABORT: THERE IS AND ERROR IN MSC2 OR MDC2 IN PARAM.IN
Abort(0) on node 21 (rank 21 in comm 496): application called MPI_Abort(comm=0x84000002, 0) - process 21
josephzhang8 commented 1 year ago

Make sure M[SD]C2 in param.nml match those in wwminput.nml (m[sd]c)

-Joseph

Y. Joseph Zhang Web: schism.wiki Office: 804 684 7466

SorooshMani-NOAA commented 1 year ago

Thanks @josephzhang8! Yes, I realized I mixed up MSC and MDC when assigning them in the two nml files. Right now everything seems to be working fine. I'm waiting for the runs to go through, and will then update/close this ticket.
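A small cross-check could catch this kind of mismatch before submitting a run. A sketch (helper names are hypothetical), assuming simple `name = value` namelist entries and the MSC2/MSC, MDC2/MDC pairing from Joseph's note above:

```python
import re

def read_nml_int(path, name):
    # Find `name = <integer>` in a namelist file (case-insensitive)
    m = re.search(rf"^\s*{name}\s*=\s*(\d+)", open(path).read(),
                  re.IGNORECASE | re.MULTILINE)
    if m is None:
        raise KeyError(f"{name} not found in {path}")
    return int(m.group(1))

def check_spectral_dims(param="param.nml", wwminput="wwminput.nml"):
    # MSC2/MDC2 in param.nml must equal MSC/MDC in wwminput.nml,
    # otherwise SCHISM aborts with "ERROR IN MSC2 OR MDC2 IN PARAM.IN"
    for p_name, w_name in (("msc2", "msc"), ("mdc2", "mdc")):
        p_val = read_nml_int(param, p_name)
        w_val = read_nml_int(wwminput, w_name)
        assert p_val == w_val, (
            f"{p_name}={p_val} in {param} != {w_name}={w_val} in {wwminput}")
```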

SorooshMani-NOAA commented 1 year ago

The workflow on PW fails with an error about not finding wwminput.nml; this needs to be investigated further. The setup is exactly the same as the one for ECS, so maybe some files are not copied to PW correctly(?)

forrtl: severe (30): open failure, unit 50006, file /lustre/hurricanes/florence_2018_3ca099f8-e0e7-41bb-807d-ae51057a7281/setup/schism.dir/wwmcheck.nml
Image              PC                Routine            Line        Source
pschism_WWM_PAHM_  0000000000AB92A8  for__io_return        Unknown  Unknown
pschism_WWM_PAHM_  0000000000AD93E2  for_open              Unknown  Unknown
pschism_WWM_PAHM_  00000000005FD075  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  0000000000603BCA  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  00000000005B4357  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000049236B  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000041466F  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  00000000004144A2  Unknown               Unknown  Unknown
libc-2.17.so       00002AFBF6BFA555  __libc_start_main     Unknown  Unknown
pschism_WWM_PAHM_  00000000004143A9  Unknown               Unknown  Unknown
...
   5: ABORT: Missing grid file : hgrid_WWM.gr3
...
forrtl: severe (30): open failure, unit 50010, file /lustre/hurricanes/florence_2018_3ca099f8-e0e7-41bb-807d-ae51057a7281/setup/schism.dir/hgrid_WWM.gr3
Image              PC                Routine            Line        Source
pschism_WWM_PAHM_  0000000000AB92A8  for__io_return        Unknown  Unknown
pschism_WWM_PAHM_  0000000000AD93E2  for_open              Unknown  Unknown
pschism_WWM_PAHM_  000000000062C9DC  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000062F2E1  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  0000000000603BF2  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  00000000005B4357  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000049236B  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000041466F  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  00000000004144A2  Unknown               Unknown  Unknown
libc-2.17.so       00002AB362BB4555  __libc_start_main     Unknown  Unknown
pschism_WWM_PAHM_  00000000004143A9  Unknown               Unknown  Unknown
...
forrtl: severe (38): error during write, unit 11, file /lustre/hurricanes/florence_2018_3ca099f8-e0e7-41bb-807d-ae51057a7281/setup/schism.dir/.//outputs/fatal.error
Image              PC                Routine            Line        Source
pschism_WWM_PAHM_  0000000000AB92A8  for__io_return        Unknown  Unknown
pschism_WWM_PAHM_  0000000000B18D05  for_write_seq_fmt     Unknown  Unknown
pschism_WWM_PAHM_  00000000004502D6  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000078E712  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000062A4A3  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000062F2E1  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  0000000000603BF2  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  00000000005B4357  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000049236B  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  000000000041466F  Unknown               Unknown  Unknown
pschism_WWM_PAHM_  00000000004144A2  Unknown               Unknown  Unknown
libc-2.17.so       00002BA405238555  __libc_start_main     Unknown  Unknown
pschism_WWM_PAHM_  00000000004143A9  Unknown               Unknown  Unknown
   0: ABORT: Missing input file : wwminput.nml
Abort(0) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000003, 0) - process 0

The workflow on ECS times out (workflow task on ECS has a 4-hour limit) at

TIME STEP=         7659;  TIME=       1148850.000000

the same simulation without WWM takes about 1.5 hours to complete 13536 timesteps.

TIME STEP=        13536;  TIME=       2030400.000000

Run completed successfully at 20230323, 171556.921
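Given the PW failure mode, a pre-flight check before the sbatch submission might fail faster and more clearly. A sketch (the helper name is hypothetical), with the file list taken from the abort messages above and the minimum WWM input list earlier in this issue:

```python
from pathlib import Path

REQUIRED_WWM_FILES = ("wwminput.nml", "hgrid_WWM.gr3", "wwmbnd.gr3")

def check_wwm_inputs(rundir):
    # Fail fast if any WWM input is missing from the run directory,
    # instead of letting SCHISM abort mid-startup on the compute nodes
    missing = [f for f in REQUIRED_WWM_FILES
               if not (Path(rundir) / f).is_file()]
    if missing:
        raise FileNotFoundError(f"WWM inputs missing in {rundir}: {missing}")
```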

SorooshMani-NOAA commented 1 year ago

... maybe some files are not copied to PW correctly(?)

Running SCHISM manually works fine and starts the time-stepping. That means the issue must be in how the sbatch process gets started from the workflow. The strange thing is that this process is the same for the non-WWM runs, and that works fine (tested).

josephzhang8 commented 1 year ago

@SorooshMani-NOAA

I've tested the new master with model_type_pahm specified in param.nml

-Joseph

Y. Joseph Zhang Web: schism.wiki Office: 804 684 7466

SorooshMani-NOAA commented 1 year ago

@josephzhang8 Thank you for the update.

SorooshMani-NOAA commented 1 year ago

Strangely another deterministic run in PW was successful (without any changes to scripts!). Timewise, the WWM takes about 3 hours compared to the same non-WWM setup that takes 40min on PW platform. The next step is to run an ensemble. Note the shorter times are because of past-forecast simulation (48hr to landfall) compared to the full best track which takes ~2.5 hours without WWM