oceanmodeling / WW3

WAVEWATCH III

WW3 MPI issue on Hercules #8

Closed. yunfangsun closed this issue 1 month ago.

yunfangsun commented 5 months ago

Hi @aliabdolali @Sbanihash,

I encountered an MPI error when running WW3 on Hercules.

Previously, with WW3 at version 520900e, the model ran smoothly on both Hera and Hercules.

Now, using exactly the same settings but with WW3 updated to 8b5e91f, an MPI error occurs on Hercules.

The ww3_shel.out shows:

                     *** WAVEWATCH III Program shell ***
               ===============================================

  Comment character is '$'

  Input fields :
 --------------------------------------------------
       water levels   ---/NO
       currents       ---/NO
       winds          YES/--
       ice fields     ---/NO
       momentum       ---/NO
       air density    ---/NO
       mean param.    ---/NO
       1D spectra     ---/NO
       2D spectra     ---/NO

EXTCDE MPI_ABORT, IEXIT=    52
EXTCDE MPI_ABORT, IEXIT=    52
EXTCDE MPI_ABORT, IEXIT=    52
EXTCDE MPI_ABORT, IEXIT=    52
EXTCDE MPI_ABORT, IEXIT=    52
EXTCDE MPI_ABORT, IEXIT=    52
EXTCDE MPI_ABORT, IEXIT=    52
EXTCDE MPI_ABORT, IEXIT=    52
EXTCDE MPI_ABORT, IEXIT=    52

The error log shows:

Currently Loaded Modules:
  1) intel-oneapi-compilers/2023.1.0  10) c-blosc/1.21.4
  2) stack-intel/2021.9.0             11) hdf5/1.14.0
  3) intel-oneapi-mpi/2021.9.0        12) netcdf-c/4.9.2
  4) stack-intel-oneapi-mpi/2021.9.0  13) netcdf-fortran/4.6.0
  5) zlib/1.2.13                      14) parallel-netcdf/1.12.2
  6) curl/8.1.2                       15) parallelio/2.5.10
  7) cmake/3.23.1                     16) esmf/8.4.2
  8) snappy/1.1.10                    17) scotch/7.0.4
  9) zstd/1.5.2

Abort(52) on node 2312 (rank 2312 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2312
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Abort(52) on node 2316 (rank 2316 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2316
Abort(52) on node 2308 (rank 2308 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2308
Abort(52) on node 2310 (rank 2310 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2310
Abort(52) on node 2314 (rank 2314 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2314
Abort(52) on node 2318 (rank 2318 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2318
Abort(52) on node 2320 (rank 2320 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2320
Abort(52) on node 2322 (rank 2322 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2322
Abort(52) on node 2324 (rank 2324 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2324
Abort(52) on node 2326 (rank 2326 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2326
Abort(52) on node 2328 (rank 2328 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2328
Abort(52) on node 2330 (rank 2330 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2330
Abort(52) on node 2332 (rank 2332 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2332
Abort(52) on node 2334 (rank 2334 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2334
Abort(52) on node 2336 (rank 2336 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2336
Abort(52) on node 2338 (rank 2338 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2338
Abort(52) on node 2309 (rank 2309 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2309
Abort(52) on node 2311 (rank 2311 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2311
Abort(52) on node 2313 (rank 2313 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2313
Abort(52) on node 2315 (rank 2315 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2315
Abort(52) on node 2317 (rank 2317 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2317
Abort(52) on node 2319 (rank 2319 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2319
Abort(52) on node 2321 (rank 2321 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2321
Abort(52) on node 2323 (rank 2323 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2323
Abort(52) on node 2325 (rank 2325 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2325
Abort(52) on node 2327 (rank 2327 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2327
Abort(52) on node 2329 (rank 2329 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2329
Abort(52) on node 2331 (rank 2331 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2331
Abort(52) on node 2333 (rank 2333 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2333
Abort(52) on node 2335 (rank 2335 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2335
Abort(52) on node 2339 (rank 2339 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2339
Abort(52) on node 2337 (rank 2337 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 52) - process 2337
slurmstepd: error: *** STEP 990701.0 ON hercules-02-05 CANCELLED AT 2024-04-11T08:32:10 ***
srun: error: hercules-06-42: tasks 2308-2339: Killed
srun: Terminating StepId=990701.0
srun: error: hercules-03-36: tasks 1120-1187: Killed
srun: error: hercules-02-06: tasks 80-159: Killed
srun: error: hercules-08-46: tasks 4180-4259: Killed
srun: error: hercules-03-29: tasks 560-639: Killed
srun: error: hercules-02-07: tasks 160-239: Killed
srun: error: hercules-03-26: tasks 320-399: Killed
srun: error: hercules-08-47: tasks 4260-4339: Killed
srun: error: hercules-02-57: tasks 240-319: Killed
srun: error: hercules-03-27: tasks 400-479: Killed
srun: error: hercules-03-34: tasks 960-1039: Killed
srun: error: hercules-03-32: tasks 800-879: Killed
srun: error: hercules-03-30: tasks 640-719: Killed
srun: error: hercules-03-33: tasks 880-959: Killed
srun: error: hercules-03-35: tasks 1040-1119: Killed
srun: error: hercules-03-31: tasks 720-799: Killed
srun: error: hercules-08-49: tasks 4420-4499: Killed
srun: error: hercules-06-27: tasks 1828-1907: Killed
srun: error: hercules-07-23: tasks 3140-3219: Killed
srun: error: hercules-06-32: tasks 2228-2307: Killed
srun: error: hercules-06-05: tasks 1268-1347: Killed
srun: error: hercules-06-04: tasks 1188-1267: Killed
srun: error: hercules-06-20: tasks 1748-1827: Killed
srun: error: hercules-07-25: tasks 3300-3379: Killed
srun: error: hercules-06-09: tasks 1588-1667: Killed
srun: error: hercules-08-48: tasks 4340-4419: Killed
srun: error: hercules-06-06: tasks 1348-1427: Killed
srun: error: hercules-06-30: tasks 2068-2147: Killed
srun: error: hercules-06-31: tasks 2148-2227: Killed
srun: error: hercules-06-28: tasks 1908-1987: Killed
srun: error: hercules-06-07: tasks 1428-1507: Killed

I repeated the same WW3 update on Hera; there is no such error message there and the model runs. The error only happens on Hercules.

Could you please let me know whether it could be caused by the configuration or the library settings on Hercules?

Thank you very much!

Yunfang
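
An editorial aside, not part of the original exchange: the repeated EXTCDE MPI_ABORT, IEXIT= 52 lines mean the ww3_shel ranks called WW3's EXTCDE error routine with exit code 52, which then calls MPI_Abort (as the srun log confirms). One way to tie code 52 to a specific check is to grep the source for the call site; a minimal sketch, assuming a standard WW3 checkout with Fortran sources under model/src (older trees use .ftn instead of .F90):

cd WW3/model/src                          # adjust to the local checkout path (assumption)
grep -rn "EXTCDE" . | grep "52"           # list the CALL EXTCDE ( 52 ) sites
grep -rn -B5 "EXTCDE *( *52" *.F90        # show a few lines of context around each call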

aliabdolali commented 5 months ago

It might be a memory issue. Can you try using half of the cores per node, so that the node memory is redistributed over fewer cores? Also try a smaller setup and see if the error persists.
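
To make that suggestion concrete: the srun log above shows roughly 80 tasks per node (e.g., tasks 80-159 on one node), so halving that leaves each MPI task twice the memory. A minimal sketch of the corresponding Slurm directives, assuming an 80-core Hercules node and the ww3_shel executable from the thread; the node count is a placeholder, not the actual job card:

#SBATCH --nodes=60                 # placeholder node count
#SBATCH --ntasks-per-node=40       # half of the 80 cores per node, so twice the memory per task
#SBATCH --exclusive                # keep the whole node (and all of its memory) for this job

srun ./ww3_shel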

yunfangsun commented 5 months ago

Hi @aliabdolali ,

My original configuration uses 4500 processors, and I have reduced it to 2500, 1500, and 500 processors; none of these worked. I also increased #SBATCH --mem to 9000, but it didn't help either.
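
For reference, the job-card settings described above might look like the sketch below; this is an illustration, not the actual script from this run, and the task count shown is just one of the values tried:

#SBATCH --ntasks=500          # reduced from the original 4500 (2500 and 1500 were also tried)
#SBATCH --mem=9000            # Slurm reads this as 9000 MB per node; --mem=0 would request all of a node's memory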

Now I will try WW3 on Hercules with a small-domain case to see if it works.

sbanihash commented 5 months ago

@yunfangsun I was able to run ww3_ufs1.1 with version 8b5e91 without any issue. Please check my setup on Hercules at /work/noaa/marine/sbani/UFS_COASTAL/test_04302024/WW3/regtests/ww3_ufs1.1 and let me know if you need help with anything.
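
One hedged way to use that setup for comparison (not something stated in the thread) is to diff its environment and batch settings against the failing run; the file layout inside the regtest directory is an assumption:

WORK=/work/noaa/marine/sbani/UFS_COASTAL/test_04302024/WW3/regtests/ww3_ufs1.1
ls -R "$WORK" | less                                              # inspect which work directories and logs the successful run produced
grep -rn "module load\|SBATCH\|srun" "$WORK" 2>/dev/null | less   # collect module/launcher settings to compare with the failing job card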

yunfangsun commented 5 months ago

@sbanihash Thank you very much for the testing. Could you please change the permissions on these two folders: /work/noaa/marine/sbani/UFS_COASTAL/test_04302024/WW3/regtests/ww3_ufs1.1 and /work/noaa/marine/sbani/UFS_COASTAL/test_04302024/WW3/regtests/matrix13?

I can't access them because of a "Permission denied" error.

Thank you again!
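
For reference, the kind of permission change being requested might look like the sketch below, run by the directory owner; the choice of chmod vs. setfacl and the <username> placeholder are assumptions, not commands from the thread:

# open the two folders for group/other read and directory traversal
chmod -R g+rX,o+rX /work/noaa/marine/sbani/UFS_COASTAL/test_04302024/WW3/regtests/ww3_ufs1.1
chmod -R g+rX,o+rX /work/noaa/marine/sbani/UFS_COASTAL/test_04302024/WW3/regtests/matrix13
# or grant a single user read access via an ACL instead of opening it to everyone
setfacl -R -m u:<username>:rX /work/noaa/marine/sbani/UFS_COASTAL/test_04302024/WW3/regtests/ww3_ufs1.1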

janahaddad commented 1 month ago

@yunfangsun can you remind me where we left this issue?

yunfangsun commented 1 month ago

With the updated version of WW3, I can run WW3 without error messages on Hercules.

janahaddad commented 1 month ago

thx @yunfangsun !