schism-dev / schism

Semi-implicit Cross-scale Hydroscience Integrated System Model (SCHISM)
http://ccrm.vims.edu/schismweb/
Apache License 2.0

Question: Is AWS Pcluster supported by SCHISM? #88

Open zeekus opened 1 year ago

zeekus commented 1 year ago

Has anyone gotten SCHISM working on AWS ParallelCluster (pcluster) 3.2?

I created an 8-node cluster on AWS with 256 cores and 512 GB of RAM, but the run is segfaulting.

SlurmQueues:

OS: centos7
Modules loaded:
  hdf5-1.12.2-gcc-4.8.5-omqotpp
  openmpi-4.1.4-gcc-4.8.5-23hmmfu
  netcdf-fortran-4.5.4-gcc-4.8.5-y6iccqw
  netcdf-c-4.8.1-gcc-4.8.5-2eml4r3
Compiled with: GNU Fortran (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Binary: /modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL -version

schism v5.9.0mod git hash 2e289ae (20 commits since semantic tag, edits=True)

My mpirun call (I am trying to tell MPI to run 32 processes on each node; is this syntax correct?):

/opt/parallelcluster/shared/spack/opt/spack/linux-centos7-haswell/gcc-4.8.5/openmpi-4.1.4-23hmmfud3rw4njh3m5ilmukatjrgn4i2/bin/mpirun --hostfile hostnames.txt -n 32 --map-by node /modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL
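On the syntax question: with Open MPI, `-n 32` is the total number of processes and `--map-by node` spreads them round-robin over the nodes, so if hostnames.txt lists all 8 nodes this starts 4 processes per node rather than 32. Below is a minimal sketch of one way to ask for 32 per node, assuming Open MPI 4.x's `ppr` ("processes per resource") mapping and that every node offers at least 32 slots. The variable names are illustrative, and the script only prints the command so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: build an Open MPI launch line that places a fixed number of ranks on each node.
# The values mirror the cluster described in the post (8 nodes, 32 ranks per node).
nodes=8
ranks_per_node=32
total_ranks=$((nodes * ranks_per_node))

# Binary path copied from the post; mpirun is assumed to be on PATH
# once the openmpi module is loaded.
binary=/modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL

# ppr:N:node asks the mapper for N processes per node.
launch_cmd="mpirun --hostfile hostnames.txt -n ${total_ranks} --map-by ppr:${ranks_per_node}:node ${binary}"

# Print the command for review instead of executing it.
echo "${launch_cmd}"
```

Note that the mapping only controls placement. The segfault also appears in the single-process run without mpiexec, so the launch syntax is probably not its cause.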

The job output says:

7 total processes killed (some possibly by mpirun during cleanup)
did we get an error 139

The error file says invalid memory references.
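If the 139 in the job output is the exit status of the run, it is consistent with the SIGSEGV messages: a shell reports a process killed by signal N as 128 + N, and 139 - 128 = 11. A small sketch of that decoding follows; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: decode a shell exit status above 128 into the signal that ended the process.
exit_status=139
signal_number=$((exit_status - 128))

# kill -l <number> prints the name of the signal with that number.
signal_name=$(kill -l "$signal_number")

echo "exit status ${exit_status} = signal ${signal_number} (${signal_name})"
```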

[centos@ip-10-137-0-172 tune44h]$ cat job.78.err
Currently Loaded Modulefiles:
  1) netcdf-c-4.8.1-gcc-4.8.5-2eml4r3
  2) netcdf-fortran-4.5.4-gcc-4.8.5-y6iccqw
  3) hdf5-1.12.2-gcc-4.8.5-omqotpp
  4) openmpi-4.1.4-gcc-4.8.5-23hmmfu
Warning: Permanently added 'compute-dy-slurmworkers-4,10.137.0.181' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-7,10.137.0.179' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-3,10.137.0.137' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-6,10.137.0.161' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-5,10.137.0.133' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-2,10.137.0.143' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-dy-slurmworkers-8,10.137.0.182' (ECDSA) to the list of known hosts.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

#0 0x7F27901B96D7
#1 0x7F27901B9D1E
#2 0x7F278F4983FF
#3 0x53C5D0 in calkwq_
#4 0x554151 in ecosystem_
#5 0x45F1EC in schismstep
#6 0x404C4B in schismmain

zeekus commented 1 year ago

I seem to get a similar error when I run this on our controller node without mpiexec.

[centos@ip-10-137-0-172 tune44h]$ /modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

#0 0x2B815BBEC6D7
#1 0x2B815BBECD1E
#2 0x2B815C89B3FF
#3 0x53C5D0 in calkwq_
#4 0x554151 in ecosystem_
#5 0x45F1EC in schismstep
#6 0x404C4B in schismmain
#7 0x404CB6 in MAIN__ at schism_driver.F90:?

Segmentation fault (core dumped)
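The backtrace names the routines (calkwq_, ecosystem_) but gives no line numbers. Here is a sketch of how the frame addresses might be mapped to source lines with binutils' `addr2line`, assuming the binary is not position-independent and was built with debug information (gfortran's `-g`); without `-g` only the function names come back. The loop is skipped when `addr2line` or the binary is not available.

```shell
#!/usr/bin/env bash
# Sketch: map the named frames of the backtrace to function names and source lines.
binary=/modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL

# Addresses of frames #3 to #6, copied from the backtrace.
frames="0x53C5D0 0x554151 0x45F1EC 0x404C4B"

resolved=0
if command -v addr2line >/dev/null 2>&1 && [ -r "$binary" ]; then
  for addr in $frames; do
    # -f also prints the function name, -e selects the executable to inspect.
    addr2line -f -e "$binary" "$addr"
    resolved=$((resolved + 1))
  done
else
  echo "addr2line or ${binary} not available; nothing resolved" >&2
fi
```

Rebuilding with gfortran's `-g -fbacktrace -fcheck=all` is another option: the run is slower, but out-of-bounds accesses are then reported with the array name and source line instead of a bare SIGSEGV.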

zeekus commented 1 year ago

It appears pschism will work on AWS pcluster if the Intel Fortran compiler (ifort) and Intel MPI libraries are used. I am thinking that this code may not be compatible with gfortran. Has anyone successfully run pschism built with gfortran? To get things to compile on our end I had to modify two files. Maybe there are other gfortran compatibility issues I am missing.

Files and lines modified:

  1. File: ./icm_Balg/src/Hydro/schism_init.F90:5451.114
  2. File: ./icm_Balg/src/ICM/icm_sed_flux.F90

Ref: ./icm_Balg/src/Hydro/schism_init.F90:5451.114
Summary: array constructor issue. All strings in the constructor need to be the same length for gfortran to compile it.

','CPOC ','tlfveg ','tstveg ','trtveg ','hcanveg','lfsav ','stsav ','rts

Error: Different CHARACTER lengths (7/6) in array constructor at (1)
make[3]: *** [Hydro/CMakeFiles/hydro.dir/schism_init.F90.o] Error 1
make[2]: *** [Hydro/CMakeFiles/hydro.dir/all] Error 2
make[1]: *** [Driver/CMakeFiles/pschism.dir/rule] Error 2
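gfortran is following the Fortran standard here: unless the array constructor carries an explicit type-spec such as `[character(len=7) :: ...]`, every character element must have the same length. Some compilers accept mixed lengths as an extension. Below is a sketch of one way to produce equal-length literals by padding with blanks; `pad_literals` is a hypothetical helper, not part of SCHISM, and the names are a few taken from the truncated line above.

```shell
#!/usr/bin/env bash
# Sketch: print Fortran character literals right-padded with blanks to one width.
# pad_literals is a hypothetical helper, not part of SCHISM.
pad_literals() {
  width=$1
  shift
  out=""
  for name in "$@"; do
    # %-Ns left-justifies the name and pads it with blanks to N characters.
    out="${out}$(printf "'%-${width}s'" "$name"),"
  done
  # Drop the trailing comma before printing.
  printf '%s\n' "${out%,}"
}

pad_literals 7 CPOC tlfveg hcanveg lfsav
```

The width should be the length of the longest name in the list (7 for `hcanveg` in this excerpt).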

Ref: ./icm_Balg/src/ICM/read_icm_input.F90:326
Summary: the preprocessor run by gfortran seems to treat '/*' inside a Fortran comment as the start of a C-style comment that is never closed (see the error at icm_sed_flux.F90:1385 below). Line 326 of read_icm_input.F90 only produces a warning.

Output:
/modeling/pschism/icm_Balg/src/ICM/read_icm_input.F90:326:0: warning: extra tokens at end of #endif directive [enabled by default]
 #endif ICM_PH
 ^
/modeling/pschism/icm_Balg/src/ICM/icm_sed_flux.F90:1385:0: error: unterminated comment
 !with all state variables in unit of g/*, no need to transfer
 ^
Error: Unexpected end of file in '/modeling/pschism/icm_Balg/src/ICM/icm_sed_flux.F90'
make[3]: *** [ICM/CMakeFiles/icm.dir/icm_sed_flux.F90.o] Error 1
make[3]: *** Waiting for unfinished jobs....
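Files with the uppercase .F90 suffix are passed through the C preprocessor before gfortran compiles them, which is where the "unterminated comment" error comes from. Here is a sketch of how other occurrences of '/*' could be located before building; `find_c_comment_openers` is a hypothetical helper name, and the demo runs on a temporary copy of the offending comment rather than on the real source tree.

```shell
#!/usr/bin/env bash
# Sketch: list lines in preprocessed Fortran sources (.F90) that contain "/*".
# find_c_comment_openers is a hypothetical helper name.
find_c_comment_openers() {
  # -r: recurse, -n: show line numbers, -F: treat the pattern as a fixed string.
  grep -rnF --include='*.F90' '/*' "$1"
}

# Self-contained demo: one file with the offending comment, one clean file.
demo_dir=$(mktemp -d)
printf '%s\n' '!with all state variables in unit of g/*, no need to transfer' > "$demo_dir/demo.F90"
printf '%s\n' '!units are g per m2' > "$demo_dir/clean.F90"

# Only demo.F90 should be reported.
find_c_comment_openers "$demo_dir"

rm -rf "$demo_dir"
```

On the real tree the call would be `find_c_comment_openers ./icm_Balg/src`. Each hit can then be reworded so that '/' and '*' are no longer adjacent.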

josephzhang8 commented 1 year ago

Hi Teddy:

Many of us have used gcc. The error you mentioned below seems to be from an older tag (5.7.0?) and has been fixed in newer tag v5.10.0.

-Joseph

Y. Joseph Zhang Web: schism.wiki Office: 804 684 7466


josephzhang8 commented 1 year ago

Since ICM is very important for EPA/CBPO, v5.10.0 is the one to use. Zhengui has spent a lot of time cleaning up ICM for the past year.

-Joseph


zeekus commented 1 year ago

Thanks. It seems I was using the wrong version.

Version I had pulled originally:

[centos@ip-10-137-0-172 tune44h]$ /modeling/pschism/icm_Balg/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL -version

schism v5.9.0mod git hash 2e289ae (20 commits since semantic tag, edits=False)

Version 5.10 build:

[centos@ip-10-137-0-172 tune44h]$ /modeling/pschism/icm_Balg_v5.10/build/bin/pschism_ICM_ANALYSIS_PREC_EVAP_TVD-VL -version

schism develop git hash aaa98b3

The proper version seems to be running.

[centos@ip-10-137-0-172 tune44h]$ squeue
JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
  248   compute pschism_ centos  R 19:13     8 compute-dy-slurmworkers-[1-8]

ifort (IFORT) 2021.6.0 20220226 Copyright (C) 1985-2022 Intel Corporation. All rights reserved.