@GeorgeGayno-NOAA Have there been any changes in chgres_cube from the v1.0 release to the v1.1 release that would lead it to require more memory when processing GRIB2 data? And do you know if/how the memory requirements for processing GRIB2, NEMSIO, and netCDF formats differ?
The GRIB2 option should not require more memory than before. I can run a GRIB2 case on only one of our WCOSS nodes. The netCDF option will require the most memory of the three. You may be running with more MPI tasks (144 and 216) than you need. That can sometimes lead to an error. Try running with 24 MPI tasks.
@GeorgeGayno-NOAA Let me check by reducing the number of processors for chgres. I'll update you soon.
@GeorgeGayno-NOAA: @uturuncoglu has let us know that he is having trouble finding the right number of processors to run chgres with. Do you have a recommendation for the number of processors that should be used for the various App-supported resolutions C96, C192, C384, and C768? And should the number of processors vary depending on the format of the data that chgres is reading in?
I don't have access to Cheyenne, but I can run some tests on Hera, Jet, or Orion. Is Cheyenne similar to those machines?
Cheyenne has 2x18-core 2.3-GHz Intel Xeon E5-2697V4 (Broadwell) processors and 64 GB of DDR4-2400 memory per node. I believe Hera has more, but the NOAA RDHPC docs aren't working for me at the moment. Here is the Cheyenne documentation: https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne
I ran a test on xJet, which has 64 GB memory per node. Input was 0.5-degree grib2 data. It was mapped to a C768 global uniform grid with 64 atmos. levels. It ran on two nodes with six tasks per node. I don't know how close I was to the memory limit. To be safe (and to reduce wall clock time a bit), you can try three nodes, six tasks per node. The total number of tasks must be a multiple of six per ESMF requirements. You should not have to run chgres with hundreds of tasks.
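To make that layout concrete on Cheyenne, here is a minimal PBS job sketch requesting the suggested 3 nodes with 6 tasks per node (18 total tasks, a multiple of six); the job name, account, queue, wall time, and executable path are placeholders, not values from this thread:

```bash
#!/bin/bash
#PBS -N chgres_cube
#PBS -A <project_account>
#PBS -q regular
#PBS -l walltime=00:30:00
# 3 nodes, 6 MPI ranks per node (Cheyenne nodes have 36 cores and 64 GB each)
#PBS -l select=3:ncpus=36:mpiprocs=6

cd $PBS_O_WORKDIR

# 18 total tasks -- the total must be a multiple of six per the ESMF requirement
mpiexec_mpt ./chgres_cube.exe
```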
@GeorgeGayno-NOAA Thanks. That is really helpful. BTW, how many cores are in each node?
24 cores per node.
@GeorgeGayno-NOAA I could process grib2 (0.5 deg.) C768 using 4 nodes with 6 tasks per node. I'll try the same configuration with nemsio and netCDF input.
@GeorgeGayno-NOAA It seems that the same configuration fails with nemsio. I'll try to increase the number of nodes.
@GeorgeGayno-NOAA I tried 6 nodes and 8 nodes with 6 tasks per node, but it is still failing. If you don't mind, could you test on your side?
I used 6 nodes, 24 processes per node for C768 on Hera. It worked. @uturuncoglu
@panll Thanks. I am not sure why it is not working in my case. I have 64 GB of memory on each node, which is the same as Hera. I'll try to increase the number of cores per node. BTW, I could process netCDF with the 8 nodes, 6 tasks per node combination.
@panll The same configuration (6 nodes, 24 processes per node for C768) fails on Cheyenne.
@panll @GeorgeGayno-NOAA 8 nodes, 24 processes per node for C768 also fails. Any suggestions?
I believe Hera has 96 GB.
I will try a nemsio test today.
I tried a C768 uniform grid, 64 atmos. levels, using nemsio data as input. On Jet (with 64 GB per node), I was able to get it to run using 6 nodes, 6 tasks per node.
@uturuncoglu Do you think it would be good to try 6 nodes, 6 tasks per node for C768 on Cheyenne?
@ligiabernardet I was off in the morning; I am checking now.
@ligiabernardet @GeorgeGayno-NOAA I am still getting an error with those configurations. CHGRES fails with the error:
rc=<error reading variable: Cannot access memory at address 0x0>
I also increased the number of nodes from 6 to 8, but it is still failing. Any suggestions?
@uturuncoglu - @arunchawla-NOAA was asking if we ran chgres outside of CIME on Cheyenne. Is this something you have tried?
@rsdunlapiv No, I did not try it, but I could check. BTW, I could process GRIB2 without any problem with the 6x6 combination under CIME.
@rsdunlapiv I ran chgres outside of CIME with the 6x6 combination that works on Jet, but it still fails on Cheyenne.
Do you know the exact line where it is failing?
Here is the full trace that I have
MPT: 0x00002afaf83066da in waitpid () from /glade/u/apps/ch/os/lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-49.16.x86_64
MPT: (gdb) #0 0x00002afaf83066da in waitpid ()
MPT: from /glade/u/apps/ch/os/lib64/libpthread.so.0
MPT: #1 0x00002afaf8a45db6 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7ffe589d9180 "MPT ERROR: Rank 3(g:3) received signal SIGSEGV(11).\n\tProcess ID: 6043, Host: r9i2n30, Program: /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/bin/chgres_cube.exe\n\tMPT Version: "...) at sig.c:340
MPT: #3 0x00002afaf8a45fb2 in first_arriver_handler (signo=signo@entry=11,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2afafe620080) at sig.c:489
MPT: #4 0x00002afaf8a4634b in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:564
MPT: #5 <signal handler called>
MPT: #6 0x00002afaf898e9bb in pmpi_abort__ ()
MPT: from /glade/u/apps/ch/opt/mpt/2.19/lib/libmpi.so
MPT: #7 0x00000000005337cd in error_handler (string=...,
MPT: rc=<error reading variable: Cannot access memory at address 0x0>,
MPT: .tmp.STRING.len_V$7=55327936)
MPT: at /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/src/NCEPLIBS/UFS_UTILS/sorc/chgres_cube.fd/utils.f90:11
MPT: #8 0x00000000004fa084 in program_setup::read_setup_namelist ()
MPT: at /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/src/NCEPLIBS/UFS_UTILS/sorc/chgres_cube.fd/program_setup.f90:287
MPT: #9 0x0000000000469b49 in chgres ()
MPT: at /glade/p/ral/jntp/GMTB/tools/NCEPLIBS-ufs-v1.1.0/intel-19.0.5/mpt-2.19/src/NCEPLIBS/UFS_UTILS/sorc/chgres_cube.fd/chgres.F90:63
MPT: #10 0x0000000000456aa2 in main ()
MPT: #11 0x00002afaf933c6e5 in __libc_start_main ()
MPT: from /glade/u/apps/ch/os/lib64/libc.so.6
MPT: #12 0x00000000004569a9 in _start () at ../sysdeps/x86_64/start.S:118
MPT: (gdb) A debugging session is active.
MPT:
MPT: Inferior 1 [process 6043] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/6043/exe, process 6043
@GeorgeGayno-NOAA I also submitted a job to the big-memory nodes, which have 109 GB of memory on each node. I'll run with the 6x6 combination and let you know.
@GeorgeGayno-NOAA It is still failing in the same way. It seems that it is not due to a memory limitation.
@GeorgeGayno-NOAA @ligiabernardet Has there been any change to the input data type on the chgres side for nemsio?
ESMF outputs "PET" files. What do they say? Your trace says the failure is in program_setup. That is very early in the processing.
As far as I know, we were using gaussian before for nemsio, but if I look at chgres there are two options: gaussian_nemsio and gfs_gaussian_nemsio.
It says: FATAL ERROR: UNRECOGNIZED INPUT DATA TYPE. So I am setting the input data type as gaussian. I think it was changed in CHGRES when netCDF support was introduced. Am I right?
NEMSIO -> gaussian
netCDF -> gaussian_netcdf
@ligiabernardet I think that it was changed, and it is now gaussian_nemsio. This is very critical information that I didn't know. I have just submitted the job with it set to gaussian_nemsio, and it is working.
OK, I did not realize that either.
@ligiabernardet @arunchawla-NOAA @GeorgeGayno-NOAA @panll I think that when we implement changes like this that are critical for the CIME interface, we need to share that information or at least discuss it. A lack of information exchange like this causes a lot of extra effort for everyone.
@ligiabernardet We also need to update the documentation about it. I'll make the necessary changes on the CIME side and try to run again.
I have a successful run for C768 on Cheyenne with 12 nodes and 3 CPUs each. Here is the directory: /glade/scratch/lpan/09012020/ufs-mrweather-app-workflow.c768/run @uturuncoglu @ligiabernardet
Hallelujah! OK, we will update the documentation wrt input_type gaussian_nemsio.
@panll Thanks. That is great. I could also run with 6 nodes and 6 CPUs each.
Great! @uturuncoglu
@uturuncoglu What are we still missing? Do you have a processor/node number that works for C384 on Cheyenne?
@ligiabernardet I ran the full test suite, and one of the highest-resolution cases with debug failed after producing a couple of hours of output. It seems it is not related to CHGRES, and I am looking into the logs. C384 is working without any problem, at least for grib2. I'll run more tests with other data types for both C384 and C768 and then update the app for you to test. In the current configuration, CHGRES is set up as follows:
C96 - 2 nodes with 6 cores per node
C192 - 2 nodes with 6 cores per node
C384 - 4 nodes with 6 cores per node
C768 - 6 nodes with 6 cores per node
but to be on the safe side, we could increase the number of nodes for C384 and C768. I am also planning to test it on Stampede, but the NCEP libs need to be installed there first. @climbfuji Did you install NCEPLIBS there? Do you have access to Stampede?
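For illustration only, a hedged shell sketch (not the actual CIME code) of how a run script might map the App resolution to the chgres node/task layout listed above; the script and variable names are hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: choose a chgres_cube layout from the target resolution.
# The node counts mirror the current App defaults quoted in the comment above.
res="$1"   # e.g. C96, C192, C384, C768
case "$res" in
  C96|C192) nodes=2 ;;
  C384)     nodes=4 ;;
  C768)     nodes=6 ;;
  *) echo "unknown resolution: $res" >&2; exit 1 ;;
esac
tasks_per_node=6   # keeps the total task count a multiple of six, per the ESMF requirement
echo "chgres_cube layout for $res: $nodes nodes x $tasks_per_node tasks per node"
```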
It says:
FATAL ERROR: UNRECOGNIZED INPUT DATA TYPE.
So I am setting the input data type as gaussian. I think it was changed in CHGRES when netCDF support was introduced. Am I right?
Yes, when chgres was updated for GFS v16 netCDF files, the input_type names were changed. You should use "gaussian_nemsio" for GFS v15 nemsio files and "gaussian_netcdf" for GFS v16 netcdf files.
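For anyone hitting the same UNRECOGNIZED INPUT DATA TYPE failure, a minimal sketch of the relevant namelist setting; the fort.41 file name and the &config group name follow typical chgres_cube run scripts and should be checked against your own setup:

```bash
# Write the chgres_cube config namelist (typically read as ./fort.41).
# GFS v15 nemsio input  -> input_type = "gaussian_nemsio" (formerly "gaussian")
# GFS v16 netCDF input  -> input_type = "gaussian_netcdf"
cat > fort.41 << 'EOF'
&config
  input_type = "gaussian_nemsio"
/
EOF
```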
@GeorgeGayno-NOAA I am testing GNU on Cheyenne, and for C768 CHGRES is failing with the following error:
- CALL FieldCreate FOR INPUT GRID LONGITUDE.
- CALL FieldScatter FOR INPUT GRID LONGITUDE.
- CALL FieldScatter FOR INPUT GRID LONGITUDE.
- CALL FieldScatter FOR INPUT GRID LATITUDE.
- CALL FieldScatter FOR INPUT GRID LATITUDE.
- CALL FieldScatter FOR INPUT GRID LATITUDE.
- CALL FieldScatter FOR INPUT GRID LATITUDE.
#0 0x2adebec4faff in ???
#1 0x2adebf2d79bb in ???
#0 0x2adebec4faff in ???
#1 0x2adebf2d79bb in ???
#0 0x2b2c37867aff in ???
#1 0x2b2c37eef9bb in ???
#0 0x2b2c37867aff in ???
#1 0x2b2c37eef9bb in ???
#0 0x2b2c37867aff in ???
#1 0x2b2c37eef9bb in ???
#0 0x2b2c37867aff in ???
#1 0x2b2c37eef9bb in ???
#0 0x2b2c37867aff in ???
#1 0x2b2c37eef9bb in ???
MPT ERROR: MPI_COMM_WORLD rank 19 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11
I have no trace at this point because I am using @climbfuji's installation. This resolution runs fine with the Intel compiler. If you test it with the GNU compiler, could you let me know the correct combination?
@GeorgeGayno-NOAA Here is an additional log from C384. BTW, not all of the C384 tests failed.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
solved
@ligiabernardet @climbfuji
I am getting an error on Cheyenne when I try to process the GRIB2 (0.5 deg.) data.
My CHGRES namelist file:
It seems it is a memory issue, and we have failures for all C384 (default 144 cores for CHGRES) and C768 (default 216 cores for CHGRES) cases. Do you have any idea why? It seems that the CHGRES memory requirement is higher than in the previous release, but I am not sure. I could simply update the interface and increase the resources used for those resolutions, but I am not sure whether that is a good idea, and if you remember, we had a similar issue with netCDF input. Let me know what you think.