Closed jiandewang closed 2 years ago
@jiandewang in order to investigate this, we (@DeniseWorthen and @climbfuji) need a fully self-contained run directory that we can work with. That means an experiment directory with all input files, configuration files, and the job submission script. Can you provide this on hera, please? Thanks.
run dir which contains all input and configuration files: /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/DATAROOT/R_20120101/2012010100/gfs/fcst.125814
run log: /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/COMROOT/R_20120101/logs/2012010100/gfs.forecast.highres.log.0
this run is through the workflow, thus there is no job_card (as in rt.sh) in the run dir
I will not be able to work on this unless I get a job submission script. I believe rocoto can dump it out using some verbose flag. @JessicaMeixner-NOAA knows.
So I printed out the memory profile from the p7b runs, and the memory usage is lower in the runs from the workflow, so my thought was that maybe it's an environment variable we just need to set in the workflow. I'm planning on setting up a run directory and then using a job_card from rt.sh (appropriately changed) to see if that will run. Either way I'll have a run directory w/job_card at the end of it.
I do know that you can get that job submission script dumped out, but I haven't done that in forever; I'll see if I can dig out those instructions.
Thanks, Jessica. I was hoping to be able to use Forge DDT and MAP to see what is going on. A self-contained run directory will be very helpful for this.
Check the following section in the log file, compare it to the p7b rt run, and update HERA.env to increase stack sizes if needed, or add/remove certain env variables:
0 + . /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/UFS-P7c/env/HERA.env fcst
00 + '[' 1 -ne 1 ']'
00 + step=fcst
00 + export npe_node_max=40
00 + npe_node_max=40
00 + export 'launcher=srun --export=ALL'
00 + launcher='srun --export=ALL'
00 + export OMP_STACKSIZE=2048000
00 + OMP_STACKSIZE=2048000
00 + export NTHSTACK=1024000000
00 + NTHSTACK=1024000000
00 + ulimit -s unlimited
00 + ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1540672
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 94208000
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1540672
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
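One way to do this comparison systematically is to dump the environment (and `ulimit -a`) from inside both job scripts and diff the dumps. A minimal sketch; the file contents below are fabricated placeholders (the `512M` value is illustrative, not the actual rt.sh setting):

```shell
# Sketch: capture "env | sort" (or "ulimit -a") inside each job script into a
# file, then diff the two. The sample dumps here are fabricated for illustration.
mkdir -p /tmp/envcheck
printf 'NTHSTACK=1024000000\nOMP_STACKSIZE=2048000\n' > /tmp/envcheck/env.workflow
printf 'NTHSTACK=1024000000\nOMP_STACKSIZE=512M\n'    > /tmp/envcheck/env.rt
# Only the settings that differ between the two runs show up:
diff /tmp/envcheck/env.workflow /tmp/envcheck/env.rt || true
```

The same `diff` works on `ulimit -a` output to catch stack-size differences between the workflow job and the rt.sh job_card.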
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-898581853, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKY5N2LTL3WCANKYZHKHEP3T4VB67ANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .
-- Fanglin Yang, Ph.D. Chief, Model Physics Group Modeling and Data Assimilation Branch
NOAA/NWS/NCEP Environmental Modeling Center
https://www.emc.ncep.noaa.gov/gmb/wx24fy/fyang/
@yangfanglin I agree it's likely something in the workflow's HERA.env file that needs to be updated. In a log file from a p7b run (/scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_73915/cpld_bmark_wave_v16_p7b_35d_2013040100/err) I found:
but the OMP_STACKSIZE seems larger in the workflow, so? I'm working on setting up the canned case now; hopefully I'll have it soon.
I've created a canned case on hera here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/CannedCaseInput
My hope is that you can copy this directory to yours and then just "sbatch job_card" but it hasn't been tested yet, so not 100% sure this works yet. The job_card is from rt.sh -- which is what Rahul suggested earlier and would be testing along the same lines as Fanglin was suggesting with it perhaps being an environment variable issue. I'll update the issue after my test goes through.
The canned case is running for me now (the first time I submitted I had a module load error, but resubmission worked so?). Now we'll have to wait a couple of hours to see if the different environmental variables mean we don't get the same memory errors.
Great progress! I'll wait for the outcome of your experiment before spending time on this.
See the output folder /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try02:
On day 18 in the err file we have:
472: forrtl: severe (174): SIGSEGV, segmentation fault occurred
472: Image PC Routine Line Source
472: ufs_model 000000000506C6BC Unknown Unknown Unknown
472: libpthread-2.17.s 00002B3D55DFF630 Unknown Unknown Unknown
472: libmpi.so.12 00002B3D55471AF9 MPI_Irecv Unknown Unknown
472: libmpifort.so.12. 00002B3D54EA32A0 mpi_irecv Unknown Unknown
472: ufs_model 00000000041A7FCB mpp_mod_mp_mpp_tr 126 mpp_transmit_mpi.h
472: ufs_model 00000000041DEE25 mpp_mod_mp_mpp_re 170 mpp_transmit.inc
472: ufs_model 0000000004338962 mpp_domains_mod_m 713 mpp_group_update.h
472: ufs_model 000000000245F3C0 fv_mp_mod_mp_star 762 fv_mp_mod.F90
472: ufs_model 00000000020CDEC2 dyn_core_mod_mp_d 931 dyn_core.F90
472: ufs_model 000000000211B93A fv_dynamics_mod_m 651 fv_dynamics.F90
472: ufs_model 000000000209AEC0 atmosphere_mod_mp 683 atmosphere.F90
472: ufs_model 0000000001FCBEAE atmos_model_mod_m 793 atmos_model.F90
472: ufs_model 0000000001E9BB0A module_fcst_grid_ 785 module_fcst_grid_comp.F90
So even with the environment variables from rt.sh we still seem to be running into a memory problem. This log file does not have an explicit "ran out of memory" message, but I'm assuming that's what the SIGSEGV here is. I missed the setting for turning on the PET logs with the ESMF memory profile information, so there will be a Try03 folder with that info soon.
Okay, so I went back and looked at all the log files from runs that @jiandewang made (/scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/COMROOT/R_201/logs/201/gfs.forecast.highres.log) and only one of those failed explicitly because of Out of Memory. The run I made with memory profiles turned on (/scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try03) does not seem to show any more memory use than normal? I have seen memory errors fail as SIGSEGV before, but I guess I'm wondering whether we have a memory error or something else?
the numbers in /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/EXPROOT/R_20120101/config.fv3 do not add up. npe_fv3 cannot be 288 if layout_x_gfs=12 and layout_y_gfs=16. The setting WRTTASK_PER_GROUP_GFS=88 is also odd. You may want to increase WRITE_GROUP_GFS as well.
@yangfanglin this is probably an issue of the old versus CROW configuration; the values used in the forecast directory seem fine to me (/scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/DATAROOT/R_20120101/2012010100/gfs/fcst.125814):

In nems.configure:
MED_petlist_bounds: 0 1151
ATM_petlist_bounds: 0 1239

in input.nml:
&fv_core_nml
  layout = 12,16
  io_layout = 1,1

in model_configure:
write_groups: 1
write_tasks_per_group: 88

And 12*16*6 = 1152 (which is the # in the mediator pet list in nems.configure) and 1152+88 = 1240 (which matches the atm pet list)
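As a quick sanity check, the task-count bookkeeping above can be redone with shell arithmetic (a 12x16 layout per cube face, 6 faces, plus one 88-task write group):

```shell
# PE-count bookkeeping for the configuration discussed above.
layout_x=12; layout_y=16; faces=6; write_tasks=88
compute_pes=$(( layout_x * layout_y * faces ))   # forecast (compute) tasks
total_atm=$(( compute_pes + write_tasks ))       # compute + write group
echo "compute=$compute_pes total=$total_atm"
# prints: compute=1152 total=1240
```

1152 matches MED_petlist_bounds (0-1151) and 1240 matches ATM_petlist_bounds (0-1239).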
The 88 might be an odd number but it means that the write group is filling out an entire node and not sharing with another component -- this is the configuration I got to run (after having memory problems w/the write group) for p6.
@yangfanglin since we only write output every 6 hours, having 1 write group has always been sufficient in terms of writing efficiency; is there some reason to have multiple write groups for memory?
@JessicaMeixner-NOAA the error in the log file depends on which node the system detects as having the issue, so they will not be the same. We are lucky that one of the log files contains the "out of memory" info. The fact that all the jobs were killed by the system is a clear indication that there is some memory issue.
I think we can double the threads to check if it is a memory issue, right?
@bingfu-NOAA right now we are using 2 threads and the model died at day 18; using 4 threads will slow down the system and we will not be able to finish a 35-day run in 8 hours. In fact, in one of my tests I used 225s for fv3 and the model died at day 13.
The test where the 4-thread run slowed down was also using a different layout for the atm model, trying not to use double the nodes. I can try one test with just an increased thread count (which in theory shouldn't slow it down) just to see if it's really memory or not. It'll probably take a while to get through the queue, but I'll report back when I have results.
Okay, it does not appear that the 4-thread slowdown was just because I used a smaller atm layout; even using the same atm layout, it's much slower. I don't think we'll make it to the 18 days we reached with 2 threads.
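For context on why doubling threads needs more resources: with npe_node_max=40 and the 1240 ATM tasks from this configuration, tasks per node is cores/threads, so going from 2 to 4 threads doubles the node count if the layout stays fixed. Illustrative arithmetic only:

```shell
# Node-count arithmetic using numbers from this thread
# (1240 ATM tasks, npe_node_max=40).
tasks=1240; cores_per_node=40
for threads in 2 4; do
  per_node=$(( cores_per_node / threads ))          # MPI tasks per node
  nodes=$(( (tasks + per_node - 1) / per_node ))    # ceiling division
  echo "threads=$threads tasks_per_node=$per_node nodes=$nodes"
done
# prints: threads=2 tasks_per_node=20 nodes=62
#         threads=4 tasks_per_node=10 nodes=124
```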
Are all the components using same number of threads? Otherwise it won't help to increase threads for one component. Also does the PET log files show that memory is increasing during the integration? If yes, which component is it?
Yes, all the components are using the same number of threads, and the simulation slows down which I would not expect.
Yes, the PET log files show that memory is increasing during the integration. You can find that for example here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try03 for an atm pet, a write group pet, and ocean. Ice and wave do not have any memory information available. That is a 2 thread job.
The 4-thread run directory can be seen here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thr4/DATAROOT/testthr4/2013040100/gfs/fcst.25077 with the log file here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thr4/COMROOT/testthr4/logs/2013040100/gfs.forecast.highres.log, which only got to day 12 before being killed when the 8-hour wall clock ran out.
I was able to run a successful 35 day run (the same as the canned case on hera, but through the workflow) on Orion. I did try to just update to the most recent version of ufs-weather-model on hera, and confirmed that also is dying with SIGTERM errors.
I ran a test where I set FHMAX=840 (my way of turning off I/O for the atm model) and the model still failed at day 18 (the first run died with a failed node also on day 18).
Based on suggestions from the coupling tag-up, the next steps I will try will be to:
-- Turn off waves
-- Turn debug on (without waves)
-- Run CMEPS on different tasks
-- Turn on/off different recently added options from p7c that were not in p7b
-- Run 1 thread
All other suggestions are welcome. I'll report on results as I get them.
As expected, running with 1 thread we only got through 6 days of simulation:
Rundir: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thread1/DATAROOT/thread01/2013040100/gfs/fcst.207953
log: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thread1/COMROOT/thread01/logs/2013040100/gfs.forecast.highres.log

The run without waves is still running:
Rundir: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/nowave/DATAROOT/nowave02/2013040100/gfs/fcst.154732
log: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/nowave/COMROOT/nowave02/logs/2013040100/gfs.forecast.highres.log

Running with different atm physics settings (most of the jobs are still in the queue):
With lheatstrg and lseaspray set to false: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try08nolheat
With do_ca false: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try07noca
Without MERRA2: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try06nomerra
A job running with debug is in the queue. I'll post more updates when I have them.
The run turning do_ca=false succeeded in running 35 days, all my other tests so far have failed. In the log files with do_ca=true, there are lots of statements such as:
192: CA cubic mosaic domain decomposition
192: whalo = 1, ehalo = 1, shalo = 1, nhalo = 1
192: X-AXIS = 320 320 320 320 320 320 320 320 320 320 320 320
192: Y-AXIS = 240 240 240 240 240 240 240 240 240 240 240 240 240 240 240 240
However, if you look at the log file, the "domain decomposition" is only written once for the various "MOM" and "Cubic" domains. I'm trying to see if I can add memory profile statements to see whether this is an issue or not, but could this maybe be done only once for the CA @lisa-bengtsson? Any other ideas of where we might have memory leaks with do_ca=true?
Sorry, I have not seen that before, did the debug run indicate anything? It is great if you could add memory profile statements, the halo exchange is in update_ca.F90 in the routine evolve_ca_sgs, that could be a start perhaps?
The routine is called update_cells_sgs inside update_ca.F90.
My suspicion is that in cellular_automata_sgs.F90 I set up this higher resolution CA domain:
!Get CA domain
call define_ca_domain(domain,domain_ncellx,ncells,nxncells,nyncells)
call mpp_get_data_domain    (domain_ncellx,isdnx,iednx,jsdnx,jednx)
call mpp_get_compute_domain (domain_ncellx,iscnx,iecnx,jscnx,jecnx)
!write(1000+mpp_pe(),*) "nxncells,nyncells: ",nxncells,nyncells
!write(1000+mpp_pe(),*) "iscnx,iecnx,jscnx,jecnx: ",iscnx,iecnx,jscnx,jecnx
!write(1000+mpp_pe(),*) "isdnx,iednx,jsdnx,jednx: ",isdnx,iednx,jsdnx,jednx
nxc  = iecnx-iscnx+1
nyc  = jecnx-jscnx+1
nxch = iednx-isdnx+1
nych = jednx-jsdnx+1
nx_full = int(ncells,kind=8)*int(npx-1,kind=8)
ny_full = int(ncells,kind=8)*int(npy-1,kind=8)
This is called each time step, but only has to be called once. I will do a test where I put this inside an (if first time step) condition and save the domain_ncellx information. I will get back to you shortly.
@JessicaMeixner-NOAA what should I look for in the log file in terms of evidence of memory leak?
Lisa,
Set "print_esmf: .true." in model_configure before you run the model. Then check PET*.ESMF_LogFile to see memory usage after the run is completed. See /scratch1/NCEPDEV/stmp2/Fanglin.Yang/RUNDIRS/gfsv17_c384/2019070100/gfs/fcst.120319 as an example of an atmos-only run.
In addition to setting print_esmf: .true. in model_configure, in nems.configure set "ProfileMemory = true" for each of the components.
Then looking at memory in the PET logs I do: grep 'Total allocated space' PET0000.ESMF_LogFile > mem.0000
Lisa, you don't need "ProfileMemory = true" for a standalone atm run, but you do need to run a longer time to see the memory increase (>2 days with ca turned on).
For a 12 hour forecast it doesn't look like anything really changed with this update unfortunately:
In control:
grep 'Total allocated space' PET000.ESMF_LogFile
20210819 142453.131 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: Total allocated space (bytes): 30389584
grep 'Total allocated space' PET149.ESMF_LogFile
20210819 142505.316 INFO PET149 Leaving FV3 ModelAdvance_phase1: - MemInfo: Total allocated space (bytes): 60189760

In updated code:
grep 'Total allocated space' PET000.ESMF_LogFile
20210819 151612.819 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: Total allocated space (bytes): 30394944
grep 'Total allocated space' PET149.ESMF_LogFile
20210819 151624.023 INFO PET149 Leaving FV3 ModelAdvance_phase1: - MemInfo: Total allocated space (bytes): 61611440
Is the expectation that the total allocated space should not increase between PET000 and PET149?
I can run 3 days and see if that changes anything.
@junwang-noaa I saw your email about MOM6, but still thought it could be worth understanding whether any memory leak in the CA can be prevented. To see this 2% increase you mentioned over 14 days, do you compare the beginning of the PET*ESMF_LogFile to the end value? It is confusing, because the time stamp is not in order? (if that is what the first column is?) What are the 0-149 values after PET in the file names? Thanks.
Lisa, let me clarify: in the run without CA we see a slight memory increase (2%) in 35 days. But with CA turned on, the memory doubled in 35 days. So we do need to resolve the issue with CA in order to run P7 with CA on hera. Please look at the "VmPeak" values (the maximum amount of memory the process has used since it was started) in the forecast task PET files, e.g. PET0000.ESMF_LogFile. VmPeak should not increase during the forecast time.
Ok, thanks for clarifying, I will have a look
@junwang-noaa I checked your run at /scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_34396/cpld_control_p7, don't see memory increase here:
20210819 160833.911 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB
20210819 160839.352 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB
20210819 160846.099 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB
20210819 160852.398 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB
20210819 160858.618 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB
20210819 160904.881 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB
20210819 160926.508 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB
20210819 160932.089 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB
@junwang-noaa I don't see VmPeak increasing, but its value is reduced with the updated code I mentioned above. Phil has a unit test working, so we will do some debug tests in that standalone version, which is quicker.
Control:
20210819 153520.209 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB
20210819 153520.209 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB
20210819 153520.249 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB
20210819 153520.250 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB
20210819 153520.860 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB
20210819 153520.861 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB
20210819 153520.876 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB
20210819 153520.876 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB

Updated code:
20210819 153520.336 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1260408 kB
20210819 153520.336 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB
20210819 153520.356 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB
20210819 153520.357 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1260408 kB
20210819 153520.937 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1260408 kB
20210819 153520.938 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB
20210819 153520.958 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB
Jiande and Lisa, the memory info is printed out at every time step, you can see the memory increase by comparing the VmPeak numbers at the beginning and end of the forecast run.
In the coupled C96 run (/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_34396/cpld_control_p7), for atm tasks, PET000.ESMF_LogFile, we have for 16 days:
20210819 153618.006 INFO PET000 Leaving FV3 ModelAdvance: - MemInfo: VmPeak: 1587628 kB
...
20210819 162837.130 INFO PET000 Leaving FV3 ModelAdvance: - MemInfo: VmPeak: 1936048 kB
For MOM6:
20210819 153633.225 INFO PET150 Leaving MOM update_ocean_model:
In Jessica's C384 run without CA:
20210818 211035.500 INFO PET1240 Leaving MOM Model_ADVANCE: - MemInfo: VmPeak: 2141384 kB
...
20210819 012900.523 INFO PET1240 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 2693936 kB
20210819 013216.552 INFO PET1240 Leaving MOM Model_ADVANCE: - MemInfo: VmPeak: 4103020 kB
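For anyone repeating this beginning-vs-end VmPeak comparison, it can be scripted. A sketch; the sample file below is fabricated from the log lines quoted in this thread, so point the grep at a real PET*.ESMF_LogFile instead:

```shell
# Extract the first and last VmPeak values from a PET log to spot growth.
# The sample log is fabricated for illustration; use a real PET*.ESMF_LogFile.
cat > /tmp/PET000.sample <<'EOF'
20210819 153618.006 INFO PET000 Leaving FV3 ModelAdvance: - MemInfo: VmPeak: 1587628 kB
20210819 162837.130 INFO PET000 Leaving FV3 ModelAdvance: - MemInfo: VmPeak: 1936048 kB
EOF
first=$(grep -o 'VmPeak:[[:space:]]*[0-9]*' /tmp/PET000.sample | head -1 | tr -dc '0-9')
last=$(grep -o 'VmPeak:[[:space:]]*[0-9]*' /tmp/PET000.sample | tail -1 | tr -dc '0-9')
echo "first=${first} kB last=${last} kB"
```

If `last` is noticeably larger than `first` on a compute PET, that component is a leak candidate.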
@junwang-noaa @JessicaMeixner-NOAA we can confirm that the updates I proposed fix the CA memory leak. I ran a 35-day C96 forecast; VmPeak starts out at the same value but ends at 1413068 kB in the control (continuing to increase) and at 1256320 kB with the update (not increasing). @pjpegion also confirmed in the standalone test that the updated code solved the memory increase. I will make a branch with this single fix - maybe you can try it, Jessica? It doesn't change any baselines, so maybe it could get in quickly again?
This was information from Phil: "I can confirm the memory leak is fixed. In the original code, the node starts off with 86GB of memory free. When I start the run, the code used 1 GB at the start, so 85 GB are free. After 2400 time-steps there is only 4.7 GB free. In the fixed code, the amount of memory available is steady at 85 GB."
Lisa, how many days have you run the tests?
@JessicaMeixner-NOAA do you think you have time to try this fix in your setup, it is only a change in one routine in the stochastic_physics submodule: https://github.com/lisa-bengtsson/stochastic_physics/tree/bugfix/memoryleak
If not, I can try to check out the whole P7c and test it, but I would need some help with the steps for doing so.
@lisa-bengtsson I would be happy to test
@junwang-noaa I will check closer, the run timed out due to walltime, so didn't finish. I can see the VmPeak values continue to grow in the control_ca run but not in the updated run, this was atmosphere only C96.
In my unit test of the stochastic physics code I ran 24,000 time steps. In the version with the memory leak, the amount of free memory dropped from 85 GB at the start of my test down to 4.7 GB after 24,000 time steps. With the code fix, the amount of free memory is steady at 85 GB. Also, it produces identical results.
Phil, thanks for the results.
@junwang-noaa it timed out after 17 days, the run directories are on Hera for the control ca:
/scratch2/BMC/rem/Lisa.Bengtsson/stmp2/Lisa.Bengtsson/FV3_RT/CA_UNIT_TEST/control_ca
and the updated run: /scratch2/BMC/rem/Lisa.Bengtsson/stmp2/Lisa.Bengtsson/FV3_RT/CA_UNIT_TEST/control_ca_memory
The results look good, thanks Lisa.
Description
All UFS P7c runs (using the workflow) failed at day 18 (using 300s for fv3) or day 13 (using 225s for fv3), most likely due to a memory leak.
To Reproduce:
git clone https://github.com/NOAA-EMC/global-workflow
cd global-workflow
git checkout feature/coupled-crow
git submodule update --init --recursive
sh checkout.sh -c
sh build_all.sh -c
sh link_fv3gfs.sh emc hera coupled
and then use the "prototype7" case file.
Additional context
Output
output logs: one sample run log is saved at /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/UFS-P7c/LOG/gfs.forecast.highres.log.0, error information is around line 297663.
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21542673.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h34m17: task 473: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=21542673.0
slurmstepd: error: *** STEP 21542673.0 ON h33m12 CANCELLED AT 2021-08-11T23:57:15 ***
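A quick way to confirm this failure mode in other run logs is to grep for the Slurm OOM signatures. The sample log below is reconstructed from the lines quoted above for illustration; substitute the real log path:

```shell
# Scan a run log for Slurm out-of-memory evidence.
# Sample log fabricated from the lines quoted in this issue.
cat > /tmp/run.log <<'EOF'
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21542673.0 cgroup.
srun: error: h34m17: task 473: Out Of Memory
EOF
grep -nE 'oom-kill|Out Of Memory' /tmp/run.log
```

Runs killed by the cgroup OOM handler always leave at least one of these two patterns in the log, even when individual tasks report only SIGSEGV or SIGTERM.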
PET files can be found at /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/UFS-P7c/LOG/PET