UFS P7c memory issue - Githubissues

jiandewang commented 2 years ago

Description

All UFS P7c runs (using workflow) failed at day 18 (using 300s for fv3) or day 13(using 225s for fv3), most likely due to memory leak.

To Reproduce:

git clone https://github.com/NOAA-EMC/global-workflow cd global-workflow git checkout feature/coupled-crow git submodule update --init --recursive sh checkout.sh -c sh build_all.sh -c sh link_fv3gfs.sh emc hera coupled

and then use the "prototype7" case file.

Additional context

Add any other context about the problem here. Directly reference any issues or PRs in this or other repositories that this is related to, and describe how they are related. Example:

needs to be fixed also in noaa-emc/nems/issues/
needed for noaa-emc/fv3atm/pull/

Output

Screenshots If applicable, drag and drop screenshots to help explain your problem.

output logs one sample run log is saved at /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/UFS-P7c/LOG/gfs.forecast.highres.log.0, error information is around line 297663. _slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21542673.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler. srun: error: h34m17: task 473: Out Of Memory srun: launch/slurm: _stepsignal: Terminating StepId=21542673.0 slurmstepd: error: *** STEP 21542673.0 ON h33m12 CANCELLED AT 2021-08-11T23:57:15

PET file can be found at /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/UFS-P7c/LOG/PET

paste the code here (if a short section of log)

DomHeinzeller commented 2 years ago

@jiandewang in order to investigate this, we (@DeniseWorthen and @climbfuji) need a fully self-contained run directory that we can work with. That means an experiment directory with all input files, configuration files, and the job submission script. Can you provide this on hera, please? Thanks.

jiandewang commented 2 years ago

run dir which contains all input and configuration files: /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/DATAROOT/R_20120101/2012010100/gfs/fcst.125814

run log: /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/COMROOT/R_20120101/logs/2012010100/gfs.forecast.highres.log.0

this is through workflow thus there is no job_card (as in rt.sh) in run dir

climbfuji commented 2 years ago

run dir which contains all input and configuration files: /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/DATAROOT/R_20120101/2012010100/gfs/fcst.125814

run log: /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/COMROOT/R_20120101/logs/2012010100/gfs.forecast.highres.log.0

this is through workflow thus there is no job_card (as in rt.sh) in run dir

I will not be able to work on this unless I get a job submission script. I believe rocoto can dump it out using some verbose flag. @JessicaMeixner-NOAA knows.

JessicaMeixner-NOAA commented 2 years ago

So I printed out the profile memory from the p7b runs and the memory usage is less in the runs from workflow, so my thought was that maybe it's an environmental variable we just need to use in the workflow. I'm planning on setting a run directory and then using a job_card from the rt.sh (appropriately changed) to see if that will run. Eitherway I'll get a run directory w/job_card at the end of it.

JessicaMeixner-NOAA commented 2 years ago

I do know that you can get that job submission script dumped out but I haven't done that in forever, I'll see if I can dig out those instructions.

climbfuji commented 2 years ago

I do know that you can get that job submission script dumped out but I haven't done that in forever, I'll see if I can dig out those instructions.

Thanks, Jessica. I was hoping to be able to use Forge DDT and MAP to see what is going on. A self-contained run directory will be very helpful for this.

yangfanglin commented 2 years ago

Check the following section in the log file, compare to p7b rt run, and update HERA.env to increase stack sizes if needed, add or remove certain env variable

0 + . /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/UFS-P7c/env/HERA.env fcst 00 + '[' 1 -ne 1 ']' 00 + step=fcst 00 + export npe_node_max=40 00 + npe_node_max=40 00 + export 'launcher=srun --export=ALL' 00 + launcher='srun --export=ALL' 00 + export OMP_STACKSIZE=2048000 00 + OMP_STACKSIZE=2048000 00 + export NTHSTACK=1024000000 00 + NTHSTACK=1024000000 00 + ulimit -s unlimited 00 + ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 1540672 max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) 94208000 open files (-n) 131072 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) unlimited cpu time (seconds, -t) unlimited max user processes (-u) 1540672 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited

On Fri, Aug 13, 2021 at 12:29 PM Dom Heinzeller @.***> wrote:

I do know that you can get that job submission script dumped out but I haven't done that in forever, I'll see if I can dig out those instructions.

Thanks, Jessica. I was hoping to be able to use Forge DDT and MAP to see what is going on. A self-contained run directory will be very helpful for this.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-898581853, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKY5N2LTL3WCANKYZHKHEP3T4VB67ANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- Fanglin Yang, Ph.D. Chief, Model Physics Group Modeling and Data Assimilation Branch

NOAA/NWS/NCEP Environmental Modeling Center

https://www.emc.ncep.noaa.gov/gmb/wx24fy/fyang/ https://www.emc.ncep.noaa.gov/gmb/wx24fy/fyang/

JessicaMeixner-NOAA commented 2 years ago

@yangfanglin I agree it's likely something in the workflow's HERA.env file that needs to be updated, in a log file for p7b output I found (/scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_73915/cpld_bmark_wave_v16_p7b_35d_2013040100/err) :

echo 'Model started: ' Fri Aug 13 02:34:11 UTC 2021
export MPI_TYPE_DEPTH=20
MPI_TYPE_DEPTH=20
export OMP_STACKSIZE=512M
OMP_STACKSIZE=512M
export OMP_NUM_THREADS=2
OMP_NUM_THREADS=2
export ESMF_RUNTIME_COMPLIANCECHECK=OFF:depth=4
ESMF_RUNTIME_COMPLIANCECHECK=OFF:depth=4
export PSM_RANKS_PER_CONTEXT=4
PSM_RANKS_PER_CONTEXT=4
export PSM_SHAREDCONTEXTS=1
PSM_SHAREDCONTEXTS=1

but the OMP_STACKSIZE seems larger in the workflow, so? I'm working on setting up the canned case now. Hopefully will have it soon.

JessicaMeixner-NOAA commented 2 years ago

I've created a canned case on hera here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/CannedCaseInput

My hope is that you can copy this directory to yours and then just "sbatch job_card" but it hasn't been tested yet, so not 100% sure this works yet. The job_card is from rt.sh -- which is what Rahul suggested earlier and would be testing along the same lines as Fanglin was suggesting with it perhaps being an environment variable issue. I'll update the issue after my test goes through.

JessicaMeixner-NOAA commented 2 years ago

The canned case is running for me now (the first time I submitted I had a module load error, but resubmission worked so?). Now we'll have to wait a couple of hours to see if the different environmental variables mean we don't get the same memory errors.

DomHeinzeller commented 2 years ago

The canned case is running for me now (the first time I submitted I had a module load error, but resubmission worked so?). Now we'll have to wait a couple of hours to see if the different environmental variables mean we don't get the same memory errors.

Great progress! I'll wait for the outcome of your experiment before spending time on this.

JessicaMeixner-NOAA commented 2 years ago

See the output folder /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try02:

On day 18 in the err file we have:

 472: forrtl: severe (174): SIGSEGV, segmentation fault occurred
 472: Image              PC                Routine            Line        Source
 472: ufs_model          000000000506C6BC  Unknown               Unknown  Unknown
 472: libpthread-2.17.s  00002B3D55DFF630  Unknown               Unknown  Unknown
 472: libmpi.so.12       00002B3D55471AF9  MPI_Irecv             Unknown  Unknown
 472: libmpifort.so.12.  00002B3D54EA32A0  mpi_irecv             Unknown  Unknown
 472: ufs_model          00000000041A7FCB  mpp_mod_mp_mpp_tr         126  mpp_transmit_mpi.h
 472: ufs_model          00000000041DEE25  mpp_mod_mp_mpp_re         170  mpp_transmit.inc
 472: ufs_model          0000000004338962  mpp_domains_mod_m         713  mpp_group_update.h
 472: ufs_model          000000000245F3C0  fv_mp_mod_mp_star         762  fv_mp_mod.F90
 472: ufs_model          00000000020CDEC2  dyn_core_mod_mp_d         931  dyn_core.F90
 472: ufs_model          000000000211B93A  fv_dynamics_mod_m         651  fv_dynamics.F90
 472: ufs_model          000000000209AEC0  atmosphere_mod_mp         683  atmosphere.F90
 472: ufs_model          0000000001FCBEAE  atmos_model_mod_m         793  atmos_model.F90
 472: ufs_model          0000000001E9BB0A  module_fcst_grid_         785  module_fcst_grid_comp.F90

So even with the environment variables used from rt.sh we still seem to be running into a memory problem. This log file does not have the explicit "ran out of memory" but I'm assuming that's the SIGTERM issue here. I missed the setting for turning the PET logs on with the esmf profile memory information so there will be a Try03 folder with that info soon.

JessicaMeixner-NOAA commented 2 years ago

Okay, so I went back and looked at all the log files from runs that @jiandewang made ( /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/COMROOT/R_201/logs/201/gfs.forecast.highres.log) and only one of those failed because of Out of Memory, the run I made with memory profiles turned on (/scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try03) does not seem to be any more than normal? I have seen memory errors fail as the SIGSEGV before, but I guess I'm wondering if we have a memory error or something else?

yangfanglin commented 2 years ago

the numbers in /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/EXPROOT/R_20120101/config.fv3 do not add up. npe_fv3 cannot be 288 if layout_x_gfs=12 and layout_y_gfs=16. The setting WRTTASK_PER_GROUP_GFS=88 is also odd. You may want to increase WRITE_GROUP_GFS as well.

JessicaMeixner-NOAA commented 2 years ago

@yangfanglin this is probably an issue of the old versus CROW configuration, the values used in the forecast directory seem fine to me (/scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/DATAROOT/R_20120101/2012010100/gfs/fcst.125814): In nems.configure: MED_petlist_bounds: 0 1151 ATM_petlist_bounds: 0 1239

in input.nml: &fv_core_nml layout = 12,16 io_layout = 1,1

in model_configure: write_groups: 1 write_tasks_per_group: 88

And 12166=1152 (which is the # in mediator pet list in nems.configure) and +88 = 1240 (which matches the atm pet list)

The 88 might be an odd number but it means that the write group is filling out an entire node and not sharing with another component -- this is the configuration I got to run (after having memory problems w/the write group) for p6.

JessicaMeixner-NOAA commented 2 years ago

@yangfanglin since we only write output every 6 hours, having 1 write group has always been sufficient in terms of writing efficiency, is there some reason to have multiple write groups for memory?

jiandewang commented 2 years ago

Okay, so I went back and looked at all the log files from runs that @jiandewang made ( /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/COMROOT/R_201/logs/201/gfs.forecast.highres.log) and only one of those failed because of Out of Memory, the run I made with memory profiles turned on (/scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try03) does not seem to be any more than normal? I have seen memory errors fail as the SIGSEGV before, but I guess I'm wondering if we have a memory error or something else?

@JessicaMeixner-NOAA the error in log file depends on which node being detected by system that is having issue so they will not be the same. We are lucky that one of the log file contains "out of memory" info. The fact that all the jobs were being killed by system is a clean indication that there is some memory issue.

bingfu-NOAA commented 2 years ago

I think we can double the threads to check if it is a memory issue, right?

jiandewang commented 2 years ago

@bingfu-NOAA right now we are using 2 threads and model died at day 18, using 4 threading will slow down the system and we will not be able to finish 35day run in 8hr. In fact in one of my testing, I used 225s for fv3 and model died at day 13.

JessicaMeixner-NOAA commented 2 years ago

The test that made the 4thread slow down was because I also used a different layout for atm model trying to not use double the nodes. I can try one test with just increasing the thread count (which shouldn't in theory slow it down) just to see if it's really memory or not. It'll probably take a while to get through the queue, but will report back when I have results.

JessicaMeixner-NOAA commented 2 years ago

Okay, it does not appear that the 4thread slow down was just because I used a smaller atm layout, even using the same atm layout, it's much slower. I don't think we'll make it to the 18 days we reached with 2 threads.

junwang-noaa commented 2 years ago

Are all the components using same number of threads? Otherwise it won't help to increase threads for one component. Also does the PET log files show that memory is increasing during the integration? If yes, which component is it?

On Mon, Aug 16, 2021 at 3:17 PM Jessica Meixner @.***> wrote:

Okay, it does not appear that the 4thread slow down was just because I used a smaller atm layout, even using the same atm layout, it's much slower. I don't think we'll make it to the 18 days we reached with 2 threads.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-899756563, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI7D6TK3B57FJMDEDTMJEZTT5FP3ZANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

JessicaMeixner-NOAA commented 2 years ago

Yes, all the components are using the same number of threads, and the simulation slows down which I would not expect.

Yes, the PET log files show that memory is increasing during the integration. You can find that for example here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try03 for an atm pet, a write group pet, and ocean. Ice and wave do not have any memory information available. That is a 2 thread job.

The 4 thread run directory can be seen here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thr4/DATAROOT/testthr4/2013040100/gfs/fcst.25077 with log file here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thr4/COMROOT/testthr4/logs/2013040100/gfs.forecast.highres.log which only got to day 12 before being killed because the 8 hour wall clock is over.

JessicaMeixner-NOAA commented 2 years ago

I was able to run a successful 35 day run (the same as the canned case on hera, but through the workflow) on Orion. I did try to just update to the most recent version of ufs-weather-model on hera, and confirmed that also is dying with SIGTERM errors.

JessicaMeixner-NOAA commented 2 years ago

I ran a test where I set FHMAX=840 (my way of turning off I/O for the atm model) and the model still failed at day 18 (the first run died with a failed node also on day 18).

Based on suggestions from the coupling tag-up, the next steps I will try will be to: -- Turn off waves -- Turn debug on (without waves) -- Run CMEPS on different tasks -- Turn on/off different recently added options from p7c that were not in p7b -- Run 1 thread

All other suggestions are welcome. I'll report on results as I get them.

JessicaMeixner-NOAA commented 2 years ago

As expected, running with 1 thread we only got through 6 days of simulation: Rundir: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thread1/DATAROOT/thread01/2013040100/gfs/fcst.207953 log: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thread1/COMROOT/thread01/logs/2013040100/gfs.forecast.highres.log

The run without waves is still running, Rundir: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/nowave/DATAROOT/nowave02/2013040100/gfs/fcst.154732 log: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/nowave/COMROOT/nowave02/logs/2013040100/gfs.forecast.highres.log

Running with different atm physics settings (most of the jobs are still in the queue): With lheatstrg and lseaspray set to false: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try08nolheat With do_ca false: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try07noca Without MERRA2: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try06nomerra

A job running with debug is in the queue. I'll post more updates when I have them.

JessicaMeixner-NOAA commented 2 years ago

The run turning do_ca=false succeeded in running 35 days, all my other tests so far have failed. In the log files with do_ca=true, there are lots of statements such as:

 192:  CA cubic mosaic domain decomposition
 192: whalo =    1, ehalo =    1, shalo =    1, nhalo =    1
 192:   X-AXIS =  320 320 320 320 320 320 320 320 320 320 320 320
 192:   Y-AXIS =  240 240 240 240 240 240 240 240 240 240 240 240 240 240 240 240

However, if you look at the log file for "domain decomposition" this is only written once for different "MOM" and "Cubic" variables. I'm trying to see if I can add memory profile statements to see if this is an issue or not but could this maybe be only done once for ca @lisa-bengtsson? Any other ideas of where we might have memory leaks with do_ca=true?

lisa-bengtsson commented 2 years ago

Sorry, I have not seen that before, did the debug run indicate anything? It is great if you could add memory profile statements, the halo exchange is in update_ca.F90 in the routine evolve_ca_sgs, that could be a start perhaps?

lisa-bengtsson commented 2 years ago

The routine is called update_cells_sgs inside update_ca.F90.

lisa-bengtsson commented 2 years ago

My suspicion is that in cellular_automata_sgs.F90 I set up this higher resolution CA domain:

!Get CA domain
call define_ca_domain(domain,domain_ncellx,ncells,nxncells,nyncells) call mpp_get_data_domain (domain_ncellx,isdnx,iednx,jsdnx,jednx) call mpp_get_compute_domain (domain_ncellx,iscnx,iecnx,jscnx,jecnx) !write(1000+mpp_pe(),) "nxncells,nyncells: ",nxncells,nyncells
!write(1000+mpp_pe(),) "iscnx,iecnx,jscnx,jecnx: ",iscnx,iecnx,jscnx,jecnx
!write(1000+mpp_pe(),*) "isdnx,iednx,jsdnx,jednx: ",isdnx,iednx,jsdnx,jednx

nxc = iecnx-iscnx+1 nyc = jecnx-jscnx+1 nxch = iednx-isdnx+1 nych = jednx-jsdnx+1 nx_full=int(ncells,kind=8)int(npx-1,kind=8) ny_full=int(ncells,kind=8)int(npy-1,kind=8)

This is called each time-step, but only has to be called once, I will do a test where I put this in a (if first time step) condition and save the domain_ncellx information. I will get back to your shortly.

lisa-bengtsson commented 2 years ago

@JessicaMeixner-NOAA what should I look for in the log file in terms of evidence of memory leak?

yangfanglin commented 2 years ago

Lisa,

Set "print_esmf: .true." in model_configure before you run the model. Then check PET*.ESMF_LogFile to see memory usage after the run is completed. See /scratch1/NCEPDEV/stmp2/Fanglin.Yang/RUNDIRS/gfsv17_c384/2019070100/gfs/fcst.120319 as an example of an atmos-only run.

On Thu, Aug 19, 2021 at 10:47 AM lisa-bengtsson @.***> wrote:

@JessicaMeixner-NOAA https://github.com/JessicaMeixner-NOAA what should I look for in the log file in terms of evidence of memory leak?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-901977859, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKY5N2ODFVDEFOYG5H2K67DT5UKOVANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- Fanglin Yang, Ph.D. Chief, Model Physics Group Modeling and Data Assimilation Branch

NOAA/NWS/NCEP Environmental Modeling Center

https://www.emc.ncep.noaa.gov/gmb/wx24fy/fyang/ https://www.emc.ncep.noaa.gov/gmb/wx24fy/fyang/

JessicaMeixner-NOAA commented 2 years ago

In addition to setting print_esmf: .true. in model_configure, in nems.congfigure set "ProfileMemory = true" for each of the components.

Then looking at memory in the PET logs I do: grep 'Total allocated space' PET0000.ESMF_LogFile > mem.0000

junwang-noaa commented 2 years ago

Lisa, you don't need "ProfileMemory = true" for standalone atm run, but you do need to run longer time to see the memory increase (>2 days with ca turned on) .

On Thu, Aug 19, 2021 at 11:00 AM Jessica Meixner @.***> wrote:

In addition to setting print_esmf: .true. in model_configure, in nems.congfigure set "ProfileMemory = true" for each of the components.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-901988283, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI7D6TKLAWMSCNET3YXGGK3T5UL73ANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

lisa-bengtsson commented 2 years ago

For a 12 hour forecast it doesn't look like anything really changed with this update unfortunately:

In control: grep 'Total allocated space' PET000.ESMF_LogFile

20210819 142453.131 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: Total allocated space (bytes): 30389584

grep 'Total allocated space' PET149.ESMF_LogFile 20210819 142505.316 INFO PET149 Leaving FV3 ModelAdvance_phase1: - MemInfo: Total allocated space (bytes): 60189760

In updated code

grep 'Total allocated space' PET000.ESMF_LogFile 20210819 151612.819 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: Total allocated space (bytes): 30394944

grep 'Total allocated space' PET149.ESMF_LogFile 20210819 151624.023 INFO PET149 Leaving FV3 ModelAdvance_phase1: - MemInfo: Total allocated space (bytes): 61611440

Is the expectation that the total allocated space should not increase between PET000 and PET149?

I can run 3 days and see if that changes anything.

lisa-bengtsson commented 2 years ago

@junwang-noaa I saw your email about MOM6, but still thought it could be worth understanding if any memory leak in the CA can be prevented. To see this 2% increase you mentioned over 14 days, do you compare the beginning of the PET*ESMF_LogFile to the end value? It is confusing, because the time stamp is not in order? (if that is what it is in the first column?) What are 0-149 values after PET in the file name? Thanks.

junwang-noaa commented 2 years ago

Lisa, Let me clarify, in the run without CA we see slight memory increase (%2) in 35 days. But with CA turned on, the memory doubled in 35 days. So we do need to resolve the issue with CA in order to run the P7 with CA on hera. Please look at the "VmPeak" values ( the maximum amount of memory the process has used since it was started) in the forecast task PET files,e.g. PET0000.ESMF_LogFile. VmPeak should not increase during the forecast time.

On Thu, Aug 19, 2021 at 11:47 AM lisa-bengtsson @.***> wrote:

@junwang-noaa https://github.com/junwang-noaa I saw your email about MOM6, but still thought it could be worth understanding if any memory leak in the CA can be prevented. To see this 2% increase you mentioned over 14 days, do you compare the beginning of the PET*ESMF_LogFile to the end value? It is confusing, because the time stamp is not in order? (if that is what it is in the first column?) What are 0-149 values after PET in the file name? Thanks.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-902024791, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI7D6TIWWMMEOG6DUKTO2BDT5URQPANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

lisa-bengtsson commented 2 years ago

Ok, thanks for clarifying, I will have a look

jiandewang commented 2 years ago

@junwang-noaa I checked your run at /scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_34396/cpld_control_p7, don't see memory increase here 20210819 160833.911 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB 20210819 160839.352 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB 20210819 160846.099 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB 20210819 160852.398 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB 20210819 160858.618 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB 20210819 160904.881 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB 20210819 160926.508 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB 20210819 160932.089 INFO PET154 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 1329676 kB

lisa-bengtsson commented 2 years ago

@junwang-noaa I don't see VmPeak increasing, but its value is reduced with the updated code I mentioned above, Phil has a unit test working, so we will do some debug test in that stand alone version which is quicker.

Control: 20210819 153520.209 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB 20210819 153520.209 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB 20210819 153520.249 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB 20210819 153520.250 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB 20210819 153520.860 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB 20210819 153520.861 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB 20210819 153520.876 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB 20210819 153520.876 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB

Updated code 20210819 153520.336 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1260408 kB 20210819 153520.336 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB 20210819 153520.356 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB 20210819 153520.357 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1260408 kB 20210819 153520.937 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1260408 kB 20210819 153520.938 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB 20210819 153520.958 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB

junwang-noaa commented 2 years ago

Jiande and Lisa, the memory info is printed out at every time step, you can see the memory increase by comparing the VmPeak numbers at the beginning and end of the forecast run.

In the coupled C96 run (/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_34396/cpld_control_p7), for atm tasks, PET000.ESMF_LogFile, we have for 16 days:

20210819 153618.006 INFO PET000 Leaving FV3 ModelAdvance: - MemInfo: VmPeak: 1587628 kB ... ... 20210819 162837.130 INFO PET000 Leaving FV3 ModelAdvance: - MemInfo: VmPeak: 1936048 kB

For MOM6:

20210819 153633.225 INFO PET150 Leaving MOM update_ocean_model:

MemInfo: VmPeak: 1712880 kB ... 20210819 162815.506 INFO PET150 Leaving MOM update_ocean_model:
MemInfo: VmPeak: 1764020 kB

In Jessica's C384 run without CA:

20210818 211035.500 INFO PET1240 Leaving MOM Model_ADVANCE: - MemInfo: VmPeak: 2141384 kB ... 20210819 012900.523 INFO PET1240 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 2693936 kB 20210819 013216.552 INFO PET1240 Leaving MOM Model_ADVANCE: - MemInfo: VmPeak: 4103020 kB

On Thu, Aug 19, 2021 at 12:24 PM lisa-bengtsson @.***> wrote:

@junwang-noaa https://github.com/junwang-noaa I don't see VmPeak increasing, but its value is reduced with the updated code I mentioned above, Phil has a unit test working, so we will do some debug test in that stand alone version which is quicker.

Control: 20210819 153520.209 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB 20210819 153520.209 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB 20210819 153520.249 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB 20210819 153520.250 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB 20210819 153520.860 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB 20210819 153520.861 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB 20210819 153520.876 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1263984 kB 20210819 153520.876 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1263984 kB

Updated code 20210819 153520.336 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1260408 kB 20210819 153520.336 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB 20210819 153520.356 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB 20210819 153520.357 INFO PET000 Entering FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1260408 kB 20210819 153520.937 INFO PET000 Leaving FV3 ModelAdvance_phase1: - MemInfo: VmPeak: 1260408 kB 20210819 153520.938 INFO PET000 Entering FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB 20210819 153520.958 INFO PET000 Leaving FV3 ModelAdvance_phase2: - MemInfo: VmPeak: 1260408 kB

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-902054049, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI7D6TM7QXWDUNMXOVXGFYLT5UV2JANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

lisa-bengtsson commented 2 years ago

@junwang-noaa @JessicaMeixner-NOAA we can confirm that the updates I proposed fixes the CA memory leak. I ran a 35 day forecast C96 and VmPeak starts out at the same value but ends at 1413068 kB (continues to increase) and at 1256320 kB in the update (not increasing). @pjpegion also confirmed in the stand alone test that the updated code solved the memory increase. I will make a branch with this single fix - maybe you can try it Jessica? It doesn't change any baselines so maybe could get in quick again?

This was information from Phil: "I can confirm the memory leak is fixed. In the original code, the node starts off with 86GB of memory free. When I start the run, the code used 1 GB at the start, so 85 GB are free. After 2400 time-steps there is only 4.7 GB free. In the fixed code, the amount of memory available is steady at 85 GB."

junwang-noaa commented 2 years ago

Lisa, how many days have you run the tests?

On Thu, Aug 19, 2021 at 1:04 PM lisa-bengtsson @.***> wrote:

@junwang-noaa https://github.com/junwang-noaa @JessicaMeixner-NOAA https://github.com/JessicaMeixner-NOAA we can confirm that the updates I proposed fixes the CA memory leak. I ran a 35 day forecast C96 and VmPeak starts out at the same value but ends at 1413068 kB (continues to increase) and at 1256320 kB in the update (not increasing). @pjpegion https://github.com/pjpegion also confirmed in the stand alone test that the updated code solved the memory increase. I will make a branch with this single fix - maybe you can try it Jessica? It doesn't change any baselines so maybe could get in quick again?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-902086525, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI7D6TLXSX44YGJYQOAVPGLT5U2SHANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

lisa-bengtsson commented 2 years ago

@JessicaMeixner-NOAA do you think you have time to try this fix in your setup, it is only a change in one routine in the stochastic_physics submodule: https://github.com/lisa-bengtsson/stochastic_physics/tree/bugfix/memoryleak

If not I can try to checkout the whole P7c and test, but would need some help in the steps doing so.

JessicaMeixner-NOAA commented 2 years ago

@lisa-bengtsson I would be happy to test

lisa-bengtsson commented 2 years ago

@junwang-noaa I will check closer, the run timed out due to walltime, so didn't finish. I can see the VmPeak values continue to grow in the control_ca run but not in the updated run, this was atmosphere only C96.

pjpegion commented 2 years ago

In my unit test of the stochastic physics code. I run 24,000 time steps. In the version with the memory leak, The amount of memory dropped from 85 GB at the start of my test, and down to 4.7 GB after 24,000 time-steps. The code fix shows the amount of free memory steady at 85 GB. Also, it produces identical results.

junwang-noaa commented 2 years ago

Phil, thanks for the results.

On Thu, Aug 19, 2021 at 1:41 PM Phil Pegion @.***> wrote:

In my unit test of the stochastic physics code. I run 24,000 time steps. In the version with the memory leak, The amount of memory dropped from 85 GB at the start of my test, and down to 4.7 GB after 24,000 time-steps. The code fix shows the amount of free memory steady at 85 GB. Also, it produces identical results.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-902112824, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI7D6TO4OHQ66FJ474CMSLLT5U62PANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

lisa-bengtsson commented 2 years ago

@junwang-noaa it timed out after 17 days, the run directories are on Hera for the control ca:

/scratch2/BMC/rem/Lisa.Bengtsson/stmp2/Lisa.Bengtsson/FV3_RT/CA_UNIT_TEST/control_ca

and the updated run: /scratch2/BMC/rem/Lisa.Bengtsson/stmp2/Lisa.Bengtsson/FV3_RT/CA_UNIT_TEST/control_ca_memory

junwang-noaa commented 2 years ago

The results look good, thanks Lisa.

On Thu, Aug 19, 2021 at 1:51 PM lisa-bengtsson @.***> wrote:

@junwang-noaa https://github.com/junwang-noaa it timed out after 17 days, the run directories are on Hera for the control ca:

/scratch2/BMC/rem/Lisa.Bengtsson/stmp2/Lisa.Bengtsson/FV3_RT/CA_UNIT_TEST/control_ca

and the updated run:

/scratch2/BMC/rem/Lisa.Bengtsson/stmp2/Lisa.Bengtsson/FV3_RT/CA_UNIT_TEST/control_ca_memory

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/746#issuecomment-902118965, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI7D6TPLI5JJ73WNAHTYRJ3T5VAADANCNFSM5CCTWO5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

ufs-community / ufs-weather-model

UFS P7c memory issue #746

Description

To Reproduce:

Additional context

Output