metno / emep-ctm

Open Source EMEP/MSC-W model
GNU General Public License v3.0

Maximum Runtime? #64

Open · douglowe opened this issue 5 years ago

douglowe commented 5 years ago

I'm trying to run EMEP for a whole year; however, I'm finding that my simulations silently fail after exactly 150 days (or 3600 hours), regardless of my start date. I can't find any obvious namelist option to control this. Could you tell me if there is any way to change this behaviour, or should I instead run the model for each month individually to avoid this problem?

gitpeterwind commented 5 years ago

We do not have any parameter that directly controls how much simulation time has passed. Could it be that some other limitation of your system is being reached (CPU time or disk space)?

douglowe commented 5 years ago

Okay - I'm not missing an obvious namelist option then.

Disk space we're fine for, and the (relative) point at which the simulation stops is exactly the same each time I've tested this (and I've used more CPU time for other simulations). I don't think it is anything external to EMEP, especially as there are no error messages thrown when it stops. This is all I get in the log file:

    current date and time: 2017-10-28 17:20:00
    current date and time: 2017-10-28 17:40:00
    Nest:write data domain1_output/EMEP_OUT_20171028.nc
    current date and time: 2017-10-28 18:00:00
    Warning: not reading all levels 4 1 SMOIS
    Warning: not reading all levels 4 3 SMOIS
    current date and time: 2017-10-28 18:20:00
    current date and time: 2017-10-28 18:40:00
    current date and time: 2017-10-28 19:00:00
    current date and time: 2017-10-28 19:20:00
    current date and time: 2017-10-28 19:40:00
    current date and time: 2017-10-28 20:00:00
    current date and time: 2017-10-28 20:20:00
    current date and time: 2017-10-28 20:40:00
    Nest:write data domain1_output/EMEP_OUT_20171028.nc
    current date and time: 2017-10-28 21:00:00
    MetRd:reading ../wrf_meteo/EMEP_grid/wrfout_d01_2017-10-29_00
    Warning: not reading all levels 4 1 SMOIS
    Warning: not reading all levels 4 3 SMOIS

Are there any namelist debug flags you would recommend I switch on, to see if I can get more diagnostics for this?

gitpeterwind commented 5 years ago

It is not easy to test something that happens this far into the run. Maybe @avaldebe has some ideas? It may be useful to compile with traceback options, so that we get the position in the code where it is failing. For Intel we use these extensive flags: -check all -check noarg_temp_created -debug-parameters all -traceback -ftrapuv -g -fpe0 -fp-stack-check. (PS: I will not have time to look into this before October.)

douglowe commented 5 years ago

Sure thing - I'll try compiling with traceback flags, and will let you know if this throws up anything (and will work around the limitation for the moment).

avaldebe commented 5 years ago

Hi @douglowe

Please take into account that the debugging flags will make the code run considerably slower. If your problem is CPU time, you should then crash earlier in the simulation.

gitpeterwind commented 5 years ago

(That is why I did not include the -O0 flag! You should still get the traceback.)

douglowe commented 5 years ago

Sure thing - I'll take the hit in compute time while I'm checking the results.

gitpeterwind commented 5 years ago

Did you find out what went wrong?

douglowe commented 5 years ago

Not yet, sorry - I've been busy with other projects since raising this issue. I should have time in October to investigate, and will let you know if I find anything.

avaldebe commented 4 years ago

@douglowe

Did you find the problem?

douglowe commented 4 years ago

I didn't, sorry. In the end I decided to run EMEP in 2-month chunks (which makes more sense operationally anyway, as we can then parallelise the work more).

avaldebe commented 4 years ago

I used to do something similar with an older version of the model. The data assimilation modules were not as well tuned as they are in our current development versions, and a whole-year run did not fit within the HPC queue limits.

At the time, I had to take care to run the chunks sequentially and create a restart file for the next chunk. Otherwise PM could be too low at the beginning of each chunk.

mifads commented 3 years ago

I just saw this. Very strange problem that I have never seen or heard of before. @mvieno often runs WRF+EMEP, I think for a year at a time?

mvieno commented 3 years ago

> I just saw this. Very strange problem that I have never seen or heard of before. @mvieno often runs WRF+EMEP, I think for a year at a time?

Yes, I routinely run EMEP-WRF for a year. This is for any domain, from global to regional, but I use a single EMEP_OUT for the full year. The only time I had a similar issue was when the NetCDF library had been compiled without the large-file feature switched on.

douglowe commented 3 years ago

Ahh - I had not checked the NetCDF library compilation settings. I'll have a look at these and see if that might have been the issue.
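
For reference, a minimal sketch of how the on-disk format of an existing output file could be checked (the file name here is just a placeholder, and this program is not part of EMEP): classic-format files are in practice limited to about 2 GiB unless 64-bit offsets or netCDF-4 are used.

```fortran
! Minimal sketch (not part of EMEP): report the on-disk format of a
! netCDF file. Classic-format files hit a ~2 GiB limit; 64-bit-offset
! and netCDF-4 files do not. The file name below is a placeholder.
program check_nc_format
  use netcdf
  implicit none
  integer :: ncid, fmt, status

  status = nf90_open("EMEP_OUT_fullyear.nc", NF90_NOWRITE, ncid)
  if (status /= NF90_NOERR) then
    print *, trim(nf90_strerror(status))
    stop 1
  end if

  status = nf90_inquire(ncid, formatNum=fmt)
  select case (fmt)
  case (NF90_FORMAT_CLASSIC)
    print *, "classic format: subject to the ~2 GiB file-size limit"
  case (NF90_FORMAT_64BIT)
    print *, "64-bit-offset format: large files supported"
  case default
    print *, "netCDF-4 based format: large files supported"
  end select

  status = nf90_close(ncid)
end program check_nc_format
```

If a file does turn out to be classic format, it could presumably be written with 64-bit offsets instead (for example by passing ior(NF90_CLOBBER, NF90_64BIT_OFFSET) to nf90_create), or the NetCDF library rebuilt with large-file support.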

douglowe commented 2 years ago

Revisiting this question - I'm now finding that my EMEP simulations for one domain fail after ~50 days. Previously I would run this domain for 2 months (+ 7 days spin-up). However, I changed the WRF output files I use to drive EMEP from 3-hourly to hourly, and now I always hit the memory limit (~192 GB) of the HPC node I'm running on. Looking back at the memory usage for my previous runs driven by 3-hourly data, I can see that it only reached ~90 GB by the end of the 2 months. Polling the memory usage during a simulation also shows that it keeps increasing as the run progresses.
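
(As an aside, one Linux-specific way to do that kind of polling from inside the code is sketched below; this routine is illustrative only and not part of EMEP - it just prints the VmRSS line from /proc/self/status.)

```fortran
! Linux-specific, illustrative only (not part of EMEP): print the
! current resident set size by scanning /proc/self/status for VmRSS.
subroutine print_rss()
  implicit none
  integer :: u, ios
  character(len=256) :: line

  open(newunit=u, file="/proc/self/status", action="read", status="old", iostat=ios)
  if (ios /= 0) return
  do
    read(u, '(A)', iostat=ios) line
    if (ios /= 0) exit
    if (line(1:6) == "VmRSS:") then
      print *, trim(line)   ! e.g. "VmRSS:   94371840 kB"
      exit
    end if
  end do
  close(u)
end subroutine print_rss
```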

I'm running EMEP release 3.44, with some local fixes for reading emission sectors (https://github.com/UoMResearchIT/emep-ctm/tree/source_UoM_CSF3_gfortran). We are reading a lot of emission data (using the UK NAEI dataset), so perhaps that contributes to the memory usage (and maybe explains why you've not encountered this problem before). Do you have any suggestions on how I might solve this? Is the best answer to use restart files and just run 1 month at a time?

gitpeterwind commented 2 years ago

Yes, this is a problem we also noticed. In early releases we opened and closed the NetCDF file for each variable; that was very slow. So we kept the file open until all variables were read. That was fast, but on some systems it caused a huge amount of memory to be used when many variables (~1000) were read. In the latest release we therefore close and reopen the file after every 200 reads(!) Primitive, but efficient. (See subroutine Emis_GetCdf in EmisGet_mod.f90.)
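
The idea is roughly as in the sketch below (simplified, with placeholder names and field dimensions - this is not the actual Emis_GetCdf code):

```fortran
! Simplified sketch of the close/reopen pattern (placeholder names and
! dimensions, not the actual Emis_GetCdf code from EmisGet_mod.f90).
subroutine read_emission_vars(fname, varnames)
  use netcdf
  implicit none
  character(len=*), intent(in) :: fname
  character(len=*), intent(in) :: varnames(:)
  integer, parameter :: REOPEN_EVERY = 200   ! reads between close/reopen cycles
  integer :: ncid, varid, i, status
  real :: field(360, 180)                    ! placeholder field size

  status = nf90_open(fname, NF90_NOWRITE, ncid)
  do i = 1, size(varnames)
    ! Closing and reopening the file periodically releases the memory
    ! that some NetCDF builds accumulate while a file stays open.
    if (mod(i, REOPEN_EVERY) == 0) then
      status = nf90_close(ncid)
      status = nf90_open(fname, NF90_NOWRITE, ncid)
    end if
    status = nf90_inq_varid(ncid, trim(varnames(i)), varid)
    if (status == NF90_NOERR) status = nf90_get_var(ncid, varid, field)
    ! ... store/accumulate field here ...
  end do
  status = nf90_close(ncid)
end subroutine read_emission_vars
```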

douglowe commented 2 years ago

Thanks for the suggestion - that sounds like a good solution to me. I'll have a look at the code in the latest release, and will backport this solution to my working copy.