Closed: JohnJohanssonChalmers closed this issue 5 years ago
Is this rv4_15? What is the size of your domain?
@gitpeterwind
The error message comes from Landuse_ml.f90:L700
if ( sumfrac < 0.99 .or. sumfrac > 1.01 ) then
   write(unit=errmsg,fmt="(a34,5i4,f12.4,6i4,2f7.2)") & !nb len(dtxt)=13
      dtxt//" SumFrac Error ", me,i,j, &
      i_fdom(i),j_fdom(j), sumfrac, limax, ljmax, &
      i_fdom(1), j_fdom(1), i_fdom(limax), j_fdom(ljmax), &
      glat(i,j), glon(i,j)
   print *, trim(errmsg)
   if(abs(sumfrac-1.0)<0.2.and.abs(glat(i,j))>89.0)then
      write(*,*)'WARNING: ',trim(errmsg),sumfrac,glat(i,j)
   else
      write(*,*)'lat/lon: ',trim(errmsg),glat(i,j), glon(i,j)
      call CheckStop(errmsg)
   end if
end if
sumfrac is derived from landuse_in on the lines preceding the error message:
do lu = 1, NLand_codes
   if ( landuse_in(i,j,lu) > 0.0 ) then
      call GridAllocate("LANDUSE",i,j,lu,NLUMAX, &
         index_lu, maxlufound, landuse_codes, landuse_ncodes)
      landuse_data(i,j,index_lu) = &
         landuse_data(i,j,index_lu) + 0.01 * landuse_in(i,j,lu)
   end if
   if ( DEBUG%LANDUSE>0 .and. dbgij ) &
      write(*,"(a15,i3,f8.4,a10,i3,f8.4)") "DEBUG Landuse ",&
         lu, landuse_in(i,j,lu), &
         "index_lu ", index_lu, landuse_data(i,j,index_lu)
end do ! lu
LandCover(i,j)%ncodes = landuse_ncodes(i,j)
LandCover(i,j)%codes(:) = landuse_codes(i,j,:)
LandCover(i,j)%fraction(:) = landuse_data(i,j,:)
sumfrac = sum( LandCover(i,j)%fraction(:) )
landuse_in is calculated from landuse_glob at Landuse_ml.f90:L600:
if(landuse_tot(i,j)< 0.99999 ) then
   landuse_in(i,j,:)= 0.0 ! Will overwrite all PS stuff
   dbgsum = 0.0
   do ilu = 1, NLand_codes
      landuse_in(i,j,ilu) = min(1.0, landuse_glob(i,j,ilu) )
      dbgsum = dbgsum + landuse_in(i,j,ilu)
      if ( dbgij ) then
         write(*, "(a,i3,3es15.6,1x,a)") "F4 ", ilu, &
            landuse_in(debug_li,debug_lj,ilu), &
            landuse_tot(debug_li,debug_lj), dbgsum,&
            trim(Land_Codes(ilu))
      end if
   end do
end if ! land_tot<0.9999
This looks to me like an error in the interpolation routine that reads landuse_glob at Landuse_ml.f90:L558:
call ReadField_CDF(trim(fName),varname,&
   landuse_tmp,1,interpol='conservative', &
   needed=.true.,debug_flag=.false.,UnDef=-9.9E19)
if ( ifile == 1 ) then
   landuse_in(:,:,lu) = landuse_tmp
   landuse_tot(:,:) = landuse_tot(:,:) + landuse_tmp
else
   landuse_glob(:,:,lu) = landuse_tmp ! will merge below
end if
What do you think?
The interpolation routines that interpolate from one lon-lat grid to another lon-lat grid are very complex, which means there is always the possibility that some corner case is not handled correctly. However, we have had several cases of "landuse sumfrac errors", and all of them turned out to be due to wrong inputs. In any case, it would be necessary to reproduce the error before being able to trace this back. We would need one day of metdata and the configfile used.
Hi Folks, just back from the dentist but working from home ... I can probably help here. John, can you point me to the directory being used? If I can't see what's wrong, I can upload your settings/files to the Norwegian systems for a closer look. Thanks.
You can download my configfile and one day of metdata from here: https://chalmersuniversity.box.com/s/eesp6kx6t5wmfar6g9zjohnz9iv67wwk
Hi Dave! So you were also at the dentist this morning? I'll send you an email with the directory paths on jacinth.
A general issue is that we shouldn't need the European data at all for an Asia run, and I can try to check that out (the current code was hacked together before the summer, but should be cleaned and improved anyway). Why the code starts to fail with 4 or more processors sounds like a ReadField corner case, as suggested above.
I tried with the metdata and the settings from John, and ran on 4 processors without problems. I noticed that in your output
LandDefs DONE 33 0.99989318987354636 -9.9000000000000000E+019
the last number shows that no attempt has been made to overwrite it with global data (I get zero there).
Somehow one of the two "if" tests failed:
if ( EuroFileFound .and. GlobFileFound ) then ! we need to merge
if(landuse_tot(i,j)< 0.99999 ) then
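As a minimal standalone sketch (assumed names and values, not the model code): if either of those tests fails, the UnDef fill value that ReadField_CDF put into cells outside the European map is never overwritten, which is consistent with seeing -9.9E+19 in the diagnostics:
program undef_survives
  implicit none
  real, parameter :: UnDef = -9.9e19
  real :: landuse_tot(1,1)
  logical :: EuroFileFound, GlobFileFound

  landuse_tot = UnDef        ! nothing usable read for this cell from file 1
  EuroFileFound = .true.
  GlobFileFound = .false.    ! assume, e.g., the second MapFile slot stayed NOTSET

  if ( EuroFileFound .and. GlobFileFound ) then
     landuse_tot = 1.0       ! the merge with global data never happens
  end if

  print *, 'landuse_tot = ', landuse_tot(1,1)   ! still the fill value
end program undef_survives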
I do not think it is an interpolation issue. Dave, if you can reproduce the error on Stallo I can find out. But maybe it would be better to try on "jacinth".
Is this rv4_15? What is the size of your domain?
Yes, it's rv4_15. The domain size in this case was 150x93.
Hi @gitpeterwind , I'll check on stallo or vilje first since I haven't actually used the Chalmers computers for runs yet. I am puzzled as to why John can run with 1-3 processors but not 4 or more, but I'll start by re-checking the logic of that EuroFileFound stuff.
Hi again @gitpeterwind , where is that stallo test? I just realised that my usual run.pl settings won't work for John's China domain, so I assume you have used some modrun.sh type setup?
The piece of code above with "EuroFileFound" was actually taken from an older version. The problem is either
if(landuse_tot(i,j)< 0.99999 ) then
or landuse_glob which is wrong in
do ilu = 1, NLand_codes
landuse_in(i,j,ilu) = min(1.0, landuse_glob(i,j,ilu) )
~mifapw/emep/emep-mscw/run.pl
The important part is in config:
meteo = '/global/work/mifapw/isue29/wrfout_d01_2016-03-25_00_00_00',
(and "isue29" is not a typo, but if you write issue with two "s", it will be replaced by 00... another issue!)
As per Dave's suggestion, I tried to exclude the European data, by changing:
LandCoverInputs%MapFile = 'DataDir/Landuse_PS_5km_LC.nc',
'DataDir/glc2000mCLM.nc',
to:
LandCoverInputs%MapFile = 'DataDir/glc2000mCLM.nc',
Now I get a totally different error that doesn't depend on the number of processes:
InitLanduse: nFluxVegs= 3
Inputs.Landuse not found
InitLanduse: Into CDF
RdLanduseCDF: Starting 2 1
MapFile /misc/orsbackup/backup/Photosmog_China/modeling/simulations/EMEP/input/EMEP_MSC-W_model.rv4.15.OpenSource/input/glc2000mCLM.nc
MapFile NOTSET
RdLanduseCDF:LANDUSE: found 1 .../glc2000mCLM.nc
STOP-ALL ERROR: RdLanduseCDF:LANDUSE: NOT found NOTSET
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Does this mean that EMEP can't find glc2000mCLM.nc? That's strange.
I also figured maybe there always have to be two landuse files, so I tried this:
LandCoverInputs%MapFile = 'DataDir/glc2000mCLM.nc',
'DataDir/glc2000mCLM.nc',
Now I got this error instead (also independent of the number of processes):
Deriv:MISC SURF_ppbC_VOCYMD VOC
Wet deposition output: WDEP_PREC ug/m3
Wet deposition output: WDEP_SOX mgS/m2
Wet deposition output: WDEP_OXN mgN/m2
Wet deposition output: WDEP_RDN mgN/m2
Wet deposition output: WDEP_SO2 mgS/m2
Wet deposition output: WDEP_HNO3 mgN/m2
Derived VOC setup returns 68 vocs
indices
6 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32 34 35 37 38 39 40 45 46 47 48 50 51 52 53 54 55 56
57 58 68 69 70 71 72 73 74 75 76 79 80 82 84 85 86 87 88 89
95 96 97 98 99 105 106 107
carbons
2 2 2 3 5 4 1 2 2 4 2 3 8 5 10 10 10 10 1 2
4 2 3 4 5 2 1 2 3 5 4 4 4 1 5 5 1 4 5 5
5 4 1 1 1 1 1 1 1 1 1 1 1 1 14 1 1 1 1 1
1 1 1 1 1 1 1 1
SOILNOX ispec 2
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7FD46F72DE08
#1 0x7FD46F72CF90
#2 0x7FD46EC1F4AF
#3 0x586D70 in __netcdf_ml_MOD_readfield_cdf
#4 0x43D23D in __biogenics_ml_MOD_geteurobvoc
#5 0x43EC35 in __biogenics_ml_MOD_init_bvoc
#6 0x613B00 in MAIN__ at Unimod.f90:?
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
The code is still trying to find 2 MapFiles. No syntax problem in config file?
RdLanduseCDF: Starting 2 1
The "2" means that it is looking for two MapFiles.
Edit:
It seems it is not enough to define one file. But you could try to simply link twice to the glc file
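As a hedged sketch of what the log suggests (the array name and size below are assumptions based on "RdLanduseCDF: Starting 2 1" and "MapFile NOTSET", not the actual RdLanduseCDF code): the code keeps two MapFile slots, and a config with a single entry leaves the second slot at its 'NOTSET' default, which then fails the needed read:
program notset_slot
  implicit none
  integer, parameter :: NMAPFILES = 2
  character(len=200) :: MapFile(NMAPFILES) = 'NOTSET'
  integer :: ifile

  ! a config with only one entry fills MapFile(1) and leaves MapFile(2) untouched
  MapFile(1) = 'DataDir/glc2000mCLM.nc'

  do ifile = 1, NMAPFILES
     if ( trim(MapFile(ifile)) == 'NOTSET' ) then
        print *, 'ERROR: RdLanduseCDF: NOT found ', trim(MapFile(ifile))
        stop
     end if
     print *, 'found ', trim(MapFile(ifile))
  end do
end program notset_slot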
I have the wrf meteo running on vilje now (though I had to modify $GRID to GLOB, DEGREE_DAY_FACTORS to F, USE_WRF_MET_NAMES = T, emis, etc.). I had problems with ForestFires (even after copying the 2016 data from stallo to vilje), but will come back to that. Tried 3 variations of cpu/processor settings:
##PBS -l select=4:ncpus=32:mpiprocs=32 -v MPI_MSGS_MAX=2097152,MPI_BUFS_PER_PROC=2048
##PBS -l select=1:ncpus=16:mpiprocs=16 -v MPI_MSGS_MAX=2097152,MPI_BUFS_PER_PROC=2048
#PBS -l select=1:ncpus=4:mpiprocs=4 -v MPI_MSGS_MAX=2097152,MPI_BUFS_PER_PROC=2048
and all worked, so I can't reproduce John's error there. Tomorrow I'll be in Chalmers and we can take a closer look.
It seems it is not enough to define one file. But you could try to simply link twice to the glc file
Yes, that is what I did, but then I ran into some other problems. Look at the end of my last post.
Hi @JohnJohanssonChalmers
Change Landuse_ml:
1) Around line 483, next to
landuse_in = 0.0 !*** initialise ***
landuse_glob = 0.0 !*** initialise ***
also add landuse_tot = 0.0
2) and around line 562, change:
if ( ifile == 1 ) then
   landuse_in(:,:,lu) = landuse_tmp
   landuse_tot(:,:) = landuse_tot(:,:) + landuse_tmp
to be
if ( ifile == 1 ) then
   where (landuse_tmp>0.0) !Oct2017
      landuse_in(:,:,lu) = landuse_tmp
      landuse_tot(:,:) = landuse_tot(:,:) + landuse_tmp
   end where !Oct2017
The first change is a simple initialisation that should always be done. The second is needed since the file-1 data might be undefined for the modelling area (or for individual cells), for example when running in Asia while file 1 is European. With the where statement we simply ensure that no attempt is made to use the undefined data, and the file-2 global data will be used instead.
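To see why the mask matters, here is a tiny standalone illustration (made-up values, not the model code): without the where, cells where file 1 returned the UnDef fill value would be copied and accumulated as huge negative numbers.
program where_mask_demo
  implicit none
  real, parameter :: UnDef = -9.9e19
  real :: landuse_tmp(3), landuse_in(3), landuse_tot(3)

  landuse_tmp = (/ 0.6, UnDef, 0.4 /)   ! cell 2 lies outside the Euro map
  landuse_in  = 0.0
  landuse_tot = 0.0                     ! change 1: always initialise

  where ( landuse_tmp > 0.0 )           ! change 2: only use defined data
     landuse_in  = landuse_tmp
     landuse_tot = landuse_tot + landuse_tmp
  end where

  print *, 'landuse_tot = ', landuse_tot  ! cell 2 stays 0.0, ready for global data
end program where_mask_demo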
The code still needs improvement, but try the above.
Hi @gitpeterwind @avaldebe I just git pushed the above Landuse_ml changes to dev.
Great Dave! This seems to have solved it. I can now run on as many processors as I like.
But I still need to include the European landuse file to make it work. Specifying only glc2000mCLM.nc
or giving the same file twice still gives errors as described above. Maybe this is something you want to look into too.
Also, just to check: The simulations that I did before this fix (using less than 4 processors) should still be ok, right? There's no reason to suspect that this bug caused any silent errors in the results?
Hi John, good we solved one problem anyway! Yeh, the code expects both files; that was just part of the hack done months ago. I need to re-write that one day, but probably not this week. About the earlier simulations, I am not sure. There is a danger that things will change (it is always hard to know with initialisation issues).
I just changed the heading so people don't think that the model has general problems with many processors. This particular case was an Asian domain running off WRF meteorology. /Dave
It's ok that you changed the heading, but I don't think this bug was THAT specific. I had similar problems earlier when running the code for a European domain stretching just slightly outside the grid of the European landuse file. And that problem was not limited to running on multiple processes. My solution was then to use RUNDOMAIN to trim the domain to fit inside the European landuse grid.
A missing initialisation to 0 can be a really tricky bug, because a lot of the time memory will be all zeros anyway. You never know what might cause the bug to suddenly appear.
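A toy example of the point (nothing to do with the EMEP code itself):
program init_matters
  implicit none
  real, allocatable :: tot(:)

  allocate(tot(4))
  ! Without the next line the contents of tot are undefined: often zero in
  ! practice, which hides the bug, but not guaranteed by the standard.
  tot = 0.0
  tot = tot + (/ 0.25, 0.25, 0.25, 0.25 /)
  print *, 'sum = ', sum(tot)   ! deterministic only because tot was initialised
end program init_matters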
OK, point taken. The new title is now very general ;-) So far we haven't seen any sign of problems when running with the EECCA domain, likely as the 'PS' landcover map fills this space completely. Still, I am not 100% sure that the code was safe, and initialisation should have been done.
It seems it is not enough to define one file. But you could try to simply link twice to the glc file
Yes, that is what I did, but then I ran into some other problems. Look at the end of my last post.
The GetEuroBVOC routine in Biogenics_ml.f90 needed CF, DF, NF and BF to be defined. Those are defined by Landuse_PS_5km_LC.nc. To avoid the error, you can write (line 289):
ibvoc = find_index( VegName(iveg), LandDefs(:)%code )
if( ibvoc<0 ) cycle
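For future readers, a hedged, self-contained sketch of the failure mode (my_find_index and the code lists below are stand-ins, not the model code): when a vegetation code is missing from LandDefs, find_index returns a negative value, and using that as an array index is an out-of-bounds access of the kind that produced the segfault in GetEuroBVOC; the cycle guard skips those classes.
program cycle_on_missing
  implicit none
  character(len=2), parameter :: VegName(4)       = (/ 'CF', 'DF', 'NF', 'BF' /)
  character(len=2), parameter :: LandDefs_code(4) = (/ 'GR', 'CR', 'WE', 'DE' /)
  real    :: emis(4)
  integer :: iveg, ibvoc

  emis = 0.0
  do iveg = 1, size(VegName)
     ibvoc = my_find_index( VegName(iveg), LandDefs_code )
     if ( ibvoc < 0 ) cycle      ! the added guard: skip undefined classes
     emis(ibvoc) = 1.0           ! without the guard this would use a negative index
  end do
  print *, 'finished without out-of-bounds access'

contains

  integer function my_find_index(code, list) result(idx)
    character(len=*), intent(in) :: code, list(:)
    integer :: k
    idx = -1
    do k = 1, size(list)
       if ( list(k) == code ) idx = k
    end do
  end function my_find_index
end program cycle_on_missing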
This issue is labelled as solved. Why is it still open?
There are still small problems: you cannot specify only one landuse file, for example. Also, glc2000 should be updated to glc2015 (Dave is working on it).
And while commenting on landuse improvements: Landuse_PS_5km_LC.nc takes a long time to read at fine resolutions. It is also slow when the rundomain covers a region outside Europe. A simple test could accelerate this. (A temporary fix is to specify glc2000 twice in the config file.)
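A sketch of the kind of simple test meant here (the bounds and names are assumptions, not the actual file metadata): compare the rundomain's lon/lat bounding box with the map file's extent and skip the expensive read when they do not overlap.
program skip_nonoverlapping_map
  implicit none
  real :: dom_lon(2), dom_lat(2)                     ! run-domain bounding box
  real, parameter :: map_lon(2) = (/ -30.0, 45.0 /)  ! assumed Euro map extent
  real, parameter :: map_lat(2) = (/  30.0, 75.0 /)

  dom_lon = (/ 70.0, 140.0 /)                        ! e.g. an Asian run domain
  dom_lat = (/ 10.0,  55.0 /)

  if ( dom_lon(1) > map_lon(2) .or. dom_lon(2) < map_lon(1) .or. &
       dom_lat(1) > map_lat(2) .or. dom_lat(2) < map_lat(1) ) then
     print *, 'no overlap with the European map: skip the expensive read'
  else
     print *, 'overlap found: read the map file'
  end if
end program skip_nonoverlapping_map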
As far as I can tell, this was addressed in rv4_32. Please reopen if necessary.
I seem to have no problem running EMEP with 1, 2 or 3 processes (using, for instance, mpiexec -np 3 Unimod), but when I try to use 4 processes or more, EMEP crashes with the following output. Any idea what is causing this?