metno / emep-ctm

Open Source EMEP/MSC-W model
GNU General Public License v3.0

Landuse_ml problems #29

Closed JohnJohanssonChalmers closed 5 years ago

JohnJohanssonChalmers commented 6 years ago

I seem to have no problem running EMEP with 1, 2 or 3 processes (using for instance mpiexec -np 3 Unimod), but when I try to use 4 processes or more, EMEP crashes with the following output:

 InitLanduse: nFluxVegs=            3
 Inputs.Landuse not found
 InitLanduse: Into CDF 
 RdLanduseCDF: Starting           2           1
 MapFile /misc/orsbackup/backup/Photosmog_China/modeling/simulations/EMEP/input/EMEP_MSC-W_model.rv4.15.OpenSource/input/Landuse_PS_5km_LC.nc
 MapFile /misc/orsbackup/backup/Photosmog_China/modeling/simulations/EMEP/input/EMEP_MSC-W_model.rv4.15.OpenSource/input/glc2000mCLM.nc
RdLanduseCDF:LANDUSE: found  1 .../Landuse_PS_5km_LC.nc
RdLanduseCDF:LANDUSE: found  2 .../glc2000mCLM.nc
 LandDefs DONE           33                       0.99989318987354636       -9.9000000000000000E+019
 CDFLAND_CODES:           32  :
CF                  DF                  NF                  BF                  TC                  
MC                  RC                  SNL                 GR                  MS                  
WE                  TU                  DE                  W                   ICE                 
U                   BARE                NDLF_EVGN_TMPT_TREE NDLF_EVGN_BORL_TREE NDLF_DECD_BORL_TREE 
BDLF_EVGN_TROP_TREE BDLF_EVGN_TMPT_TREE BDLF_DECD_TROP_TREE BDLF_DECD_TMPT_TREE BDLF_DECD_BORL_TREE 
BDLF_EVGN_SHRB      BDLF_DECD_TMPT_SHRB BDLF_DECD_BORL_SHRB C3_ARCT_GRSS        C3_NARC_GRSS        
C4_GRSS             CROP                
       RdLanduseCDF: SumFrac Error    0   1  17   1  17      0.0000  75  46   1   1  75  46  15.00  82.75
 lat/lon:       RdLanduseCDF: SumFrac Error    0   1  17   1  17      0.0000  75  46   1   1  75  46  15.00  82.75   15.000000000000000        82.750000000000000     
 STOP-ALL ERROR:       RdLanduseCDF: SumFrac Error    0   1  17   1  17      0.0000  75  46   1   1  75  46  15.00  82.75
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Any idea what is causing this?

avaldebe commented 6 years ago

Is this rv4_15? What is the size of your domain?

avaldebe commented 6 years ago

@gitpeterwind

The error message comes from Landuse_ml.f90:L700

          if (  sumfrac < 0.99 .or. sumfrac > 1.01 ) then
               write(unit=errmsg,fmt="(a34,5i4,f12.4,6i4,2f7.2)") & !nb len(dtxt)=13
                 dtxt//" SumFrac Error ", me,i,j,  &
                    i_fdom(i),j_fdom(j), sumfrac, limax,  ljmax, &
                       i_fdom(1), j_fdom(1), i_fdom(limax), j_fdom(ljmax), &
                         glat(i,j), glon(i,j)
               print *, trim(errmsg)

               if(abs(sumfrac-1.0)<0.2.and.abs(glat(i,j))>89.0)then
                  write(*,*)'WARNING: ',trim(errmsg),sumfrac,glat(i,j)
               else
                   write(*,*)'lat/lon: ',trim(errmsg),glat(i,j), glon(i,j)
                 call CheckStop(errmsg)
               end if
          end if

sumfrac is derived from landuse_in on the lines previous to the error message

           do lu = 1, NLand_codes
              if ( landuse_in(i,j,lu) > 0.0 ) then

                 call GridAllocate("LANDUSE",i,j,lu,NLUMAX, &
                         index_lu, maxlufound, landuse_codes, landuse_ncodes)

                     landuse_data(i,j,index_lu) = &
                       landuse_data(i,j,index_lu) + 0.01 * landuse_in(i,j,lu)
               end if
               if ( DEBUG%LANDUSE>0 .and. dbgij )  &
                       write(*,"(a15,i3,f8.4,a10,i3,f8.4)") "DEBUG Landuse ",&
                          lu, landuse_in(i,j,lu), &
                           "index_lu ", index_lu, landuse_data(i,j,index_lu)
           end do ! lu
          LandCover(i,j)%ncodes  = landuse_ncodes(i,j)
          LandCover(i,j)%codes(:) = landuse_codes(i,j,:)
          LandCover(i,j)%fraction(:)  = landuse_data(i,j,:)
          sumfrac = sum( LandCover(i,j)%fraction(:) )

landuse_in is calculated from landuse_glob on Landuse_ml.f90:L600

            if(landuse_tot(i,j)< 0.99999 ) then
              landuse_in(i,j,:)= 0.0  ! Will overwrite all PS stuff
              dbgsum = 0.0

              do ilu = 1, NLand_codes
                landuse_in(i,j,ilu) = min(1.0, landuse_glob(i,j,ilu) )
                dbgsum = dbgsum + landuse_in(i,j,ilu)
                if ( dbgij ) then
                   write(*, "(a,i3,3es15.6,1x,a)") "F4 ", ilu, &
                      landuse_in(debug_li,debug_lj,ilu), &
                      landuse_tot(debug_li,debug_lj), dbgsum,&
                      trim(Land_Codes(ilu))
                end if
              end do

            end if ! land_tot<0.9999

This looks to me like an error in the interpolation routine that reads landuse_glob on Landuse_ml.f90:L558.

          call ReadField_CDF(trim(fName),varname,& 
               landuse_tmp,1,interpol='conservative', &
               needed=.true.,debug_flag=.false.,UnDef=-9.9E19) 

          if ( ifile == 1 ) then
               landuse_in(:,:,lu) = landuse_tmp
               landuse_tot(:,:) = landuse_tot(:,:) + landuse_tmp
          else
               landuse_glob(:,:,lu) = landuse_tmp ! will merge below
          end if

What do you think?

gitpeterwind commented 6 years ago

The routines that interpolate from one lonlat grid to another are very complex, so there is always a possibility that some corner case is not handled correctly. However, we have had several cases of "landuse sumfrac errors", all of which were due to wrong inputs. In any case, we would need to reproduce the error before we can trace it back. For that we would need one day of metdata and the configfile used.

mifads commented 6 years ago

Hi Folks, just back from the dentist but working from home ... I can probably help here. John, can you point me to the directory being used? If I can't see what's wrong, I can upload your settings/files to the Norwegian systems for a closer look. Thanks.

JohnJohanssonChalmers commented 6 years ago

You can download my configfile and one day of metdata from here: https://chalmersuniversity.box.com/s/eesp6kx6t5wmfar6g9zjohnz9iv67wwk

JohnJohanssonChalmers commented 6 years ago

Hi Dave! So you were also at the dentist this morning? I'll send you an email with the directory paths on jacinth.

mifads commented 6 years ago

A general issue is that we shouldn't need the European data at all for an Asia run, and I can try to check that out (the current code was hacked together before the summer, but should be cleaned and improved anyway). That the code starts to fail with 4 or more processors sounds like a ReadField corner case, as suggested above.

gitpeterwind commented 6 years ago

I tried with the metdata and the settings from John, and ran on 4 processors without problems. I noticed that in your output

LandDefs DONE           33                       0.99989318987354636       -9.9000000000000000E+019

the last number shows that no attempt was made to overwrite it with global data (I get zero there).

Somehow one of the two "if" tests failed:

 if  ( EuroFileFound .and. GlobFileFound ) then ! we need to merge
 if(landuse_tot(i,j)< 0.99999 ) then

I do not think it is an interpolation issue. Dave, if you can reproduce the error on Stallo I can find out. But maybe it would be better to try on "jacinth".

JohnJohanssonChalmers commented 6 years ago

> Is this rv4_15? What is the size of your domain?

Yes, it's rv4_15. The domain size in this case was 150x93.

mifads commented 6 years ago

Hi @gitpeterwind, I'll check on stallo or vilje first since I haven't actually used the Chalmers computers for runs yet. I am puzzled as to why John can run with 1-3 processors but not 4 or more, but I'll start by re-checking the logic of that EuroFileFound stuff.

mifads commented 6 years ago

Hi again @gitpeterwind, where is that stallo test? I just realised that my usual run.pl settings won't work for John's China domain, so I assume you have used some modrun.sh type setup?

gitpeterwind commented 6 years ago

The piece of code above with "EuroFileFound" was actually taken from an older version. The problem is either

 if(landuse_tot(i,j)< 0.99999 ) then

or landuse_glob which is wrong in

              do ilu = 1, NLand_codes
                landuse_in(i,j,ilu) = min(1.0, landuse_glob(i,j,ilu) )

gitpeterwind commented 6 years ago

~mifapw/emep/emep-mscw/run.pl

The important part is in config:

  meteo     = '/global/work/mifapw/isue29/wrfout_d01_2016-03-25_00_00_00',

gitpeterwind commented 6 years ago

(and "isue29" is not a typo, but if you write issue with two "s", it will be replaced by 00... another issue!)

JohnJohanssonChalmers commented 6 years ago

As per Dave's suggestion, I tried to exclude the European data, by changing:

  LandCoverInputs%MapFile   = 'DataDir/Landuse_PS_5km_LC.nc',
                              'DataDir/glc2000mCLM.nc',

to:

  LandCoverInputs%MapFile   = 'DataDir/glc2000mCLM.nc',

Now I get a totally different error that doesn't depend on the number of processes:

 InitLanduse: nFluxVegs=            3
 Inputs.Landuse not found
 InitLanduse: Into CDF 
 RdLanduseCDF: Starting           2           1
 MapFile /misc/orsbackup/backup/Photosmog_China/modeling/simulations/EMEP/input/EMEP_MSC-W_model.rv4.15.OpenSource/input/glc2000mCLM.nc
 MapFile NOTSET
RdLanduseCDF:LANDUSE: found  1 .../glc2000mCLM.nc
 STOP-ALL ERROR: RdLanduseCDF:LANDUSE: NOT found NOTSET
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Does this mean that EMEP can't find glc2000mCLM.nc? That's strange.

I also figured that maybe there always have to be two landuse files, so I tried this:

  LandCoverInputs%MapFile   = 'DataDir/glc2000mCLM.nc',
                              'DataDir/glc2000mCLM.nc',

Now I got this error instead (also independent of the number of processes):

Deriv:MISC SURF_ppbC_VOCYMD   VOC
 Wet deposition output: WDEP_PREC ug/m3
 Wet deposition output: WDEP_SOX mgS/m2
 Wet deposition output: WDEP_OXN mgN/m2
 Wet deposition output: WDEP_RDN mgN/m2
 Wet deposition output: WDEP_SO2 mgS/m2
 Wet deposition output: WDEP_HNO3 mgN/m2
 Derived VOC setup returns           68 vocs
    indices 
  6 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
 30 31 32 34 35 37 38 39 40 45 46 47 48 50 51 52 53 54 55 56
 57 58 68 69 70 71 72 73 74 75 76 79 80 82 84 85 86 87 88 89
 95 96 97 98 99105106107
    carbons 
  2  2  2  3  5  4  1  2  2  4  2  3  8  5 10 10 10 10  1  2
  4  2  3  4  5  2  1  2  3  5  4  4  4  1  5  5  1  4  5  5
  5  4  1  1  1  1  1  1  1  1  1  1  1  1 14  1  1  1  1  1
  1  1  1  1  1  1  1  1
 SOILNOX ispec            2

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7FD46F72DE08
#1  0x7FD46F72CF90
#2  0x7FD46EC1F4AF
#3  0x586D70 in __netcdf_ml_MOD_readfield_cdf
#4  0x43D23D in __biogenics_ml_MOD_geteurobvoc
#5  0x43EC35 in __biogenics_ml_MOD_init_bvoc
#6  0x613B00 in MAIN__ at Unimod.f90:?

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

gitpeterwind commented 6 years ago

The code is still trying to find 2 MapFiles. No syntax problem in config file?

 RdLanduseCDF: Starting 2 1

The "2" means that it is looking for two MapFiles.

Edit:

It seems it is not enough to define one file. But you could try to simply link twice to the glc file

mifads commented 6 years ago

I have the wrf meteo running on vilje now (though had to modify $GRID to GLOB, DEGREE_DAY_FACTORS to F, USE_WRF_MET_NAMES = T, emis etc.). I had problems with ForestFires (even after copying the 2016 data from stallo to vilje), but will come back to that. Tried 3 variations of cpu/processor:

##PBS -l select=4:ncpus=32:mpiprocs=32 -v MPI_MSGS_MAX=2097152,MPI_BUFS_PER_PROC=2048
##PBS -l select=1:ncpus=16:mpiprocs=16 -v MPI_MSGS_MAX=2097152,MPI_BUFS_PER_PROC=2048
#PBS -l select=1:ncpus=4:mpiprocs=4 -v MPI_MSGS_MAX=2097152,MPI_BUFS_PER_PROC=2048

and all worked, so I can't reproduce John's error there. Tomorrow I'll be in Chalmers and we can take a closer look.

JohnJohanssonChalmers commented 6 years ago

> It seems it is not enough to define one file. But you could try to simply link twice to the glc file

Yes, that is what I did, but then I ran into some other problems. Look at the end of my last post.

mifads commented 6 years ago

Hi @JohnJohanssonChalmers

Change Landuse_ml :

1) Around line 483, after:

    landuse_in  = 0.0              !***  initialise  ***
    landuse_glob  = 0.0              !***  initialise  ***

add also landuse_tot = 0.0

2) Around line 562, change:

        if ( ifile == 1 ) then
               landuse_in(:,:,lu) = landuse_tmp
               landuse_tot(:,:) = landuse_tot(:,:) + landuse_tmp

to be

        if ( ifile == 1 ) then
           where (landuse_tmp>0.0)    !Oct2017
               landuse_in(:,:,lu) = landuse_tmp
               landuse_tot(:,:) = landuse_tot(:,:) + landuse_tmp
           end where  !Oct2017

The first change is a simple initialisation that should always be done. The second is needed since the file-1 data might be undefined for the modelling area (or individual cells), as for example when running in Asia but file-1 is European. With the where statement we simply ensure that no attempt is made to use the data, and the file-2 global data will be used instead.
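
As a rough illustration of the two changes, here is a minimal stand-alone sketch. It is not the model code: the 1-D four-cell "grid", the two cover types and all numbers are made up, and only the names landuse_in, landuse_tot and the undefined value -9.9E19 follow the snippets quoted in this thread.

```fortran
program masked_merge_sketch
  implicit none
  integer, parameter :: nx = 4, nlu = 2       ! hypothetical sizes
  real, parameter    :: UNDEF = -9.9e19       ! undefined value, as in the ReadField_CDF call
  real :: file1(nx,nlu), glob(nx,nlu)         ! made-up file-1 (regional) and file-2 (global) data
  real :: landuse_in(nx,nlu), landuse_tot(nx)
  integer :: i, lu

  ! File 1 only covers cells 1-2; cells 3-4 lie outside it and are undefined.
  file1(:,1) = [ 0.6, 1.0, UNDEF, UNDEF ]
  file1(:,2) = [ 0.4, 0.0, UNDEF, UNDEF ]
  ! File 2 covers every cell.
  glob(:,1)  = [ 0.5, 0.5, 0.3, 0.0 ]
  glob(:,2)  = [ 0.5, 0.5, 0.7, 1.0 ]

  landuse_in  = 0.0
  landuse_tot = 0.0                           ! change 1: explicit initialisation

  do lu = 1, nlu
    where ( file1(:,lu) > 0.0 )               ! change 2: ignore undefined or empty cells
      landuse_in(:,lu) = file1(:,lu)
      landuse_tot      = landuse_tot + file1(:,lu)
    end where
  end do

  do i = 1, nx
    ! Same threshold as the merge test quoted earlier in the thread
    if ( landuse_tot(i) < 0.99999 ) landuse_in(i,:) = min( 1.0, glob(i,:) )
    print "(a,i2,a,2f6.2,a,f6.2)", "cell", i, "  fractions", landuse_in(i,:), &
          "  sum", sum( landuse_in(i,:) )
  end do
end program masked_merge_sketch
```

In this toy case cells 1 and 2 keep the file-1 fractions, while cells 3 and 4 have a total of zero and therefore fall back to the global fractions.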

The code still needs improvement, but try the above.

mifads commented 6 years ago

Hi @gitpeterwind @avaldebe I just git pushed the above Landuse_ml changes to dev.

JohnJohanssonChalmers commented 6 years ago

Great Dave! This seems to have solved it. I can now run on as many processors as I like.

But I still need to include the European landuse file to make it work. Specifying only glc2000mCLM.nc, or giving the same file twice, still gives errors as described above. Maybe this is something you want to look into too.

Also, just to check: The simulations that I did before this fix (using less than 4 processors) should still be ok, right? There's no reason to suspect that this bug caused any silent errors in the results?

mifads commented 6 years ago

Hi John, good that we solved one problem anyway! Yes, the code expects both files; that was just part of the hack done months ago. I need to re-write that one day, but probably not this week. About the earlier simulations, I am not sure. There is a danger that things will change (it is always hard to know with initialisation issues).

mifads commented 6 years ago

I just changed the heading so people don't think that the model has general problems with many processors. This particular case was an Asian domain running off WRF meteorology. /Dave

JohnJohanssonChalmers commented 6 years ago

It's ok that you changed the heading, but I don't think this bug was THAT specific. I had similar problems earlier when running the code for a European domain stretching just slightly outside the grid of the European landuse file. And that problem was not limited to running on multiple processes. My solution was then to use RUNDOMAIN to trim the domain to fit inside the European landuse grid.

Forgetting to initialize to 0 can cause really tricky bugs, because a lot of the time the memory will be all zeros anyway. You never know what might cause the bug to suddenly appear.
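
A tiny stand-alone sketch of the pattern, with a hypothetical array size: a local array has undefined contents until it is assigned, so an accumulator needs an explicit starting value. As far as I know, gfortran's -finit-real=snan together with -ffpe-trap=invalid can help expose reads of uninitialised local reals while debugging, but treat that as a suggestion to check against the compiler documentation.

```fortran
program init_sketch
  implicit none
  real :: landuse_tot(3,2)   ! hypothetical size; contents are undefined at this point

  ! Without this line, the sum below would start from whatever happened to be in memory.
  landuse_tot = 0.0

  landuse_tot = landuse_tot + 0.25
  print *, "all cells equal 0.25: ", all( abs(landuse_tot - 0.25) < 1.0e-6 )
end program init_sketch
```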

mifads commented 6 years ago

OK, point taken. The new title is now very general ;-) So far we haven't seen any sign of problems when running with the EECCA domain, likely as the 'PS' landcover map fills this space completely. Still, I am not 100% sure that the code was safe, and initialisation should have been done.

gitpeterwind commented 6 years ago

> > It seems it is not enough to define one file. But you could try to simply link twice to the glc file
>
> Yes, that is what I did, but then I ran into some other problems. Look at the end of my last post.

The GetEuroBVOC routine in Biogenics_ml.f90 needs CF, DF, NF and BF to be defined. Those are defined by Landuse_PS_5km_LC.nc. You can avoid the error by writing (line 289):

ibvoc = find_index( VegName(iveg), LandDefs(:)%code )
if( ibvoc<0 ) cycle
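
To show the effect of that guard outside the model, here is a minimal sketch. The function find_code is a hypothetical stand-in for find_index, which is assumed here to return a negative value when there is no match, and the land-cover table is made up.

```fortran
program skip_missing_code_sketch
  implicit none
  character(len=4), parameter :: VegName(4)   = [ "CF  ", "DF  ", "NF  ", "BF  " ]
  ! Made-up land-cover table in which only NF of the four codes is present
  character(len=4), parameter :: LandCodes(3) = [ "CROP", "BARE", "NF  " ]
  integer :: iveg, ibvoc

  do iveg = 1, size(VegName)
    ibvoc = find_code( VegName(iveg), LandCodes )
    if ( ibvoc < 0 ) then                     ! code not defined: skip instead of indexing with -1
      print *, trim(VegName(iveg)), ": not defined, skipped"
      cycle
    end if
    print *, trim(VegName(iveg)), ": index", ibvoc
  end do

contains

  ! Hypothetical stand-in for find_index: position of wanted in list, or -1 if absent
  integer function find_code( wanted, list ) result(idx)
    character(len=*), intent(in) :: wanted, list(:)
    integer :: n
    idx = -1
    do n = 1, size(list)
      if ( trim(list(n)) == trim(wanted) ) then
        idx = n
        return
      end if
    end do
  end function find_code

end program skip_missing_code_sketch
```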

avaldebe commented 6 years ago

This issue is labelled as solved. Why is it still open?

gitpeterwind commented 6 years ago

There are still small problems: you cannot specify only one landuse for example. Also the glc2000 should be updated with glc2015 (Dave is working on it)

gitpeterwind commented 6 years ago

And while commenting on landuse improvements: Landuse_PS_5km_LC.nc takes a long time to read for fine resolutions. It is also slow when the rundomain covers a region outside Europe. A simple test could accelerate this. (A temporary fix is to specify the glc2000 file twice in the config file.)
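
One possible form of such a test, sketched under assumptions: compare the lon/lat bounding box of the run domain with that of the file and skip the read when they do not overlap. The extents below are invented for illustration, and the check ignores longitude wrap-around and non-lonlat projections.

```fortran
program overlap_sketch
  implicit none
  ! Hypothetical file extent in degrees (min, max); real values would be read from the file
  real, parameter :: file_lon(2) = [ -30.0, 60.0 ], file_lat(2) = [ 30.0, 80.0 ]

  print *, "Europe-like domain overlaps: ", overlaps( [ -10.0, 40.0 ], [ 35.0, 70.0 ] )
  print *, "Asia-like domain overlaps:   ", overlaps( [ 70.0, 140.0 ], [ 10.0, 55.0 ] )

contains

  ! True if the two lon/lat boxes share any area
  logical function overlaps( dom_lon, dom_lat )
    real, intent(in) :: dom_lon(2), dom_lat(2)
    overlaps = dom_lon(1) <= file_lon(2) .and. dom_lon(2) >= file_lon(1) .and. &
               dom_lat(1) <= file_lat(2) .and. dom_lat(2) >= file_lat(1)
  end function overlaps

end program overlap_sketch
```

With these made-up extents the first domain overlaps the file and the second does not, so only the first would need the slow read.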

avaldebe commented 5 years ago

As far as I can tell, this was addressed in rv4_32. Please reopen if necessary.