orog program fails on MacOS with gfortran, `Illegal instruction: 4`

mkavulich commented 3 years ago

I have been trying to debug this problem for a while without success. Running orog on macOS Catalina (10.15.7), compiled with gfortran 9.3.0 (GNU Fortran (MacPorts gcc9 9.3.0_4) 9.3.0), fails with the message `Illegal instruction: 4`. The failure occurs with or without DEBUG settings turned on; compilation flags seem to make no difference in behavior.

This is the end of the output leading up to failure:

  Before GICE ZAVG(1,2)=        2801           1
  Before GICE ZAVG(1,12)=        2804           1
  Before GICE ZAVG(1,52)=        2789           1
  Before GICE ZAVG(1,112)=           0           1
  GICE 30" Antarctica RAMP orog 43200x3616 read OK
  Processing! 
  Processing! 
  Processing! 
  Ocean lsm file Opened OK: mskocn,istat=           1           0
  Rd fail: Ocean lsm - continue, mskocn,ios=           0          -1
 outgrid=/Volumes/d1/workdir/UFS/one_more_macos_try/ufs-srweather-app/experiment_DEBUG_FLAGS/test_GSD_HRRR_AK_50km/fix_lam/C201_grid.tile7.halo6.nc
 Read the grid from file /Volumes/d1/workdir/UFS/one_more_macos_try/ufs-srweather-app/experiment_DEBUG_FLAGS/test_GSD_HRRR_AK_50km/fix_lam/C201_grid.tile7.halo6.nc
 minlat=   44.683594338329847      maxlat=   76.964315294499997     
 north pole supergrid index is            0           0
 south pole supergrid index is            0           0
  Timer 1 time=    55.631000000001222     
Illegal instruction: 4

Based on the output, the failure appears to be in the call to, or the initialization of, the MAKEMT2 subroutine, but I cannot tease out any more debugging info: the executable does not print a traceback or any other information aside from "Illegal instruction: 4".

Here is the full log; everything looks normal until the sudden failure.

I'm at a loss as to what to try next, so I'm hoping someone else can offer some suggestions or, better yet, try to reproduce the issue themselves. Let me know if you need any more info.

climbfuji commented 3 years ago

Did you see my Google Chat message yesterday? In the past I had to fix two such illegal-instruction errors with the same underlying cause: one in EMC_post and another in ccpp-physics (sfcsub.F).

mkavulich commented 3 years ago

@climbfuji Thanks for the reply; I did see your suggestion yesterday and the link (https://github.com/NOAA-EMC/EMC_post/pull/81/files), but I didn't see any character definitions in the beginning of that subroutine. The error must be occurring before the print statement, so it has to be one of these lines unless I'm completely off-base:

     1 GLAT,IM,JM,IMN,JMN,lon_c,lat_c)
      implicit none
      real, parameter :: D2R = 3.14159265358979/180.
      integer, parameter :: MAXSUM=20000000
      real  hgt_1d(MAXSUM)
      integer IM, JM, IMN, JMN
      real GLAT(JMN), GLON(IMN)
      INTEGER ZAVG(IMN,JMN),ZSLM(IMN,JMN)
      real land_frac(IM,JM)
      real ORO(IM,JM),SLM(IM,JM),VAR(IM,JM),VAR4(IM,JM)
      integer IST,IEN,JST, JEN
      real lon_c(IM+1,JM+1), lat_c(IM+1,JM+1)
      INTEGER mskocn,isave
      LOGICAL FLAG, DEBUG
      real    LONO(4),LATO(4),LONI,LATI
      real    HEIGHT
      integer JM1,i,j,nsum,ii,jj,i1,numx,i2
      integer ilist(IMN)
      real    DELXN,XNSUM,XLAND,XWATR,XL1,XS1,XW1,XW2,XW4
!jaa
      real :: xnsum_j,xland_j,xwatr_j
      logical inside_a_polygon

As far as I can tell, all of those are pretty standard (if very messy) Fortran variable declarations. The only odd thing I noticed at the beginning of that subroutine was the use of "1" as a continuation character, but I changed it to "&" and still got the same error. (And from further reading, apparently that is perfectly fine by Fortran 77 standards.)

I'm curious if you have more info about the exact conditions that caused the errors you fixed; from what I can tell it's related to setting variables as allocatable by using "*"...is that correct?

climbfuji commented 3 years ago

I guess it has to do with explicit dimensions instead of allocating those arrays.

Question: do you use `ulimit -S -s unlimited` on your Mac? That bumps the stack up from 8MB to 65MB (the maximum allowed).

If I had to guess I'd say it is the line

real  hgt_1d(MAXSUM)

but you could try making all of these arrays allocatable and, if that fixes the problem, reverting them one by one until it breaks again.
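For anyone following along, here is a minimal sketch of that kind of change. It assumes free-form source and a hypothetical wrapper (`hgt_alloc_sketch` and `demo` are made-up names); only `hgt_1d` and `MAXSUM` come from the declarations quoted above. With default 4-byte reals, 20,000,000 elements need about 80 MB (twice that if reals are promoted to 8 bytes), which is more than the 65MB stack ceiling mentioned above. An allocatable array is placed on the heap, so it sidesteps the limit in any build where the compiler would otherwise put the fixed-size array on the stack:

```fortran
program hgt_alloc_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  call demo()
contains
  ! Hypothetical stand-in for the start of MAKEMT2; not the real routine.
  subroutine demo()
    integer, parameter :: MAXSUM = 20000000
    ! Original form: real hgt_1d(MAXSUM)   (fixed-size local array)
    ! Sketch form: allocatable, so the storage comes from the heap.
    real, allocatable :: hgt_1d(:)
    integer :: ierr

    ! Report how many bytes the array needs for the default real kind.
    print *, 'bytes needed:', int(MAXSUM, int64) * storage_size(1.0) / 8

    allocate(hgt_1d(MAXSUM), stat=ierr)
    if (ierr /= 0) error stop 'allocation of hgt_1d failed'

    ! Touch the whole array so the memory is really used.
    hgt_1d = 0.0
    hgt_1d(MAXSUM) = 1.0
    print *, 'last element:', hgt_1d(MAXSUM)

    ! Optional: a local allocatable is freed automatically on return.
    deallocate(hgt_1d)
  end subroutine demo
end program hgt_alloc_sketch
```

Checking `stat=` on the `allocate` turns an out-of-memory condition into a readable error instead of a bare crash.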

mkavulich commented 3 years ago

@climbfuji Great guess, changing that one line seems to have fixed the issue! I'll open a PR once I am sure there are no other areas that need fixing in UFS_UTILS.

climbfuji commented 3 years ago

Hah! Thanks for trying. Since the original code is valid Fortran, you may want to use CPP directives so that the allocate syntax is used only on macOS, but it's best to discuss this with @GeorgeGayno-NOAA.
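A rough sketch of what such a guard could look like. It assumes the file is run through the preprocessor (for example with `gfortran -cpp`, or by giving it a `.F90` extension) and that the compiler predefines `__APPLE__` on macOS, which is worth verifying for the toolchain in use. The program and subroutine names are hypothetical, and this is not the actual UFS_UTILS change:

```fortran
program cpp_alloc_sketch
  implicit none
  call demo()
contains
  ! Hypothetical stand-in for the routine that declares hgt_1d.
  subroutine demo()
    integer, parameter :: MAXSUM = 20000000
#ifdef __APPLE__
    ! macOS build: heap-allocated array (assumes __APPLE__ is predefined).
    real, allocatable :: hgt_1d(:)
#else
    ! All other platforms: keep the original fixed-size declaration.
    real :: hgt_1d(MAXSUM)
#endif

#ifdef __APPLE__
    allocate(hgt_1d(MAXSUM))
#endif

    ! The rest of the routine is identical for both variants.
    hgt_1d = 0.0
    print *, 'hgt_1d has', size(hgt_1d), 'elements'
  end subroutine demo
end program cpp_alloc_sketch
```

The trade-off is two code paths to maintain; using the allocatable form on every platform would avoid the directives entirely.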

edwardhartnett commented 3 years ago

@kgerheiser can you add a macOS build to the GitHub Actions? That will be the beginning of testing this...

ufs-community / UFS_UTILS

orog program fails on MacOS with gfortran, `Illegal instruction: 4` #243