seismology / mc_kernel

Calculate seismic sensitivity kernels using AxiSEM
GNU General Public License v3.0

Latest version crashes after creating file to store intermediate results (both gcc and intel) #18

Closed: auerl closed this issue 9 years ago

auerl commented 9 years ago

The latest version of the code crashes in master_queue.f90 after "Create file for intermediate results", at this location:

call nc_putvar_by_name(ncid = ncid_intermediate, &
varname = 'K_x', &
values = real(K_x, kind=sp) )

With the error message:

ERROR: CPU 0 could not write 2D variable: 'K_x'( 1) in NCID 65536 start ( 1) + count( 7260) is larger than size ( 100) of dimension K_x_1 (1)

This happens both with triangular meshes and voxel meshes (I haven't tried tetrahedral ones). Interestingly, the code gets past this point when compiled with the Intel Fortran compiler. With Intel, however, the code crashes with

*** Error in `./kerner': double free or corruption (!prev): 0x BB5
forrtl: error (78): process killed (SIGTERM)

after "Write mesh partition and convergence to disk" in master_queue.f90

Does anyone else encounter this issue?
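
For context, the message is NetCDF's hyperslab bounds check firing: a put fails whenever start + count - 1 exceeds the dimension length, which here means K_x was created with a first dimension of length 100 while 7260 values are being written. A minimal sketch of that check, assuming the standard netcdf-fortran API (the subroutine and its names are illustrative, not mc_kernel code):

! Illustrative sketch (not mc_kernel code) of the bounds check that trips here:
! NetCDF rejects any put where start + count - 1 exceeds the dimension length.
subroutine check_putvar_bounds(ncid, nvalues)
  use netcdf
  implicit none
  integer, intent(in) :: ncid, nvalues
  integer :: varid, dimids(2), dimlen, status

  status = nf90_inq_varid(ncid, 'K_x', varid)
  status = nf90_inquire_variable(ncid, varid, dimids=dimids)
  status = nf90_inquire_dimension(ncid, dimids(1), len=dimlen)

  ! In the error above: 7260 values are written, but K_x_1 has length 100.
  if (nvalues > dimlen) then
     print '(a,i0,a,i0)', 'put of ', nvalues, &
           ' values exceeds dimension length ', dimlen
  end if
end subroutine check_putvar_bounds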

sstaehler commented 9 years ago

No, I have not encountered it so far, even though the intermediate results are not the most stable part of the code. Would you mind posting your inparam_basic here?

auerl commented 9 years ago

Here is my inparam_basic. I am running the code on the machine in Zurich, either with intel 14.0.2, netcdf-4.3.3 (compiled with ifort) and openmpi-1.8.4, or with gcc-4.8.2, netcdf-4.1.3 and the machine's default gcc build of OpenMPI. It worked (at least with gnu) before the latest changes in the intermediate-results parts of the code. The AxiSEM version is the one from the official repository, from around the middle of last week.

# directory of the forward and backward run
#FWD_DIR              './wavefield_50s/fwd/'
#BWD_DIR              './wavefield_50s/bwd/'
FWD_DIR              './wavefield_20s/fwd/'
BWD_DIR              './wavefield_20s/bwd/'

# Paths of parameter files 
SOURCE_FILE          'CMTSOLUTION'
RECEIVER_FILE        'receiver.dat'
FILTER_FILE          'filters.dat'

# Select the mesh file type. Allowed values are
# abaqus      : .inp file, can be generated with Qubit or other codes. Can
#               contain various geometries and multiple sub-objects
#               Supported geometries: tetrahedra, triangles, quadrilaterals
#               Set file name in MESH_FILE_ABAQUS
# 
# tetrahedral : tetrahedral mesh in two separate files with 
#               1. coordinates of the vertices (MESH_FILE_VERTICES)
#               2. the connectivity of the facets of the tetrahedrons
#                  (MESH_FILE_FACETS)
MESH_FILE_TYPE       'abaqus'
MESH_FILE_ABAQUS     'unit_tests/flat_triangles.inp'
#MESH_FILE_ABAQUS     'unit_tests/vox_15l_5deg_test.dat'

#MESH_FILE_TYPE       'tetrahedral'
#MESH_FILE_VERTICES   'unit_tests/vertices.TEST'
#MESH_FILE_FACETS     'unit_tests/facets.TEST'

# Prefix of output file names.
# Kernel files are called $OUTPUT_FILE_kernel.xdmf
# Wavefield movies are called $OUTPUT_FILE_wavefield.xdmf
OUTPUT_FILE          'kerner'

# Output format when dumping kernels and wavefields. 
# Choose between xdmf, Yale-style csr binary format (compressed sparse row) and
# ascii.
# Note: the allowed error below is used as the truncation threshold in
# csr and ascii storage.
DUMP_TYPE            'xdmf'

# Write out seismograms? (default: true)
# Seismograms (raw full trace, filtered full trace and cut trace) can be 
# written out. Produces three files per kernel. Disable to avoid congesting 
# your rundir.
WRITE_SEISMOGRAMS    true

# Monte Carlo integration
# Absolute and relative error limits can be defined separately. The convergence
# conditions are connected by OR
# Allowed absolute error per cell
ALLOWED_ERROR        1e-4

# Allowed relative error per cell
ALLOWED_RELATIVE_ERROR 2e-2

# Number of points on which the kernel should be evaluated per MC iteration
POINTS_PER_MC_STEP   20

# Maximum number of iterations after which to cancel Monte Carlo integration 
# in one cell, regardless of error.
MAXIMUM_ITERATIONS   1 #100

# Write detailed convergence of elements (default: false)
# Every slave writes out the values of all the kernels and their respective
# estimated errors into its OUTPUT_??? file after each MC step. This can lead
# to huge ASCII files (>1GB) with extremely long lines (approx. 20 x nkernel).
# However, it might be interesting to study the convergence behaviour. 
# When set to false, only one summary per cell is written out.
WRITE_DETAILED_CONVERGENCE  false

# Size of buffers for strain and displacement.
#  - fullfields: only strain buffer is used for chunkwise IO
#  - displ_only: displacement buffer is used for chunkwise IO and strain buffer contains
#                the strain in the GLL basis for whole elements
STRAIN_BUFFER_SIZE   1000
DISPL_BUFFER_SIZE    100

# Number of elements in each MPI task. 
ELEMENTS_PER_TASK    10 #100

# Use quasirandom numbers instead of pseudorandom ones
USE_QUASIRANDOM_NUMBERS true

# Integration scheme
# Options:
# parseval:    FFT the seismogram and the convolved wavefield and apply
#              Parseval's theorem; the trapezoidal rule is then used in the
#              frequency domain
# trapezoidal: Use the trapezoidal rule in time domain
INTEGRATION_SCHEME  parseval

# FFTW Planning to use
# Options: 
# ESTIMATE:   Use heuristic to find best FFT plan
# MEASURE:    Compute several test FFTs to find best plan (default)
# PATIENT:    Compute a lot of test FFTs to find best plan
# EXHAUSTIVE: Compute an enormous number of test FFTs to find best plan
# For a detailed explanation, see: http://www.fftw.org/doc/Planner-Flags.html
FFTW_PLAN              MEASURE

# Do you want to calculate a kernel or just plot wavefields? 
# integratekernel has to be run with MPI and at least two processors
WHAT_TO_DO           'integratekernel'

# plot_wavefield has to be run in serial
#WHAT_TO_DO           'plot_wavefield'

# Do you want your kernels to be given on the vertices ('onvertices') or
# inside ('volumetric') each element?
INT_TYPE             'volumetric'
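
As an aside, the OR-connected convergence criterion in the Monte Carlo section above amounts to something like the following sketch (names and structure are illustrative, not mc_kernel's actual code):

! Illustrative sketch of the OR-connected convergence test; not mc_kernel code.
logical function cell_converged(abs_err, rel_err)
  implicit none
  real, intent(in) :: abs_err, rel_err
  real, parameter  :: allowed_error          = 1e-4
  real, parameter  :: allowed_relative_error = 2e-2
  ! Integration of a cell stops as soon as EITHER criterion is met.
  cell_converged = (abs_err <= allowed_error) .or. &
                   (rel_err <= allowed_relative_error)
end function cell_converged
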
sstaehler commented 9 years ago

I can't reproduce this problem with gcc 4.8.2 and the NetCDF version shipped with the OS.

auerl commented 9 years ago

Hmm, strange. Can you point me to the folder on our machine that contains the kerner version that works for you, as well as the wavefields you use? That should help figure out what is going on. Cheers, L.

sstaehler commented 9 years ago

Wait, I can reproduce it now. Trying to narrow it down.

sstaehler commented 9 years ago

Okay, check whether ebc5037d70cafffd97e930a66b6393e40a4caf74 fixes it.

If the rundir already contained an intermediate_results.nc, it was not deleted, and it might have contained variables with different sizes.
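
That would explain the size mismatch in the original error: reusing an old file means inheriting its old dimension definitions. One way to guarantee a clean file is to create it with NF90_CLOBBER, which truncates anything already on disk. A minimal sketch using the standard netcdf-fortran API (illustrative, and not necessarily how the commit implements the fix):

! Illustrative: NF90_CLOBBER truncates an existing intermediate_results.nc,
! so no stale dimension or variable definitions survive into the new run.
subroutine create_intermediate_file(ncid)
  use netcdf
  implicit none
  integer, intent(out) :: ncid
  integer :: status
  status = nf90_create('intermediate_results.nc', NF90_CLOBBER, ncid)
  if (status /= NF90_NOERR) print *, trim(nf90_strerror(status))
end subroutine create_intermediate_file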

auerl commented 9 years ago

Thanks! Now it works with gfortran! With ifort it cannot work, since that compiler version doesn't support "execute_command_line". On a side note, there doesn't seem to be a significant performance difference between the ifort version I use and gcc 4.8.2. I remember that you mentioned a large speedup when compiling with ifort on SuperMUC, and I think we interpreted this as the "large-value" issue going away when using Intel. So maybe this still needs to be taken care of via some intermediate manipulation of the numbers during the calculation.
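
On the execute_command_line portability point: that intrinsic is Fortran 2008, so where it is missing, a stale file can also be removed with standard Fortran I/O alone. A hypothetical helper (not from the repo) sketching that approach:

! Hypothetical portable alternative to execute_command_line('rm -f <file>'):
! open the file if it exists and close it with status='delete'.
subroutine delete_if_exists(filename)
  implicit none
  character(len=*), intent(in) :: filename
  integer :: lun, ios
  logical :: file_exists
  inquire(file=filename, exist=file_exists)
  if (file_exists) then
     open(newunit=lun, file=filename, status='old', iostat=ios)
     if (ios == 0) close(lun, status='delete')
  end if
end subroutine delete_if_exists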

auerl commented 9 years ago

It seems that something is still messed up in the output NetCDF file. Here are 20s P-wave kernels computed with the latest and last week's versions of the code (input wavefields, inparam file and compiler versions are the same):

[Attached images: kernel_int, kernel_new]

sstaehler commented 9 years ago

Yes, that is an unrelated issue that I noticed with Kasra last week. I was hoping that it was just a problem with his settings, but it is more general. You may also have noticed that only one kernel is shown in ParaView.

The cause seems to be the new output variable computation_time, which is scalar and somehow messes up 1D variables like kernel. I am on it!
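
For reference, in the netcdf-fortran API a scalar variable is defined without any dimension ids, while a 1D variable carries exactly one; the sketch below shows the distinction (illustrative definitions only, not the buggy code itself):

! Illustrative only: how a scalar and a 1D variable differ at definition time.
subroutine define_example_vars(ncid, nbasisfuncs)
  use netcdf
  implicit none
  integer, intent(in) :: ncid, nbasisfuncs
  integer :: status, dimid, varid_time, varid_kernel
  ! Scalar, e.g. computation_time: no dimids argument at all.
  status = nf90_def_var(ncid, 'computation_time', NF90_REAL, varid_time)
  ! 1D, e.g. a kernel: one dimension, passed as an array of dimension ids.
  status = nf90_def_dim(ncid, 'basisfuncs', nbasisfuncs, dimid)
  status = nf90_def_var(ncid, 'kernel', NF90_REAL, (/dimid/), varid_kernel)
end subroutine define_example_vars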

sstaehler commented 9 years ago

Unfortunately, this is not fixed yet by 8299e5bf4740d1b487a47f2548b492c3897d3116 and later commits.