idefix-code / idefix

A fast finite volume code designed to run on many architectures, such as GPU, CPU and manycores, using Kokkos.
https://idefix.readthedocs.io/

BUG: output files after #0000 written incompletely #259

Closed by birnstiel 1 month ago

birnstiel commented 2 months ago

Describe the issue:

I am running a slightly modified version of the VSI test (in 3D, different resolution, lower scale height, dumps+outputs every 10 orbits). The outputs are written fine according to the log file:

Vtk: Write file data.0002.vtk...done in 3.206488e+00 s.
Dump: Write file n 2...done in 1.527347e+00 s.

However, this is the output directory listing. Note the very small sizes of the vtk and dmp files after the first output:

6,6G  8. Sep 13:57 data.0000.vtk
8,0K  8. Sep 18:56 data.0001.vtk
8,0K  9. Sep 00:06 data.0002.vtk
7,6G  8. Sep 13:57 dump.0000.dmp
 56K  8. Sep 18:56 dump.0001.dmp
 56K  9. Sep 00:06 dump.0002.dmp

I checked the dump file: the header and coordinates appear to have been written correctly, but reading fails at the data of the first field, Vc-RHO.

Is this something you have seen before or an issue with the file system?

Error message:

No response

runtime information:

This is how the code is built in the slurm script:

module load spack/2024.04
module load cmake/3.20.2-gcc-11.4.1 cuda/11.8.0
module load openmpi/5.0.0-gcc-11.4.1-cuda11.8

# this is for an H100
cmake $IDEFIX_DIR -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_HOPPER90=ON
make -j 2

Dump file header is Idefix 2.1.01-2f15373c Dump Data little endian.

Below I add the beginning of the log file:

-- Setting default Kokkos CXX standard to 17
-- Kokkos version: 4.3.1
-- The project name is: Kokkos
-- Using internal gtest for testing
-- Compiler Version: 11.8.89
-- kokkos_launch_compiler (/idefix/src/kokkos/bin/kokkos_launch_compiler) is enabled...
-- Using -std=c++17 for C++17 standard as feature
-- Built-in Execution Spaces:
--     Device Parallel: Kokkos::Cuda
--     Host Parallel: NoTypeDefined
--       Host Serial: SERIAL
-- 
-- Architectures:
--  HOPPER90
-- Using internal desul_atomics copy
-- Kokkos Backends: SERIAL;CUDA
-- Idefix final configuration
--     MHD:  OFF
--     MPI:  OFF
--     HDF5: OFF
--     Reconstruction: Linear
--     Precision: Double
--     Version: 2.1.01-2f15373c
--     Problem definitions: 'definitions.hpp'
-- Configuring done (1.0s)
-- Generating done (0.4s)
-- Build files have been written to: /idefix-setups/idefix_VSI
[  3%] Built target kokkossimd
[  3%] Built target AlwaysCheckGit
[  6%] Built target impl_git_version
[ 40%] Built target kokkoscore
[ 43%] Built target kokkoscontainers
[ 46%] Building CXX object CMakeFiles/idefix.dir/src/output/dump.cpp.o
[ 46%] Building CXX object CMakeFiles/idefix.dir/src/dataBlock/dumpToFile.cpp.o
[ 47%] Building CXX object CMakeFiles/idefix.dir/src/output/vtk.cpp.o
[ 49%] Building CXX object CMakeFiles/idefix.dir/src/input.cpp.o
[ 50%] Linking CXX executable idefix
[100%] Built target idefix
Starting job
I'm on Host [...].physik.uni-muenchen.de
It's now So 8. Sep 13:56:51 CEST 2024
                                  .:HMMMMHn:.  ..:n..
                                .H*'``     `'%HM'''''!x.
         :x                    x*`           .(MH:    `#h.
        x.`M                   M>        :nMMMMMMMh.     `n.
         *kXk..                XL  nnx:.XMMMMMMMMMMML   .. 4X.
          )MMMMMx              'M   `^?M*MMMMMMMMMMMM:HMMMHHMM.
          MMMMMMMX              ?k    'X ..'*MMMMMMM.#MMMMMMMMMx
         XMMMMMMMX               4:    M:MhHxxHHHx`MMx`MMMMMMMMM>
         XM!`   ?M                `x   4MM'`''``HHhMMX  'MMMMMMMM
         4M      M                 `:   *>     `` .('MX   '*MMMM'
          MX     `X.nnx..                        ..XMx`     'M*X
           ?h.    ''```^'*!Hx.     :Mf     xHMh  M**MMM      4L`
            `*Mx           `'*n.x. 4M>   :M` `` 'M    `       %
             '%                ``*MHMX   X>      !
            :!                    `#MM>  X>      `   :x
           :M                        ?M  `X     .  ..'M
           XX                       .!*X  `x   XM( MMx`h
          'M>::                        `M: `+  MMX XMM `:
          'M> M                         'X    'MMX ?MMk.Xx..
          'M> ?L                     ...:!     MMX.H**'MMMM*h
           M>  #L                  :!'`MM.    . X*`.xHMMMMMnMk.
           `!   #h.      :L           XM'*hxHMM*MhHMMMMMMMMMM'#h
           +     XMh:    4!      x   :f   MM'   `*MMMMMMMMMM%  `X
           M     Mf``tHhxHM      M>  4k xxX'      `#MMMMMMMf    `M .>
          :f     M   `MMMMM:     M>   M!MMM:         '*MMf'     'MH*
          !     Xf   'MMMMMX     `X   X>'h.`          :P*Mx.   .d*~..
        :M      X     4MMMMM>     !   X~ `Mh.      .nHL..M#'%nnMhH!'`
       XM      d>     'X`'**h     'h  M   ^'MMHH+*'`  ''''   `'**'
    %nxM>      *x+x.:. XL.. `k     `::X
:nMMHMMM:.  X>  Mn`*MMMMMHM: `:     ?MMn.
    `'**MML M>  'MMhMMMMMMMM  #      `M:^*x
         ^*MMttnnMMMMMMMMMMMH>.        M:.4X
                        `MMMM>X   (   .MMM:MM!   .
                          `'''4x.dX  +^ `''MMMMHM?L..
                                ``'           `'`'`'`

              Idefix version 2.1.01-2f15373c
              Built against Kokkos 40301
              Compiled on Sep  8 2024 at 13:56:42

Main: initialization stage.
Main: initialisation finished.
Main: running on [...].physik.uni-muenchen.de
-----------------------------------------------------------------------------
Input Parameters using input file idefix.ini:
-----------------------------------------------------------------------------
[Boundary]
    X1-beg      userdef
    X1-end      outflow
    X2-beg      outflow
    X2-end      outflow
    X3-beg      periodic
    X3-end      periodic
[Gravity]
    Mcentral        1.0
    gravCst     1
    potential       central
    skip        1
[Grid]
    X1-grid     1   1.0 1280    l   3.0
    X2-grid     1   1.2707963267948965  384 u   1.8707963267948966
    X3-grid     1   0.0 512 u   1.5707963267948966
[Hydro]
    csiso       userdef
    solver      hllc
[Output]
    dmp     62.831853071795865
    dmp_dir     /idefix_VSI3D
    log     100
    vtk     62.831853071795865
    vtk_dir     /idefix_VSI3D
[Setup]
    epsilon     0.05
[TimeIntegrator]
    CFL     0.8
    CFL_max_var     1.1
    check_nan       100
    first_dt        1.e-3
    max_runtime     -1
    maxdivB     1e-06
    nstages     2
    tstop       1256.6370614359
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
Input: Kokkos configuration
Device Execution Space:
  KOKKOS_ENABLE_CUDA: yes
Cuda Options:
  KOKKOS_ENABLE_CUDA_LAMBDA: yes
  KOKKOS_ENABLE_CUDA_LDG_INTRINSIC: yes
  KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE: no
  KOKKOS_ENABLE_CUDA_UVM: no
  KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA: yes
  KOKKOS_ENABLE_IMPL_CUDA_MALLOC_ASYNC: yes

Cuda Runtime Configuration:
macro  KOKKOS_ENABLE_CUDA      : defined
macro  CUDA_VERSION          = 11080 = version 11.8
Kokkos::Cuda[ 0 ] NVIDIA H100 NVL capability 9.0, Total Global Memory: 93.12 G, Shared Memory per Block: 48 K : Selected
-----------------------------------------------------------------------------
Input: Compiled with DOUBLE PRECISION arithmetic.
Input: DIMENSIONS=3.
Input: COMPONENTS=3.
Grid: full grid size is 
     Direction X1: userdef  1....1280....3  outflow
     Direction X2: outflow  1.2708....384....1.8708 outflow
     Direction X3: periodic 0....512....1.5708  periodic
Hydro: solving HD equations.
Hydro: Reconstruction: 2nd order (PLM Van Leer)
EquationOfState: isothermal with user-defined cs function.
RiemannSolver: hllc (HD).
Gravity: ENABLED.
Gravity: G=1.
Gravity: central mass gravitational potential ENABLED with M=1
TimeIntegrator: using 2nd Order (RK2) integrator.
TimeIntegrator: Using adaptive dt with CFL=0.8 .
Main: Creating initial conditions.
Vtk: Write file data.0000.vtk...done in 9.42678 s.
Dump: Write file n 0...done in 8.82041 s.
Main: Cycling Time Integrator...
TimeIntegrator:             time |            cycle |        time step | cell (updates/s)
TimeIntegrator:     0.000000e+00 |                0 |     1.000000e-03 |              N/A
TimeIntegrator:     1.686485e-01 |              100 |     1.716926e-03 |     7.347258e+08

glesur commented 2 months ago

Never seen this, but this is very likely a bug in the serial I/O routine, possibly related to the filesystem (or not). Could you try enabling MPI? (no need to run on several GPUs, but enabling MPI will use another output procedure, possibly more reliable for large datasets).

birnstiel commented 2 months ago

Thanks for the quick answer! I will try that. That would mean adding the Idefix_MPI option and running with mpirun using just one process? This is on a Lustre file system, in case that matters.

glesur commented 2 months ago

Yes, Idefix_MPI=ON in cmake, and then run with one process. Lustre can be touchy for large files, and the serial routines are pretty basic...
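
For reference, the suggestion above could look roughly like this when applied to the build commands from the original report. This is only a sketch: it assumes the same modules and `$IDEFIX_DIR` as in the slurm script, that the executable is named `idefix` as in the build log, and that the loaded Open MPI module provides `mpirun`.

```shell
# Same configuration as in the report, with the MPI output path enabled
# through the Idefix_MPI option mentioned above.
cmake $IDEFIX_DIR -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_HOPPER90=ON -DIdefix_MPI=ON
make -j 2

# Run on a single process (one GPU); MPI is only used here to switch the I/O routines.
mpirun -np 1 ./idefix
```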

birnstiel commented 2 months ago

There was another run that was able to write full files only every now and then. This was very puzzling. The MPI option turned out to help because it now gave an error message: I was running out of disk space! Turns out writing files every now and then worked because in the background some old data was moved away. This is embarrassing, sorry for bothering you! 🤦‍♂️
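
A pre-flight check in the job script might have caught this earlier. The following is a minimal sketch, not part of Idefix: `has_free_space` is a hypothetical helper, and the threshold is an assumption based on the file sizes listed above (about 6.6G for a vtk file plus 7.6G for a dump). Note that `df` reports free space on the file system and does not account for per-user quotas.

```shell
#!/bin/sh
# Hypothetical helper (not part of Idefix): succeed only if the file system
# holding DIR has at least NEED_KB kilobytes available.
has_free_space() {
    dir=$1
    need_kb=$2
    # df -P prints one POSIX-format line per file system; with -k the
    # fourth column is the available space in kilobytes.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge "$need_kb" ]
}
```

In a job script this could be used as `has_free_space /idefix_VSI3D 15000000 || exit 1` before launching the run, where 15000000 KB is a rough estimate for one vtk file plus one dump at this resolution.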

glesur commented 2 months ago

Well, it still points to a defect, which is that the code is unable to identify cases where outputs were unsuccessful when using serial I/O, so I'd say there is something to be fixed here!
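
Until the serial routines report such failures, truncated outputs can be spotted after the fact by comparing file sizes. The sketch below is a workaround outside of Idefix, with a hypothetical helper name. It assumes that, for a fixed grid and field list, every output should be about as large as the first one, and it flags any file smaller than half the size of a reference file.

```shell
#!/bin/sh
# Hypothetical helper (not part of Idefix): print every file whose size is
# below half the size of the reference file given as first argument.
flag_truncated() {
    ref=$1
    shift
    # The arithmetic expansion strips any padding that wc may add.
    ref_size=$(( $(wc -c < "$ref") ))
    for f in "$@"; do
        size=$(( $(wc -c < "$f") ))
        if [ "$size" -lt $(( ref_size / 2 )) ]; then
            echo "$f"
        fi
    done
}
```

With the listing from the report, `flag_truncated data.0000.vtk data.*.vtk` would be expected to print `data.0001.vtk` and `data.0002.vtk`, since both are only 8,0K compared to 6,6G for the first file.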