spcl / dace

DaCe - Data Centric Parallel Programming
http://dace.is/fast
BSD 3-Clause "New" or "Revised" License

Python parsing failure: temporary look up #1306

Closed FlorianDeconinck closed 1 year ago

FlorianDeconinck commented 1 year ago

Describe the bug Python code that previously parsed (with DaCe 0.14) now fails to parse. The error seems to point to a bookkeeping issue with the temporaries when their strides are updated post-parse. The stack shows:

dsl/pace/dsl/dace/orchestration.py:289: in _parse_sdfg
    sdfg = daceprog.to_sdfg(
.venv/lib/python3.8/site-packages/dace/frontend/python/parser.py:267: in to_sdfg
    sdfg = self._parse(args, kwargs, simplify=simplify, save=save, validate=validate)
.venv/lib/python3.8/site-packages/dace/frontend/python/parser.py:480: in _parse
    sdfg, cached = self._generate_pdp(args, kwargs, simplify=simplify)
.venv/lib/python3.8/site-packages/dace/frontend/python/parser.py:884: in _generate_pdp
    sdfg = newast.parse_dace_program(self.name,
.venv/lib/python3.8/site-packages/dace/frontend/python/newast.py:211: in parse_dace_program
    sdfg, _, _, _ = visitor.parse_program(preprocessed_ast.preprocessed_ast.body[0])
.venv/lib/python3.8/site-packages/dace/frontend/python/newast.py:1202: in parse_program
    self.visit_TopLevel(stmt)
.venv/lib/python3.8/site-packages/dace/frontend/python/astutils.py:479: in visit_TopLevel
    self.visit(node)
.venv/lib/python3.8/site-packages/dace/frontend/python/newast.py:1178: in visit
    return super().visit(node)
../../../.pyenv/versions/3.8.10/lib/python3.8/ast.py:371: in visit
    return visitor(node)
.venv/lib/python3.8/site-packages/dace/frontend/python/newast.py:2283: in visit_For
    laststate, first_loop_state, last_loop_state, _ = self._recursive_visit(node.body,
.venv/lib/python3.8/site-packages/dace/frontend/python/newast.py:2148: in _recursive_visit
    self.visit_TopLevel(stmt)
.venv/lib/python3.8/site-packages/dace/frontend/python/astutils.py:477: in visit_TopLevel
    getattr(self, "visit_TopLevel" + clsname)(node)
.venv/lib/python3.8/site-packages/dace/frontend/python/newast.py:4477: in visit_TopLevelExpr
    self.visit_Call(node.value)
.venv/lib/python3.8/site-packages/dace/frontend/python/newast.py:4306: in visit_Call
    return self._parse_sdfg_call(funcname, func, node)
.venv/lib/python3.8/site-packages/dace/frontend/python/newast.py:3944: KeyError
>           if len(strides) == len(sdfg.arrays[a].shape):
E           KeyError: '__tmp249'

To Reproduce I couldn't muster a smaller reproduction (apologies!), so this uses the Pace repository regression test for acoustics. It builds and runs on 6 ranks.

# Reproducer runs the AcousticDynamics regression test
# Original code: fv3core/pace/fv3core/stencils/dyn_core.py
# DaCe is applied on the AcousticDynamics .__call__ function

# Get Pace repository
git clone -b update/gt4py_dace git@github.com:GEOS-ESM/pace
cd pace
git submodule init
git submodule update

# Setup the venv, including Cupy for GPU
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install external/gt4py/
pip install -r requirements_dev.txt -c constraints.txt
pip install cupy-cuda12x # REPLACE with the relevant CUDA version
pip install -r requirements_

# Download data
mkdir -p test_data/8.1.3/c12_6ranks_standard/dycore
cd test_data/8.1.3/c12_6ranks_standard/dycore
pip install gdown
gdown https://drive.google.com/uc?id=1p4GvdSof10BYTV9-sYTTWcLkX9RScQJn
gdown https://drive.google.com/uc?id=1a5TbVqFmqmVX5qzGMFnPy3xn7Qj0by8j
gdown https://drive.google.com/uc?id=1SOO97ncz-fCGVoPD7pUuYY9uYowUjato
gdown https://drive.google.com/uc?id=1Wcb1l7GXE5C_82oItGo7RkKloWlJCKR4
cd -

# Run the DynCore regression test
export FV3_DACEMODE=BuildAndRun 
export PACE_CONSTANTS=GFS
mpirun -np 6 \
       pytest -v -s --data_path=./test_data/8.1.3/c12_6ranks_standard/dycore \
       -m parallel \
       --backend=dace:gpu --which_modules=DynCore --which_rank=0 \
       --threshold_overrides_file=./fv3core/tests/savepoint/translate/overrides/standard.yaml \
       ./fv3core/tests/savepoint

Expected behavior The example above is a regression test, but successfully parsing and generating the code in .gt_cache_A/dacecache should be considered a success even if the test itself fails.

alexnick83 commented 1 year ago

There is an argument (probably from the closure arrays, but I need to look further into it) that is not among the (nested) SDFG's arrays. I made PR #1314, which filters out such arguments (which may be dangerous). It allows the test above to proceed, but it eventually segfaults. Maybe this provides a hint? It would be helpful if we could somehow isolate the test in a way that it can be run in debug mode (i.e., no MPI).
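
For reference, a minimal sketch of the kind of guard such a filter could apply when matching call arguments against the nested SDFG's data descriptors. This is a hypothetical illustration (names and structure are assumptions), not the actual code of #1314:

# Hypothetical sketch: drop closure arguments whose names are not registered
# in the nested SDFG, instead of raising a KeyError like the one above.
def filter_call_args(sdfg, strides_per_arg):
    """strides_per_arg: mapping of argument name -> strides tuple."""
    kept = {}
    for name, strides in strides_per_arg.items():
        desc = sdfg.arrays.get(name)
        if desc is None:
            # Argument was likely optimized away (e.g., by simplification);
            # skipping it mirrors the behavior described for PR #1314.
            continue
        if len(strides) == len(desc.shape):
            kept[name] = strides
    return kept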

FlorianDeconinck commented 1 year ago

Single-node reproducer using the full dycore; it shows the same issue.

# Get Pace repository
git clone -b update/gt4py_dace git@github.com:GEOS-ESM/pace
cd pace
git submodule init
git submodule update

# Setup the venv, including Cupy for GPU
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install external/gt4py/
pip install -r requirements_dev.txt -c constraints.txt
pip install cupy-cuda12x # REPLACE with the relevant CUDA version

# Download data
pip install gdown
gdown https://drive.google.com/uc?id=13BzQUdZvJRS119SNpSzj1rSVrscidS3A #acoustics_driver.py
gdown https://drive.google.com/uc?id=1zNjy06xF8e6W4i_7MJO6K431fj-1YOUo #baroclinic_c12.yaml

# Run the acoustics driver
export FV3_DACEMODE=BuildAndRun 
export PACE_CONSTANTS=GFS
python acoustics_driver.py

alexnick83 commented 1 year ago

I can confirm my previous findings. The array q_advected_x (rewritten as __g_self_q_advected_x) is in the closure arrays of a nested DaCe program without actually being in the corresponding SDFG's arrays (or symbols). I am posting the nested program's code below for reference. Since the program makes other calls, I believe the array is used somewhere in the call tree but is optimized away (my understanding is that gt4py enables automatic simplification by default). I updated the fix (in #1314) to not add such arrays to the program's arguments.

@tbennun since you wrote the closure resolver, do you think that the above is (very) likely to occur, or do we need to dig deeper to find why closure arrays are missing from their corresponding SDFG?

@FlorianDeconinck with the fix, code seems to be generated, but the test eventually fails overall with the following error. Is this something different?

[DaCe Config] Rank 0 loading SDFG /scratch2/alziogas/pace/.gt_cache_FV3_A/dacecache/DriverAcoustics__critical_path_step_all
2023-07-14 14:29:46|INFO|rank 0|pace.util.logging:DynCore
2023-07-14 14:29:46|INFO|rank 0|pace.util.logging:TracerAdvection
2023-07-14 14:29:46|INFO|rank 0|pace.util.logging:DynCore
2023-07-14 14:29:46|INFO|rank 0|pace.util.logging:TracerAdvection
2023-07-14 14:29:47|INFO|rank 0|pace.util.logging:Remapping
2023-07-14 14:29:47|INFO|rank 0|pace.util.logging:Omega
2023-07-14 14:29:47|INFO|rank 0|pace.util.logging:Del2Cubed
2023-07-14 14:29:47|INFO|rank 0|pace.util.logging:Neg Adj 3
2023-07-14 14:29:47|INFO|rank 0|pace.util.logging:CubedToLatLon
Exception ignored on calling ctypes callback function: <function flatten_callback.<locals>.make_cb.<locals>.cb_func at 0x7f15b8d6eb80>
Traceback (most recent call last):
  File "/home/alziogas/Projects/dace/dace/frontend/python/preprocessing.py", line 427, in cb_func
    return func(*args, **kwargs)
  File "/scratch2/alziogas/pace/util/pace/util/_timing.py", line 86, in __exit__
    self.timer.stop(name)
  File "/scratch2/alziogas/pace/util/pace/util/_timing.py", line 43, in stop
    self._accumulated_time[name] += time() - self._clock_starts.pop(name)
KeyError: 'mainloop'
2023-07-14 14:29:47|INFO|rank 0|pace.util.logging:Finished stepping 0
...
/scratch2/alziogas/pace/driver/pace/driver/performance/report.py:125: RuntimeWarning: divide by zero encountered in double_scalars
  speedup = dt_atmos / mainloop
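
For what it's worth, the KeyError above comes from stopping a timer whose clock was never started (or was already stopped), likely a side effect of the ignored callback exception. A defensive variant of such a timer, sketched here as a generic illustration and not Pace's actual _timing.py, would guard the pop:

import time
from collections import defaultdict

class GuardedTimer:
    # Minimal illustrative timer that tolerates stop() without a matching start().
    def __init__(self):
        self._accumulated_time = defaultdict(float)
        self._clock_starts = {}

    def start(self, name):
        self._clock_starts[name] = time.time()

    def stop(self, name):
        started = self._clock_starts.pop(name, None)
        if started is None:
            # Clock was never started (or was stopped twice): skip instead of
            # raising KeyError as in the traceback above.
            return
        self._accumulated_time[name] += time.time() - started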

The program's code:

    def __call__(
        self,
        state: DycoreState,
        timestep: Float,  # time to step forward by in seconds
        n_map=1,  # [DaCe] replaces state.n_map
    ):
        # u, v, w, delz, delp, pt, pe, pk, phis, wsd, omga, ua, va, uc, vc, mfxd,
        # mfyd, cxd, cyd, pkz, peln, q_con, ak, bk, diss_estd, cappa, mdt, n_split,
        # akap, ptop, n_map, comm):
        end_step = n_map == self.config.k_split
        # dt = state.mdt / self.config.n_split
        dt_acoustic_substep: Float = self.dt_acoustic_substep(timestep)
        dt2: Float = self.dt2(dt_acoustic_substep)
        n_split = self.config.n_split
        # NOTE: In the Fortran model the halo update starts in fv_dynamics, not here
        self._halo_updaters.q_con__cappa.start()
        self._halo_updaters.delp__pt.start()
        self._halo_updaters.u__v.start()
        self._halo_updaters.q_con__cappa.wait()

        self._zero_data(
            state.mfxd,
            state.mfyd,
            state.cxd,
            state.cyd,
            self._heat_source,
            state.diss_estd,
            n_map == 1,
        )

        # "acoustic" loop
        # called this because its timestep is usually limited by horizontal sound-wave
        # processes. Note this is often not the limiting factor near the poles, where
        # the speed of the polar night jets can exceed two-thirds of the speed of sound.
        for it in dace_nounroll(range(n_split)):
            # the Lagrangian dynamics have two parts. First we advance the C-grid winds
            # by half a time step (c_sw). Then the C-grid winds are used to define
            # advective fluxes to advance the D-grid prognostic fields a full time step
            # (the rest of the routines).
            #
            # Along-surface flux terms (mass, heat, vertical momentum, vorticity,
            # kinetic energy gradient terms) are evaluated forward-in-time.
            #
            # The pressure gradient force and elastic terms are then evaluated
            # backwards-in-time, to improve stability.
            remap_step = False
            if self.config.breed_vortex_inline or (it == n_split - 1):
                remap_step = True
            if not self.config.hydrostatic:
                self._halo_updaters.w.start()
                if it == 0:
                    self._gz_from_surface_height_and_thickness(
                        self._zs,
                        state.delz,
                        self._gz,
                    )
                    self._halo_updaters.gz.start()
            if it == 0:
                self._halo_updaters.delp__pt.wait()

            if it == n_split - 1 and end_step:
                if self.config.use_old_omega:
                    self._interface_pressure_from_toa_pressure_and_thickness(
                        state.delp,
                        self._pem,
                        self._ptop,
                    )

            self._halo_updaters.u__v.wait()
            if not self.config.hydrostatic:
                self._halo_updaters.w.wait()

            # compute the c-grid winds at t + 1/2 timestep
            self._checkpoint_csw(state, tag="In")
            self.cgrid_shallow_water_lagrangian_dynamics(
                state.delp,
                state.pt,
                state.u,
                state.v,
                state.w,
                state.uc,
                state.vc,
                state.ua,
                state.va,
                self._ut,
                self._vt,
                self._divgd,
                state.omga,
                dt2,
            )
            self._checkpoint_csw(state, tag="Out")

            # TODO: Computing the pressure gradient outside of C_SW was originally done
            # so that we could transpose into a vertical-first memory ordering for the
            # gz computation, now that we have gt4py we should pull this into C_SW.
            if self.config.nord > 0:
                self._halo_updaters.divgd.start()
            if not self.config.hydrostatic:
                # TODO: is there some way we can avoid aliasing gz and zh, so that
                # gz is always a geopotential and zh is always a height?
                if it == 0:
                    self._halo_updaters.gz.wait()
                    self._copy_stencil(
                        self._gz,
                        self._zh,
                    )
                else:
                    self._copy_stencil(
                        self._zh,
                        self._gz,
                    )
            if not self.config.hydrostatic:
                self.update_geopotential_height_on_c_grid(
                    self._zs, self._ut, self._vt, self._gz, self._ws3, dt2
                )
                # TODO (floriand): Due to DaCe VRAM pooling creating a memory
                # leak with the usage pattern of those two fields
                # We use the C_SW internal to workaround it e.g.:
                #  - self.cgrid_shallow_water_lagrangian_dynamics.delpc
                #  - self.cgrid_shallow_water_lagrangian_dynamics.ptc
                # DaCe has already a fix on their side and it awaits release
                # issue
                self.vertical_solver_cgrid(
                    dt2,
                    self.cappa,
                    self._ptop,
                    state.phis,
                    self._ws3,
                    self.cgrid_shallow_water_lagrangian_dynamics.ptc,
                    state.q_con,
                    self.cgrid_shallow_water_lagrangian_dynamics.delpc,
                    self._gz,
                    self._pkc,
                    state.omga,
                )

            self._p_grad_c(
                self.grid_data.rdxc,
                self.grid_data.rdyc,
                state.uc,
                state.vc,
                self.cgrid_shallow_water_lagrangian_dynamics.delpc,
                self._pkc,
                self._gz,
                dt2,
            )
            self._halo_updaters.uc__vc.start()
            if self.config.nord > 0:
                self._halo_updaters.divgd.wait()
            self._halo_updaters.uc__vc.wait()
            # use the computed c-grid winds to evolve the d-grid winds forward
            # by 1 timestep
            self._checkpoint_dsw_in(state)
            self.dgrid_shallow_water_lagrangian_dynamics(
                self._vt,
                state.delp,
                state.pt,
                state.u,
                state.v,
                state.w,
                state.uc,
                state.vc,
                state.ua,
                state.va,
                self._divgd,
                state.mfxd,
                state.mfyd,
                state.cxd,
                state.cyd,
                self._crx,
                self._cry,
                self._xfx,
                self._yfx,
                state.q_con,
                self._zh,
                self._heat_source,
                state.diss_estd,
                dt_acoustic_substep,
            )
            self._checkpoint_dsw_out(state)
            # note that uc and vc are not needed at all past this point.
            # they will be re-computed from scratch on the next acoustic timestep.

            self._halo_updaters.delp__pt__q_con.update()

            # Not used unless we implement other betas and alternatives to nh_p_grad
            # if self.namelist.d_ext > 0:
            #    raise 'Unimplemented namelist option d_ext > 0'

            # TODO: should the dycore have hydrostatic and non-hydrostatic modes,
            # or would we make a new class for the non-hydrostatic mode?
            if not self.config.hydrostatic:
                # without explicit arg names, numpy does not run
                self.update_height_on_d_grid(
                    surface_height=self._zs,
                    height=self._zh,
                    courant_number_x=self._crx,
                    courant_number_y=self._cry,
                    x_area_flux=self._xfx,
                    y_area_flux=self._yfx,
                    ws=self._wsd,
                    dt=dt_acoustic_substep,
                )
                self.vertical_solver(
                    remap_step,
                    dt_acoustic_substep,
                    self.cappa,
                    self._ptop,
                    self._zs,
                    self._wsd,
                    state.delz,
                    state.q_con,
                    state.delp,
                    state.pt,
                    self._zh,
                    state.pe,
                    self._pkc,
                    self._pk3,
                    state.pk,
                    state.peln,
                    state.w,
                )

                self._halo_updaters.zh.start()
                self._halo_updaters.pkc.start()
                if remap_step:
                    # TODO: can this be moved to the start of the remapping routine?
                    self._edge_pe_stencil(state.pe, state.delp, self._ptop)
                if self.config.use_logp:
                    raise NotImplementedError(
                        "unimplemented namelist option use_logp=True"
                    )
                else:
                    self._pk3_halo(self._pk3, state.delp, self._ptop, self._akap)
            if not self.config.hydrostatic:
                self._halo_updaters.zh.wait()
                self._compute_geopotential_stencil(
                    self._zh,
                    self._gz,
                )
                self._halo_updaters.pkc.wait()

                self.nonhydrostatic_pressure_gradient(
                    state.u,
                    state.v,
                    self._pkc,
                    self._gz,
                    self._pk3,
                    state.delp,
                    dt_acoustic_substep,
                    self._ptop,
                    self._akap,
                )

            if self.config.rf_fast:
                # TODO: Pass through ks, or remove, inconsistent representation vs
                # Fortran.
                self._rayleigh_damping(
                    u=state.u,
                    v=state.v,
                    w=state.w,
                    dp=self._dp_ref,
                    pfull=self._pfull,
                    dt=dt_acoustic_substep,
                    ptop=self._ptop,
                )

            if it != n_split - 1:
                # [DaCe] this should be a reuse of
                #        self._halo_updaters.u__v but it creates
                #        parameter generation issues, and therefore has been duplicated
                self._halo_updaters.u__v.start()
            else:
                if self.config.grid_type < 4:
                    self._halo_updaters.interface_uc__vc.interface()

        # we are here

        if self._do_del2cubed:
            self._halo_updaters.heat_source.update()
            # TODO: move dependence on da_min into init of hyperdiffusion class
            da_min: Float = self._get_da_min()
            cd = constants.CNST_0P20 * da_min
            # we want to diffuse the heat source from damping before we apply it,
            # so that we don't reinforce the same grid-scale patterns we're trying
            # to damp
            self._hyperdiffusion(self._heat_source, cd)
            if not self.config.hydrostatic:
                delt_time_factor = abs(dt_acoustic_substep * self.config.delt_max)
                # TODO: it looks like state.pkz is being used as a temporary here,
                # and overwritten at the start of remapping. See if we can make it
                # an internal temporary of this stencil.
                self._apply_diffusive_heating(
                    state.delp,
                    state.delz,
                    self.cappa,
                    self._heat_source,
                    state.pt,
                    delt_time_factor,
                )

FlorianDeconinck commented 1 year ago

@alexnick83 I will run the test with #1314. This error is not yours; I probably removed something from the model driver for the reproducer that I shouldn't have.

FlorianDeconinck commented 1 year ago

So I ran the regression test and got mixed results:

Validation has been verified with 0.14

Looking deeper

FlorianDeconinck commented 1 year ago

I believe this issue is close-able, since we can now generate the code.

The validation error is real, and massive, but it could stem from anything changed since v0.14, since that is the latest functioning version. I will open a separate issue.

alexnick83 commented 1 year ago

It sounds like some data are not written, leading to garbage results. I would take the following step to identify the issue:

FlorianDeconinck commented 1 year ago

Can confirm the validation failure is unrelated to the issue in this ticket (it seems the ConstantPropagation transformation creates the issue).

Tested with 2daf22
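
A possible way to isolate that would be to apply the pass on its own against the cached SDFG and re-run the validation on the result. A rough sketch, assuming the cached SDFG is stored as program.sdfg inside the dacecache folder (paths are illustrative):

import dace
from dace.transformation.passes.constant_propagation import ConstantPropagation

# Hypothetical isolation test: load the cached SDFG, apply only
# ConstantPropagation, recompile, and compare against the unmodified build.
sdfg = dace.SDFG.from_file(
    ".gt_cache_FV3_A/dacecache/DriverAcoustics__critical_path_step_all/program.sdfg"
)
ConstantPropagation().apply_pass(sdfg, {})
sdfg.compile()  # rebuild, then run the validation test against this binary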

alexnick83 commented 1 year ago

Thanks. Then we will merge the pending PR and investigate the ConstantPropagation issue.