Intermittent loop nest error when compiling prediction case

MTCam commented 1 year ago

This error occurs intermittently with the drivers_y3-prediction driver. To reproduce on Lassen, clone illinois-ceesd/drivers_y3-prediction, install it, and run the smoke_test_ks_3d on 4 ranks over and over again until the error occurs.

cd smoke_test_ks_3d
jsrun -n 4 -g 1 -a 1 python -u -O -m mpi4py ./driver.py --lazy -i run_params.yaml

2023-07-31 09:59:30,098 - INFO - arraycontext.impl.pytato.compile - transform_loopy_program for 'update_smoothness': completed (74.94s wall 1.00x CPU)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "mirge.env/lib/python3.11/site-packages/mpi4py/__main__.py", line 7, in <module>
    main()
  File "mirge.env/lib/python3.11/site-packages/mpi4py/run.py", line 198, in main
    run_command_line(args)
  File "mirge.env/lib/python3.11/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "./driver.py", line 82, in <module>
    main(actx_class, restart_filename=restart_filename,
  File "/mirgecom/mirgecom/mpi.py", line 226, in wrapped_func
    func(*args, **kwargs)
  File "y3-prediction-scaling-run/y3prediction/prediction.py", line 2300, in main
    restart_stepper_state = update_smoothness_compiled(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/arraycontext/arraycontext/impl/pytato/compile.py", line 366, in __call__
    compiled_func = self._dag_to_compiled_func(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/grudge/grudge/array_context.py", line 307, in _dag_to_compiled_func
    ) = self._dag_to_transformed_pytato_prg(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/arraycontext/arraycontext/impl/pytato/compile.py", line 442, in _dag_to_transformed_pytato_prg
    .with_transformed_program(self
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pytato/pytato/target/loopy/__init__.py", line 151, in with_transformed_program
    return self.copy(program=f(self.program))
                             ^^^^^^^^^^^^^^^
  File "/meshmode/meshmode/array_context.py", line 1742, in transform_loopy_program
    raise err
  File "/meshmode/meshmode/array_context.py", line 1739, in transform_loopy_program
    iel_to_idofs = _get_iel_to_idofs(knl)
                   ^^^^^^^^^^^^^^^^^^^^^^
  File "/meshmode/meshmode/array_context.py", line 1068, in _get_iel_to_idofs
    raise NotImplementedError(f"Cannot fit loop nest '{insn.within_inames}'"
NotImplementedError: Cannot fit loop nest 'frozenset({'idof_ensm2'})' into known set of loop-nest patterns.

majosm commented 1 year ago

A few more details...

If I print out the iname object corresponding to the name in the frozenset from within_inames, I get:

kernel.inames[iname]=Iname(name='idof_ensm30', tags=frozenset({DiscretizationDOFAxisTag()}))

The instruction itself seems to vary from run to run. Here are a couple:

insn=Assignment(
    temp_var_type=Optional(),
    depends_on=frozenset(),
    atomicity=(),
    depends_on_is_final=False,
    tags=frozenset({
        EinsumTag(orig_loop_nest=frozenset({'iel_0_0_', 'idof_0_0__6'}))}),
    within_inames_is_final=False, predicates=frozenset(),
    expression=Subscript(
        Variable('_pt_dist_id_176'),
        (
            Subscript(Variable('from_el_indices_6'), (0, 0)),
            Subscript(Variable('_pt_dist_id_163'), (0, Variable('idof_ensm30'))))),
    id='_mm_contract_cse_8',
    no_sync_with=frozenset(),
    within_inames=frozenset({'idof_ensm30'}),
    assignee=Variable('cse_8_subst_0'),
    groups=frozenset(),
    priority=0,
    conflicts_with_groups=frozenset())

insn=Assignment(
    id='_mm_contract_cse_99',
    assignee=Variable('cse_99_subst_0'),
    depends_on=frozenset({'cse_97_store', '_pt_temp_30_store'}),
    conflicts_with_groups=frozenset(),
    priority=0,
    depends_on_is_final=False,
    groups=frozenset(),
    tags=frozenset({
        EinsumTag(orig_loop_nest=frozenset({'iel_0_0__5', 'idof_0_0__14'}))}),
    atomicity=(),
    within_inames=frozenset({'idof_ensm10'}),
    expression=If(
        Comparison(
            Subscript(
                Variable('_pt_temp_30'),
                (0,)),
            '>',
            1.0),
        Call(
            Variable('abs'),
            (Quotient(Sum((1.0, ...)), Sum((..., 1e-13))),)),
        1.0),
    no_sync_with=frozenset(),
    temp_var_type=Optional(),
    predicates=frozenset(),
    within_inames_is_final=False)

If I'm interpreting within_inames correctly, it sounds like these instructions belong to loops that only iterate over the DOF axis, not the element axis too? Is this not allowed @inducer?

Still very confused as to how this could be intermittent...

MTCam commented 1 year ago

(...)
    expression=If(
        Comparison(
            Subscript(
                Variable('_pt_temp_30'),
                (0,)),
            '>',
            1.0),
        Call(
            Variable('abs'),
            (Quotient(Sum((1.0, ...)), Sum((..., 1e-13))),)),
        1.0),
(...)

fwiw, this appears to be from the species limiting which uses the bound-preserving limiter where this appears (in limiter.py):

    # Linear scaling of polynomial coefficients
    _theta = actx.np.minimum(
        1.0, actx.np.where(actx.np.less(mmin_i, mmin),
                           abs((mmin-cell_avgs)/(mmin_i-cell_avgs+1e-13)),
                           1.0)
        )

Still very confused as to how this could be intermittent...

Setting pyro_temp_iter: 4 in the KS3D config and running with 4 MPI ranks seems to produce the error pretty reliably for me. Any KS3D-like runs with nranks >= 128 have also been getting this every time. Sometimes it happens when compiling update_smoothness and sometimes when compiling unfiltered_rhs.

Edit: Have not been able to reproduce in serial at all.

majosm commented 1 year ago

(Nothing new here, just summarizing the developments from the past few weeks so they don't get lost.)

Neither inducer/pytato#452 nor making the DAG hash deterministically seem to fix the intermittency.
- Note: never tried both together

Error specifics vary; sometimes it's

NotImplementedError: Cannot fit loop nest 'frozenset({'idof_ensm32'})' into known set of loop-nest patterns.

and other times it's

NotImplementedError: The <iel,idof> loop 'frozenset({'iel_ensm32', 'idof_ensm32'})' has the idof-loop that's not nested within the iel-loop.

In the latter case, the offending instruction is

_mm_contract_cse_5:
  cse_5_subst_0 <- _pt_dist_id_487[
      from_el_indices_13[0, 0],
      _pt_dist_id_239[0, idof_ensm32]]
  {tags=EinsumTag(orig_loop_nest=frozenset({'iel_0_0_', 'idof_0_0__7'}))}

Of note here is that from_el_indices_13 should have an element index in dim 0, but it doesn't.

It appears that this is caused by a kernel processing step that removes single-element loops. The loops correspond to boundaries that are sometimes created when an element face lies on both a rank boundary and a volume boundary (these boundaries can have any number of elements, but it's not uncommon to just have 1).
We can potentially avoid this by switching to BatchedEinsumPytatoPyOpenCLArrayContext.

MTCam commented 1 year ago

Just a quick update, running the nowall option to de-wedge this for larger processor counts worked for the 256 case, but not for larger cases. We still get the same error.

MTCam commented 1 year ago

This appears to be another manifestation of this same issue. Now it happening far more often than it once did. This is a 2D case (smoke_test_ks) on 2 ranks. Now causing CI@mirgecom to fail quite often.

  File "/shared/home/githubrunner/actions-runner/_work/mirgecom/emirge.prod/miniforge3/envs/ceesd/lib/python3.11/site-packages/y3prediction/prediction.py", line 4168, in my_rhs
    rhs_state = unfiltered_rhs_compiled(t, state)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/arraycontext/arraycontext/impl/pytato/compile.py", line 366, in __call__
    compiled_func = self._dag_to_compiled_func(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/grudge/grudge/array_context.py", line 307, in _dag_to_compiled_func
    ) = self._dag_to_transformed_pytato_prg(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/arraycontext/arraycontext/impl/pytato/compile.py", line 445, in _dag_to_transformed_pytato_prg
    .with_transformed_translation_unit(
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/pytato/pytato/target/loopy/__init__.py", line 292, in with_transformed_translation_unit
    program=f(self.program.t_unit).executor(self.cl_context),
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/meshmode/meshmode/array_context.py", line 1742, in transform_loopy_program
    raise err
  File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/meshmode/meshmode/array_context.py", line 1739, in transform_loopy_program
    iel_to_idofs = _get_iel_to_idofs(knl)
                   ^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/meshmode/meshmode/array_context.py", line 1049, in _get_iel_to_idofs
    raise NotImplementedError("The <iel,idof> loop "
NotImplementedError: The <iel,idof> loop 'frozenset({'idof_ensm6', 'iel_ensm6'})' has the idof-loop that's not nested within the iel-loop.
2023-09-07 10:46:27,832 - INFO - arraycontext.impl.pytato.compile - transform_dag for 'unfiltered_rhs': completed (12.35s wall 1.00x CPU)

inducer commented 1 year ago

Hopefully we're inching closer to being able to use the batched-einsum array context. We've identified (and, with Kaushik's help, hopefully fixed) at least the first issue that was keeping @majosm from using it with mirgecom.

illinois-ceesd / mirgecom

Intermittent loop nest error when compiling prediction case #944