Closed MTCam closed 4 months ago
A few more details...
If I print out the iname object corresponding to the name in the frozenset from within_inames, I get:
kernel.inames[iname]=Iname(name='idof_ensm30', tags=frozenset({DiscretizationDOFAxisTag()}))
The instruction itself seems to vary from run to run. Here are a couple:
insn=Assignment(
temp_var_type=Optional(),
depends_on=frozenset(),
atomicity=(),
depends_on_is_final=False,
tags=frozenset({
EinsumTag(orig_loop_nest=frozenset({'iel_0_0_', 'idof_0_0__6'}))}),
within_inames_is_final=False, predicates=frozenset(),
expression=Subscript(
Variable('_pt_dist_id_176'),
(
Subscript(Variable('from_el_indices_6'), (0, 0)),
Subscript(Variable('_pt_dist_id_163'), (0, Variable('idof_ensm30'))))),
id='_mm_contract_cse_8',
no_sync_with=frozenset(),
within_inames=frozenset({'idof_ensm30'}),
assignee=Variable('cse_8_subst_0'),
groups=frozenset(),
priority=0,
conflicts_with_groups=frozenset())
insn=Assignment(
id='_mm_contract_cse_99',
assignee=Variable('cse_99_subst_0'),
depends_on=frozenset({'cse_97_store', '_pt_temp_30_store'}),
conflicts_with_groups=frozenset(),
priority=0,
depends_on_is_final=False,
groups=frozenset(),
tags=frozenset({
EinsumTag(orig_loop_nest=frozenset({'iel_0_0__5', 'idof_0_0__14'}))}),
atomicity=(),
within_inames=frozenset({'idof_ensm10'}),
expression=If(
Comparison(
Subscript(
Variable('_pt_temp_30'),
(0,)),
'>',
1.0),
Call(
Variable('abs'),
(Quotient(Sum((1.0, ...)), Sum((..., 1e-13))),)),
1.0),
no_sync_with=frozenset(),
temp_var_type=Optional(),
predicates=frozenset(),
within_inames_is_final=False)
If I'm interpreting within_inames correctly, it sounds like these instructions belong to loops that only iterate over the DOF axis, not the element axis too? Is this not allowed @inducer?
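To make that reading of within_inames concrete, here is a rough sketch of the check I have in mind. It does not use loopy: `Insn` is a made-up stand-in that carries only an id and within_inames. Matching inames by the `iel_`/`idof_` prefixes and a shared suffix is an assumption based on the names in the dumps above, not how meshmode actually decides this (it works from tags such as DiscretizationDOFAxisTag).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical stand-in for a loopy instruction (id + within_inames only)."""
    id: str
    within_inames: frozenset


def suffix(iname, prefix):
    """Return the part of *iname* after *prefix*, or None if it does not match."""
    return iname[len(prefix):] if iname.startswith(prefix) else None


def idof_only_insns(insns):
    """Return ids of instructions inside an idof loop with no matching iel loop."""
    flagged = []
    for insn in insns:
        iel_suffixes = {suffix(n, "iel_") for n in insn.within_inames} - {None}
        idof_suffixes = {suffix(n, "idof_") for n in insn.within_inames} - {None}
        # An idof loop whose suffix has no iel counterpart is "unpaired".
        if idof_suffixes - iel_suffixes:
            flagged.append(insn.id)
    return sorted(flagged)


insns = [
    Insn("_mm_contract_cse_8", frozenset({"idof_ensm30"})),
    Insn("_mm_contract_cse_99", frozenset({"idof_ensm10"})),
    # Made-up instruction with a properly paired <iel, idof> nest, for contrast.
    Insn("paired_example", frozenset({"iel_ensm6", "idof_ensm6"})),
]

print(idof_only_insns(insns))
```

Both instructions from the dumps above get flagged, while the paired example does not.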
Still very confused as to how this could be intermittent...
(...) expression=If( Comparison( Subscript( Variable('_pt_temp_30'), (0,)), '>', 1.0), Call( Variable('abs'), (Quotient(Sum((1.0, ...)), Sum((..., 1e-13))),)), 1.0), (...)
fwiw, this appears to come from the species limiting, which uses the bound-preserving limiter. The expression shows up in limiter.py:
# Linear scaling of polynomial coefficients
_theta = actx.np.minimum(
1.0, actx.np.where(actx.np.less(mmin_i, mmin),
abs((mmin-cell_avgs)/(mmin_i-cell_avgs+1e-13)),
1.0)
)
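For anyone matching this against the If(...) expression in the instruction dump, here is a per-cell scalar sketch of the same formula in plain Python. Reading mmin as the lower bound, mmin_i as the minimum value in the cell, and cell_avg as the cell average is my interpretation of the variable names; the function name and default arguments are made up for illustration.

```python
def limiter_theta(cell_avg, mmin_i, mmin=0.0, eps=1e-13):
    """Scalar version of the _theta expression for a single cell.

    Assumed meanings (not confirmed by the snippet itself):
    cell_avg -- cell average of the limited quantity
    mmin_i   -- minimum of the quantity within the cell
    mmin     -- lower bound the limiter should enforce
    """
    if mmin_i < mmin:
        # Cell dips below the bound: scale toward the cell average.
        ratio = abs((mmin - cell_avg) / (mmin_i - cell_avg + eps))
    else:
        # Cell already satisfies the bound: no scaling.
        ratio = 1.0
    return min(1.0, ratio)


# Cell average 0.2, in-cell minimum -0.2, bound 0.0: theta is about 0.5.
theta_violating = limiter_theta(cell_avg=0.2, mmin_i=-0.2)
# In-cell minimum already above the bound: theta is exactly 1.0.
theta_ok = limiter_theta(cell_avg=0.2, mmin_i=0.1)
```

The `where` in the array version corresponds to the `if`/`else` here, which is what shows up as the If node in the generated instruction.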
Still very confused as to how this could be intermittent...
Setting pyro_temp_iter: 4 in the KS3D config and running with 4 MPI ranks seems to produce the error pretty reliably for me. Any KS3D-like runs with nranks >= 128 have also been getting this every time. Sometimes it happens when compiling update_smoothness and sometimes when compiling unfiltered_rhs.
Edit: Have not been able to reproduce in serial at all.
(Nothing new here, just summarizing the developments from the past few weeks so they don't get lost.)
Sometimes the error is
NotImplementedError: Cannot fit loop nest 'frozenset({'idof_ensm32'})' into known set of loop-nest patterns.
and other times it's
NotImplementedError: The <iel,idof> loop 'frozenset({'iel_ensm32', 'idof_ensm32'})' has the idof-loop that's not nested within the iel-loop.
_mm_contract_cse_5:
cse_5_subst_0 <- _pt_dist_id_487[
from_el_indices_13[0, 0],
_pt_dist_id_239[0, idof_ensm32]]
{tags=EinsumTag(orig_loop_nest=frozenset({'iel_0_0_', 'idof_0_0__7'}))}
Of note here is that from_el_indices_13 should have an element index in dim 0, but it doesn't.
BatchedEinsumPytatoPyOpenCLArrayContext.
Just a quick update: running with the nowall option to de-wedge this at larger processor counts worked for the 256 case, but not for larger cases. We still get the same error.
This appears to be another manifestation of the same issue. It is now happening far more often than it once did. This is a 2D case (smoke_test_ks) on 2 ranks, and it is now causing CI@mirgecom to fail quite often.
File "/shared/home/githubrunner/actions-runner/_work/mirgecom/emirge.prod/miniforge3/envs/ceesd/lib/python3.11/site-packages/y3prediction/prediction.py", line 4168, in my_rhs
rhs_state = unfiltered_rhs_compiled(t, state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/arraycontext/arraycontext/impl/pytato/compile.py", line 366, in __call__
compiled_func = self._dag_to_compiled_func(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/grudge/grudge/array_context.py", line 307, in _dag_to_compiled_func
) = self._dag_to_transformed_pytato_prg(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/arraycontext/arraycontext/impl/pytato/compile.py", line 445, in _dag_to_transformed_pytato_prg
.with_transformed_translation_unit(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/pytato/pytato/target/loopy/__init__.py", line 292, in with_transformed_translation_unit
program=f(self.program.t_unit).executor(self.cl_context),
^^^^^^^^^^^^^^^^^^^^^^
File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/meshmode/meshmode/array_context.py", line 1742, in transform_loopy_program
raise err
File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/meshmode/meshmode/array_context.py", line 1739, in transform_loopy_program
iel_to_idofs = _get_iel_to_idofs(knl)
^^^^^^^^^^^^^^^^^^^^^^
File "/shared/home/githubrunner/actions-runner/_work/mirgecom/mirgecom/src/meshmode/meshmode/array_context.py", line 1049, in _get_iel_to_idofs
raise NotImplementedError("The <iel,idof> loop "
NotImplementedError: The <iel,idof> loop 'frozenset({'idof_ensm6', 'iel_ensm6'})' has the idof-loop that's not nested within the iel-loop.
2023-09-07 10:46:27,832 - INFO - arraycontext.impl.pytato.compile - transform_dag for 'unfiltered_rhs': completed (12.35s wall 1.00x CPU)
Hopefully we're inching closer to being able to use the batched-einsum array context. We've identified (and, with Kaushik's help, hopefully fixed) at least the first issue that was keeping @majosm from using it with mirgecom.
This error occurs intermittently with the drivers_y3-prediction driver. To reproduce on Lassen, clone illinois-ceesd/drivers_y3-prediction, install it, and run the smoke_test_ks_3d on 4 ranks over and over again until the error occurs.
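Since the failure is intermittent, a small retry wrapper saves some babysitting. This is only a sketch: `run_until_failure` is a made-up helper, and the demo command is a stand-in that fails on purpose. The real launch line for smoke_test_ks_3d on 4 ranks is not spelled out in this thread, so substitute whatever you normally use on Lassen.

```python
import subprocess
import sys


def run_until_failure(cmd, marker="NotImplementedError", max_attempts=50):
    """Run *cmd* repeatedly until it fails or prints *marker*.

    Returns (attempt_number, combined_output) for the first failing run,
    or None if every attempt succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        output = proc.stdout + proc.stderr
        if proc.returncode != 0 or marker in output:
            return attempt, output
    return None


# Stand-in command that always fails, just to exercise the helper.
# Replace with the actual (site-specific) launch command for the smoke test.
demo_cmd = [sys.executable, "-c", "raise NotImplementedError('demo')"]
demo_result = run_until_failure(demo_cmd, max_attempts=3)
if demo_result is not None:
    attempt, output = demo_result
    print(f"failed on attempt {attempt}")
```

Keeping the combined output of the failing run makes it easy to tell which of the two NotImplementedError messages came up that time.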