illinois-ceesd / mirgecom

MIRGE-Com is the workhorse simulation application for the Center for Exascale-Enabled Scramjet Design at the University of Illinois.

Production driver y2-isolator fails in parallel on Lassen #662

Open MTCam opened 2 years ago

MTCam commented 2 years ago

The y2 production 3D injection case with combustion fails on Lassen when run on more than 1 rank. After compiling for a while, the run stops with this error:

2022-05-09 06:15:34,123 - INFO - pyopencl - build program: kernel '_pt_kernel' was part of a lengthy source build resulting from a binary cache miss (3.16 s)
/p/gpfs1/mtcampbe/CEESD/AutomatedTesting/MIRGE-Timing/timing/emirge/miniforge3/envs/nozzle.lazy.timing.env/lib/python3.9/site-packages/pyopencl/invoker.py:366: UserWarning: Kernel '_pt_kernel_0' has 468 arguments with a total size of 3744 bytes, which approaches the limit of 4352 bytes on <pyopencl.Device 'Tesla V100-SXM2-16GB' on 'Portable Computing Language' at 0x102484bd8>. This might lead to compilation errors, especially on GPU devices.
(..... bunch of stuff ....)
CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed

To reproduce this problem:

  1. Install mirgecom@production
  2. Install drivers_y2-isolator@slim-faster
  3. Run drivers_y2-isolator/smoke_test_injection_3d

Some env prep:

  export PYOPENCL_CTX="port:tesla"
  export XDG_CACHE_HOME="/tmp/mtcampbe/xdg-scratch"
  export POCL_CACHE_DIR_ROOT="/tmp/$USER/pocl-cache"

Then these two commands prep and run the case, respectively:

  jsrun -g 1 -a 1 -n 2 bash -c 'POCL_CACHE_DIR=$POCL_CACHE_DIR_ROOT/$$ python -O -m mpi4py ./isolator_injection_init.py -i run_params.yaml --lazy'
  jsrun -g 1 -a 1 -n 2 bash -c 'POCL_CACHE_DIR=$POCL_CACHE_DIR_ROOT/$$ python -O -m mpi4py ./isolator_injection_run.py -i run_params.yaml -r restart_data/isolator_init-000000 --log --lazy'
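
In the commands above, `$$` expands to the PID of each `bash -c` shell, so every rank gets its own pocl cache directory under `POCL_CACHE_DIR_ROOT`. A minimal Python sketch of the same idea follows, assuming one wanted to set this from inside a driver instead of on the command line. The helper name `per_process_pocl_cache` and the default root are hypothetical and not part of mirgecom or the drivers.

```python
import os
import tempfile


def per_process_pocl_cache(root=None, pid=None):
    """Create a cache directory unique to this process and export it.

    Hypothetical helper mirroring POCL_CACHE_DIR=$POCL_CACHE_DIR_ROOT/$$.
    """
    if root is None:
        # Assumed default; the issue uses /tmp/$USER/pocl-cache instead.
        root = os.path.join(tempfile.gettempdir(), "pocl-cache")
    pid = os.getpid() if pid is None else pid
    cache_dir = os.path.join(root, str(pid))
    os.makedirs(cache_dir, exist_ok=True)
    # Assumption: this has to happen before any OpenCL context is created,
    # since pocl is expected to read the variable when it initializes.
    os.environ["POCL_CACHE_DIR"] = cache_dir
    return cache_dir
```

Keying the directory on the PID keeps ranks that share a node from reading or writing each other's cached binaries, at the cost of a cold cache for every new process.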

FYI, here's the sub-pkg info from this run:

*** Requirements file with current emirge module versions
# requirements.txt created by version.sh
# Date: Mon May  9 06:50:05 PDT 2022
# Host: lassen33.coral.llnl.gov [Linux lassen33 4.14.0-115.35.1.3chaos.ch6a.ppc64le #1 SMP Wed Jul 21 17:12:16 PDT 2021 ppc64le ppc64le ppc64le GNU/Linux]
# Python: /p/gpfs1/mtcampbe/CEESD/AutomatedTesting/MIRGE-Timing/timing/emirge/miniforge3/envs/nozzle.lazy.timing.env/bin/python [Python 3.9.12]
--editable git+https://github.com/kaushikcfd/arraycontext.git@0e7d287#egg=arraycontext
--editable git+https://github.com/inducer/dagrt.git@v2021.1-10-gfddc6f6#egg=dagrt
--editable git+https://github.com/kaushikcfd/feinsum.git@2d9c2bb#egg=feinsum
--editable git+https://github.com/kaushikcfd/grudge.git@v2021.1-521-g8c7bd6b#egg=grudge
--editable git+https://github.com/inducer/leap.git@60773c1#egg=leap
--editable git+https://github.com/illinois-ceesd/logpyle.git@v2021.0-7-g42e89c8#egg=logpyle
--editable git+https://github.com/kaushikcfd/loopy.git@40248606#egg=loopy
--editable git+https://github.com/kaushikcfd/meshmode.git@3357afc#egg=meshmode
--editable git+https://github.com/illinois-ceesd/mirgecom@a81ffff#egg=mirgecom
--editable git+https://github.com/inducer/modepy.git@v2021.1-73-g92fc396#egg=modepy
--editable git+https://github.com/inducer/pymbolic.git@v2022.1#egg=pymbolic
--editable git+https://github.com/ecisneros8/pyrometheus.git@2136368#egg=pyrometheus
--editable git+https://github.com/kaushikcfd/pytato.git@17f8866#egg=pytato

matthiasdiener commented 2 years ago

Following these instructions currently results in the following error, so I can't debug this further:

  File "/shared/home/mdiener/Work/efuse2/mirgecom/mirgecom/integrators/lsrk.py", line 66, in euler_step
    return lsrk_step(EulerCoefs, state, t, dt, rhs)
  File "/shared/home/mdiener/Work/efuse2/mirgecom/mirgecom/integrators/lsrk.py", line 53, in lsrk_step
    k = coefs.A[i]*k + dt*rhs(t + coefs.C[i]*dt, state)
  File "/shared/home/mdiener/Work/efuse2/arraycontext/arraycontext/impl/pytato/compile.py", line 312, in __call__
    output_template = self.f(
  File "./isolator_injection_run.py", line 1097, in my_rhs
    ns_operator(discr, state=fluid_state, time=t, boundaries=boundaries,
  File "/shared/home/mdiener/Work/efuse2/mirgecom/mirgecom/navierstokes.py", line 400, in ns_operator
    viscous_flux_on_element_boundary(
  File "/shared/home/mdiener/Work/efuse2/mirgecom/mirgecom/viscous.py", line 429, in viscous_flux_on_element_boundary
    sum(_fvisc_divergence_flux_boundary(
  File "/shared/home/mdiener/Work/efuse2/mirgecom/mirgecom/viscous.py", line 429, in <genexpr>
    sum(_fvisc_divergence_flux_boundary(
  File "/shared/home/mdiener/Work/efuse2/mirgecom/mirgecom/viscous.py", line 412, in _fvisc_divergence_flux_boundary
    return project(
  File "/shared/home/mdiener/Work/efuse2/grudge/grudge/projection.py", line 68, in project
    return map_array_container(
  File "/shared/home/mdiener/Work/efuse2/arraycontext/arraycontext/container/traversal.py", line 238, in map_array_container
    return deserialize_container(ary, [
  File "/shared/home/mdiener/Work/efuse2/arraycontext/arraycontext/container/traversal.py", line 239, in <listcomp>
    (key, f(subary)) for key, subary in iterable
  File "/shared/home/mdiener/Work/efuse2/grudge/grudge/projection.py", line 72, in project
    return dcoll.connection_from_dds(src, tgt)(vec)
  File "/shared/home/mdiener/Work/efuse2/meshmode/meshmode/discretization/connection/direct.py", line 573, in __call__
    check_dofarray_against_discr(self.from_discr, ary)
  File "/shared/home/mdiener/Work/efuse2/meshmode/meshmode/dof_array.py", line 869, in check_dofarray_against_discr
    raise InconsistentDOFArray(
meshmode.dof_array.InconsistentDOFArray: DOFArray group 0 array has unexpected shape. (observed: (186644, 3), expected: (16293, 3))

MTCam commented 2 years ago

> Following these instructions currently results in the following error, so I can't debug this further:

Sorry about that. Please pull mirgecom@production and try again. This off-driver was broken by changes in the fluxing infrastructure and should be fixed now.

matthiasdiener commented 2 years ago

Pulling in https://github.com/inducer/loopy/pull/602 should fix this.

MTCam commented 2 years ago

> Pulling in inducer/loopy#602 should fix this.

I spoke too soon about this earlier. Your branch does fix the 2-rank issue, but running with more than 2 ranks still fails in the same way as before. The warning printed just before the JIT failure is this one:

/p/gpfs1/mtcampbe/CEESD/AutomatedTesting/MIRGE-Timing/timing/emirge/miniforge3/envs/isolator.lazy.timing.env/lib/python3.9/site-packages/pyopencl/invoker.py:366: UserWarning: Kernel '_pt_kernel_1' has 505 arguments with a total size of 4040 bytes, which approaches the limit of 4352 bytes on <pyopencl.Device 'Tesla V100-SXM2-16GB' on 'Portable Computing Language' at 0x10256f1f8>. This might lead to compilation errors, especially on GPU devices.
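
The two warnings quoted in this thread are consistent with 8 bytes per kernel argument (3744 / 468 = 4040 / 505 = 8), which would put the 4352-byte limit at 544 arguments of that size. Here is a small sketch of that arithmetic, assuming every argument has the same size. The function name is illustrative only.

```python
def arg_headroom(num_args, total_bytes, limit_bytes):
    """Return (bytes_per_arg, max_args_at_that_size, args_remaining)."""
    bytes_per_arg, remainder = divmod(total_bytes, num_args)
    if remainder:
        raise ValueError("arguments are not uniformly sized")
    max_args = limit_bytes // bytes_per_arg
    return bytes_per_arg, max_args, max_args - num_args


# Figures taken from the two pyopencl warnings in this thread.
for name, n_args, size in [("_pt_kernel_0", 468, 3744),
                           ("_pt_kernel_1", 505, 4040)]:
    per_arg, max_args, left = arg_headroom(n_args, size, 4352)
    print(f"{name}: {per_arg} B/arg, limit {max_args} args, {left} to spare")
```

By this count both kernels are still under the limit (76 and 39 arguments to spare), so the warning by itself does not establish that argument size causes the PTX JIT failure. It only shows that the run on more ranks sits closer to the limit.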