`build program` times increasing with rank count on Mac when caching is enabled

majosm commented 2 months ago

The times reported by the build program: kernel '<name>' was part of a lengthy source build resulting from a binary cache miss (<time>) output appear to increase fairly dramatically with rank count on my Mac with caching enabled, even when using rank-local cache directories. For example, when running the wave-op-mpi example in grudge, with 16 ranks and caching disabled via PYOPENCL_NO_CACHE=1, I see:

INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.31 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.31 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.32 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.32 s)

With caching enabled (and empty cache) I see:

INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (0.49 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (0.68 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.94 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (3.28 s)

(Note: rhs is missing from the first output, presumably because the time is below the output threshold. The lack of frozen_nodes0_2d in the second output is confusing though.)

If I increase to 16 ranks, with no caching I see:

INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.52 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.54 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.56 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.55 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.55 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.56 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.53 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.56 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.51 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.53 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.56 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.53 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.51 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.52 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.59 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.61 s)

(again no rhs). And with caching I see:

... truncated ...
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.07 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.14 s)
INFO:pyopencl:build program: kernel 'frozen_result' was part of a lengthy source build resulting from a binary cache miss (1.18 s)
INFO:pyopencl:build program: kernel 'frozen_result' was part of a lengthy source build resulting from a binary cache miss (1.11 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.13 s)
INFO:pyopencl:build program: kernel 'frozen_result' was part of a lengthy source build resulting from a binary cache miss (1.19 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.21 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.22 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.18 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (4.17 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (3.31 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (4.31 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.20 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (3.27 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.22 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (4.19 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.84 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.82 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (3.13 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (9.33 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (10.46 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (10.64 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (11.14 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (11.09 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (11.25 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (11.41 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (12.45 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (12.55 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (13.31 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (12.81 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (12.98 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (14.04 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (14.83 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (14.18 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (14.30 s)
... truncated ...

(full build program output here).

If I profile with pyinstrument, I see an increase in time spent in Program.build inside grudge's _DistributedCompiledFunction.__call__. Here's the profiling output without caching:

               │     ├─ 38.014 _DistributedCompiledFunction.__call__  grudge/array_context.py:413
               │     │  ├─ 23.718 execute_distributed_partition  pytato/distributed/execute.py:103
               │     │  │  ├─ 14.778 wait_for_some_recvs  pytato/distributed/execute.py:185
               │     │  │  │  ├─ 12.957 [self]  pytato/distributed/execute.py
               │     │  │  │  └─ 1.806 to_device  pyopencl/array.py:2329
               │     │  │  └─ 8.359 exec_ready_part  pytato/distributed/execute.py:164
               │     │  │     ├─ 4.575 BoundPyOpenCLExecutable.__call__  pytato/target/loopy/__init__.py:305
               │     │  │     │  └─ 4.350 PyOpenCLExecutor.__call__  loopy/target/pyopencl_execution.py:349
               │     │  │     │     ├─ 2.344 wrapper  pytools/__init__.py:768
               │     │  │     │     │  └─ 2.339 PyOpenCLExecutor.translation_unit_info  loopy/target/pyopencl_execution.py:302
               │     │  │     │     └─ 1.935 PicklableFunction.__call__  pytools/py_codegen.py:147
               │     │  │     └─ 3.544 Array.get  pyopencl/array.py:890
               │     │  │           [4 frames hidden]  pyopencl, <built-in>

and with:

               │     ├─ 49.067 _DistributedCompiledFunction.__call__  grudge/array_context.py:413
               │     │  ├─ 38.637 execute_distributed_partition  pytato/distributed/execute.py:103
               │     │  │  ├─ 20.316 exec_ready_part  pytato/distributed/execute.py:164
               │     │  │  │  ├─ 16.636 BoundPyOpenCLExecutable.__call__  pytato/target/loopy/__init__.py:305
               │     │  │  │  │  └─ 16.391 PyOpenCLExecutor.__call__  loopy/target/pyopencl_execution.py:349
               │     │  │  │  │     ├─ 14.543 wrapper  pytools/__init__.py:768
               │     │  │  │  │     │  └─ 14.537 PyOpenCLExecutor.translation_unit_info  loopy/target/pyopencl_execution.py:302
               │     │  │  │  │     │     └─ 12.608 Program.build  pyopencl/__init__.py:505
               │     │  │  │  │     │           [6 frames hidden]  pyopencl, <built-in>
               │     │  │  │  │     └─ 1.775 PicklableFunction.__call__  pytools/py_codegen.py:147
               │     │  │  │  └─ 3.462 Array.get  pyopencl/array.py:890
               │     │  │  │        [4 frames hidden]  pyopencl, <built-in>
               │     │  │  └─ 17.774 wait_for_some_recvs  pytato/distributed/execute.py:185
               │     │  │     ├─ 15.945 [self]  pytato/distributed/execute.py
               │     │  │     └─ 1.806 to_device  pyopencl/array.py:2329
               │     │  └─ 8.677 _args_to_device_buffers  arraycontext/impl/pytato/compile.py:524
               │     │     ├─ 6.204 MPIFusionContractorArrayContext.freeze  arraycontext/impl/pytato/__init__.py:429
               │     │     │  └─ 6.112 PyCapsule.wait  <built-in>
               │     │     └─ 2.286 to_device  pyopencl/array.py:2329
               │     │           [4 frames hidden]  pyopencl, <built-in>

Here's the script I'm using to run the example:

#!/bin/bash

if [[ -n "$OMPI_COMM_WORLD_NODE_RANK" ]]; then
    # Open MPI
    RANK_ID="rank${OMPI_COMM_WORLD_RANK}"
elif [[ -n "$MPI_LOCALRANKID" ]]; then
    # mpich/mvapich
    RANK_ID="rank${MPI_LOCALRANKID}"
fi

export POCL_CACHE_DIR=".cache/pocl_${RANK_ID}"
export XDG_CACHE_HOME=".cache/xdg_${RANK_ID}"

python -m mpi4py wave-op-mpi.py --lazy
# pyinstrument -o "pyinstrument/${RANK_ID}.txt" -m mpi4py wave-op-mpi.py --lazy

(run with rm -rf .cache && mpiexec -n 4 bash run.sh.)

I haven't been able to try running this on Lassen yet to see if I get the same behavior there; I'm currently running into some environment issues.

cc @matthiasdiener

majosm commented 2 months ago

Here's a breakdown of what's happening inside Program.build:

      │     ├─ 11.333 Program.build  pyopencl/__init__.py:505
      │     │  └─ 11.333 Program._build_and_catch_errors  pyopencl/__init__.py:554
      │     │     └─ 11.333 <lambda>  pyopencl/__init__.py:536
      │     │        └─ 11.333 create_built_program_from_source_cached  pyopencl/cache.py:489
      │     │           └─ 11.333 _create_built_program_from_source_cached  pyopencl/cache.py:341
      │     │              ├─ 11.186 PyCapsule.get_info  <built-in>
      │     │              ├─ 0.145 _Program.program_build  pyopencl/__init__.py:735
      │     │              │  └─ 0.145 PyCapsule._build  <built-in>
      │     │              └─ 0.001 retrieve_from_cache  pyopencl/cache.py:265
      │     │                 └─ 0.001 isdir  <frozen genericpath>:39
      │     │                    └─ 0.001 stat  <built-in>

The slowdown appears to be coming from these calls. Timing the two separately, it looks like the second one specifically is to blame.

matthiasdiener commented 2 months ago

I think those get_info calls just trigger the actual build downstream (ie., pocl). Do they not show up in the uncached build (maybe in a different spot)?

majosm commented 2 months ago

Based on @matthiasdiener's comment and our discussion this morning, I made some more measurements, this time on the whole compile time. Specifically, I compared the first step time of grudge wave for:

Caching enabled (i.e., not setting PYOPENCL_NO_CACHE). This is the path that calls create_built_program_from_source_cached and reads/writes cache. Note: For this test I disabled cache reading to simulate a completely cold cache (and eliminate cache reads resulting from cache writes in the same execution, which somehow does seem to happen).
Caching disabled (setting PYOPENCL_NO_CACHE=1). This path just calls prg.build(...) directly.

If I understand correctly, the main time difference between these should come down to the cache writing time. Here's what I see (same setup as before, with rank-local cache dirs; also, I am manually applying the changes from #716, which don't seem to have made it to the version on conda yet):

plot1

The scaling is not good, but could be due to DAG splat. Additionally, it seems as if the cache writing is taking a lot of time. However, if I add a (unused) call to get_info(BINARIES) in the non-cache version I see this:

plot2

which suggests that most of the time is coming from the get_info call, not the actual cache writing. Does this make sense? Is get_info(BINARIES) doing something inefficient?

inducer commented 2 months ago

Is get_info(BINARIES) doing something inefficient?

It sure looks that way. It might require duplicate compilation in pocl? (I'm not sure where, but your second graph is enough for me.) Based on this, I think we should definitely turn off pyopencl's CL binary caching for pocl. PR?

It might also be worthwhile to understand what pocl is doing under the hood.

matthiasdiener commented 2 months ago

I think what happens is the following:

Example pyopencl code:

import numpy as np

import pyopencl as cl
import pyopencl.array as cl_array

rng = np.random.default_rng()
a = rng.random(50000, dtype=np.float32)
b = rng.random(50000, dtype=np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl_array.to_device(queue, a)
b_dev = cl_array.to_device(queue, b)
dest_dev = cl_array.empty_like(a_dev)

prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a,
    __global const float *b, __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] + b[gid];

    """ + "c[gid] = a[gid] + b[gid];"*1000 + "}"
    ).build()

knl = prg.sum  # Use this Kernel object for repeated calls
knl(queue, a.shape, None, a_dev.data, b_dev.data, dest_dev.data)

assert np.allclose(dest_dev.get(), a + b)

Without get_info(BINARIES) (i.e., cache disabled): only 1 kernel gets compiled by pocl (via pocl_llvm_codegen):
```
./xdg-cache/pocl/kcache/CJ/MFPEIKHNIJCGFPPDEFBAJGFMDDFFEFBLDLAHM/sum/2000-1-1-goffs0-smallgrid/sum.so
```

When get_info(BINARIES)(i.e., cache enabled) is called, it also compiles a generic version via pocl_driver_build_poclbinary

./xdg-cache/pocl/kcache/CJ/MFPEIKHNIJCGFPPDEFBAJGFMDDFFEFBLDLAHM/sum/2000-1-1-goffs0-smallgrid/sum.so
./xdg-cache/pocl/kcache/CJ/MFPEIKHNIJCGFPPDEFBAJGFMDDFFEFBLDLAHM/sum/0-0-0/sum.so

I haven't found a way to disable this behavior.

inducer commented 2 months ago

Thanks for doing more digging here, @matthiasdiener! While we didn't decode that a "generic" kernel was being built, we did track down pocl_driver_build_poclbinary and concluded that it would likely trigger a compile and that, given @majosm's measurements, that compile was in addition to the "normal" from-source-for-execution build.

Important question: are all these conclusions still valid for the Nvidia target? They seem device-unspecific, but I don't know how a generic kernel would be different from a size-specific one in the GPU case.

At any rate, at least for CPU, we can probably save time by skipping pyopencl's binary cache if we're working with pocl.

majosm commented 2 months ago

Seems like the time spent in get_info(BINARIES) is much higher for CPUs than it is for GPUs. For combozzle on Lassen I'm seeing sub-millisecond times when running on the GPU, and up to 40s when running on the CPU.

inducer / pyopencl

`build program` times increasing with rank count on Mac when caching is enabled #731