google / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

Resharding across MGPUs results in long series of cuMemAlloc_v2 calls #18666

Open andy-kogsys opened 9 months ago

andy-kogsys commented 9 months ago

Description

I'm implementing a time-marching simulation across multiple GPUs. The calculation has field arrays sharded along one axis and operators sharded along another (the actual implementation involves FFTs constrained using experimental.custom_partitioning, but I've omitted that here for simplicity). I use lower and compile for AOT compilation to make it easy to benchmark the actual runtime.
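(For context, the omitted FFT constraint follows roughly the custom_partitioning pattern from the JAX docs. The skeleton below is a sketch only; the def_partition callback signatures shown are an assumption and have changed between JAX versions.)

from jax.experimental.custom_partitioning import custom_partitioning

# sketch: an FFT declared shardable along the non-transform axis, so the
# partitioner can run a per-shard FFT instead of gathering the operand
@custom_partitioning
def sharded_fft(x):
    return jnp.fft.fft(x, axis=-1)

def infer_sharding_from_operands(mesh, arg_shapes, result_shape):
    # assumption: the result keeps the first operand's sharding
    return arg_shapes[0].sharding

def partition(mesh, arg_shapes, result_shape):
    # per-shard lowering; valid only while the transform axis is unsharded
    def lower_fn(x):
        return jnp.fft.fft(x, axis=-1)
    return mesh, lower_fn, arg_shapes[0].sharding, (arg_shapes[0].sharding,)

sharded_fft.def_partition(
    infer_sharding_from_operands=infer_sharding_from_operands,
    partition=partition)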

What I'm seeing is that on first execution, a long time is spent executing a series of cuMemAlloc_v2 calls on each device, across a series of streams of the form Stream #N(Memset). The time this takes appears to grow with the square of the number of GPUs. For the minimal example below, I see the behavior shown in the attached trace screenshots.

My questions:

  1. Are these calls expected?
  2. If so, is it expected that they should take so long?
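For reference, the resharding itself can be confirmed in the optimized HLO; a quick check (a sketch, using the compiled object and arrays from the example below):

# sketch: look for resharding collectives in the optimized module
hlo = compiled.as_text()
print([ln for ln in hlo.splitlines()
       if "all-to-all" in ln or "collective-permute" in ln])

# the two layouts can also be inspected per-array
jax.debug.visualize_array_sharding(flda)  # split along axis 1
jax.debug.visualize_array_sharding(opa)   # split along axis 0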

Below is a minimal example. In practice I'm using donate_argnames and specifying in_shardings and out_shardings, but I have omitted them here for brevity.
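The omitted configuration looks roughly like this (a sketch; in_shardings would follow the same tuple-of-shardings pattern as out_shardings):

# sketch of the omitted settings: donate the carry buffers and pin the
# output shardings so the compiler doesn't have to infer them
run_jit = jit(
    run,
    static_argnames=("nt",),
    donate_argnames=("carry",),
    out_shardings=(shard_y, shard_y, shard_x, shard_x),
)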

Any help would be much appreciated!

Minimal example:

import jax
import jax.numpy as jnp
from jax import jit
from jax.lax import fori_loop
from jax.experimental import mesh_utils
from jax.sharding import Mesh, PartitionSpec as P, NamedSharding
from time import perf_counter

# inputs
ngpu = 2
dims = (8192, 8192)
nt = 100

# define a single time step
def run_step(i, carry):

    # unpack fields
    flda, fldb, opa, opb = carry

    # calculations: each product mixes an operator sharded along axis 0
    # with a field sharded along axis 1, so the result must be resharded
    fldb = opa * flda
    flda = opb * fldb

    return (flda, fldb, opa, opb)

# define run function
def run(nt, carry):
    return fori_loop(0, nt, run_step, carry)

# create mesh
devices = mesh_utils.create_device_mesh((ngpu,), jax.devices()[0:ngpu])
mesh = Mesh(devices, axis_names=("gpus",))
shard_y = NamedSharding(mesh, P(None, "gpus"))
shard_x = NamedSharding(mesh, P("gpus", None))

# begin trace
with jax.profiler.trace("./tensorboard"):

    # create operators & fields
    tbeg = perf_counter()
    opa = jax.device_put(jnp.ones(dims), shard_x)
    opb = jax.device_put(jnp.ones(dims), shard_x)
    flda = jax.device_put(jnp.ones(dims), shard_y)
    fldb = jax.device_put(jnp.ones(dims), shard_y)
    carry = (flda, fldb, opa, opb)
    trun = perf_counter() - tbeg
    print(f"Array creation time:\t{1e3*trun:8.1f} ms")

    # compile
    tbeg = perf_counter()
    run_jit = jit(run, static_argnames=("nt",))
    lowered = run_jit.lower(nt, carry)
    compiled = lowered.compile()
    trun = perf_counter() - tbeg
    print(f"Compile time:\t\t{1e3*trun:8.1f} ms")

    # single step warmup run
    tbeg = perf_counter()
    warmup = run_jit(1, carry)
    jax.block_until_ready(warmup)
    trun = perf_counter() - tbeg
    print(f"Single step run time:\t{1e3*trun:8.1f} ms")

    # run the full calculation
    tbeg = perf_counter()
    flda, fldb, _, _ = run_jit(nt, carry)
    flda.block_until_ready()
    trun = perf_counter() - tbeg
    print(f"Run time:\t\t{1e3*trun:8.1f} ms")

Example trace: (two trace screenshots attached)

What jax/jaxlib version are you using?

jax==0.4.20, jaxlib==0.4.20+cuda11.cudnn86

Which accelerator(s) are you using?

GPU (16x Nvidia A100 40GB, but can recreate on 2x)

Additional system info?

numpy: 1.26.2
Python: 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]
Platform: uname_result(system='Linux', node='a2-mega-v1', release='5.10.0-26-cloud-amd64', version='#1 SMP Debian 5.10.197-1 (2023-09-29)', machine='x86_64')

NVIDIA GPU info

NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0
16x NVIDIA A100-SXM (40960MiB each): persistence mode on, MIG disabled, compute mode Default, all idle (0% utilization, 2MiB memory in use).
No running processes found.

andy-kogsys commented 9 months ago

Any thoughts on this?