jax-ml / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

How do you remat GSPMD inserted all-gathers? #25010

Open ptoulme-aws opened 2 days ago

ptoulme-aws commented 2 days ago

Problem: I have some JAX code that does sequence parallelism, somewhat similar to this:

activation = jax.lax.with_sharding_constraint(activation, NamedSharding(mesh, PartitionSpec('data', 'tensor', None)))
activation = norm(activation)
activation = jax.lax.with_sharding_constraint(activation, NamedSharding(mesh, PartitionSpec(None, 'tensor', None)))
# I want to remat this one ^
activation = attention(activation)
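
For context, a self-contained version of this pattern would look roughly like the sketch below; the mesh shape, array sizes, and the toy norm/attention bodies are placeholders for illustration, not my actual model.

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Assumes 8 devices arranged as a (data=2, tensor=4) mesh; adjust to your topology.
devices = mesh_utils.create_device_mesh((2, 4))
mesh = Mesh(devices, axis_names=('data', 'tensor'))

def norm(x):
    # toy RMS-style norm, stand-in for the real layer
    return x / (jnp.sqrt(jnp.mean(x * x, axis=-1, keepdims=True)) + 1e-6)

def attention(x):
    # toy full-sequence contraction, stand-in for real attention
    scores = jnp.einsum('sbh,tbh->bst', x, x)
    return jnp.einsum('bst,tbh->sbh', jax.nn.softmax(scores, axis=-1), x)

@jax.jit
def forward(activation):
    # sequence dim (0) sharded over 'data', dim 1 over 'tensor'
    activation = jax.lax.with_sharding_constraint(
        activation, NamedSharding(mesh, PartitionSpec('data', 'tensor', None)))
    activation = norm(activation)
    # un-shard the sequence dim: GSPMD inserts the all-gather here
    activation = jax.lax.with_sharding_constraint(
        activation, NamedSharding(mesh, PartitionSpec(None, 'tensor', None)))
    return attention(activation)

out = forward(jnp.zeros((128, 8, 512)))  # (seq, batch, hidden); shapes are placeholders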

I have tried everything I can think of to remat the activation directly before attention, including JAX remat policies and explicitly applying jax.checkpoint around that exact computation, but nothing seems to make it remat. The activation directly before attention is produced by a GSPMD-inserted all-gather on the sequence dimension (dim=0).

I ended up writing an XLA pass to rematerialize large all-gathers and submitted it as a PR: https://github.com/openxla/xla/pull/19163

Question: Is this possible to do from the JAX end, or is my pass really needed?

mattjj commented 2 days ago

Thanks for the question.

No, I don't think a new pass is needed.

As I understand it, the standard way to spell this is to use a remat policy to mark the with_sharding_constraint that induces the all-gather as not-saveable. One way to do that would be to use save_only_these_names and to name only other arrays (either upstream of the all-gather-inducing with_sharding_constraint, or downstream of the operations that use the output of attention). Following your snippet, that might look something like:

activation = jax.lax.with_sharding_constraint(activation, NamedSharding(mesh, PartitionSpec('data', 'tensor', None)))
activation = checkpoint_name(norm(activation), 'scattered_activations')
activation = jax.lax.with_sharding_constraint(activation, NamedSharding(mesh, PartitionSpec(None, 'tensor', None)))
activation = attention(activation)

together with a save_only_these_names policy that mentions 'scattered_activations' or something upstream of it.
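
Concretely, the wiring might look something like the sketch below (reusing the mesh, norm, and attention from your snippet above; the wrapper function and shapes are just placeholders):

from functools import partial

import jax
import jax.numpy as jnp
from jax.ad_checkpoint import checkpoint_name
from jax.sharding import NamedSharding, PartitionSpec

# Anything not named 'scattered_activations' is not saveable under this policy,
# so the all-gathered activation should be rematerialized in the backward pass.
policy = jax.checkpoint_policies.save_only_these_names('scattered_activations')

@partial(jax.checkpoint, policy=policy)
def block(activation):
    # mesh, norm, and attention assumed defined as in the snippet above
    activation = jax.lax.with_sharding_constraint(
        activation, NamedSharding(mesh, PartitionSpec('data', 'tensor', None)))
    activation = checkpoint_name(norm(activation), 'scattered_activations')
    activation = jax.lax.with_sharding_constraint(
        activation, NamedSharding(mesh, PartitionSpec(None, 'tensor', None)))
    return attention(activation)

# differentiate through it as usual
grads = jax.grad(lambda x: block(x).sum())(jnp.zeros((128, 8, 512)))

The intent is that only the named residual can be saved, so the output of the all-gather-inducing with_sharding_constraint should be recomputed on the backward pass rather than stored.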

Did you try something like that? If you already tried it, we should put together a minimal example to debug what's going on.