google / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

Choice of ops for stencil computations run on TPUs #3341

hubertlu-tw commented 4 years ago

Hi, I am currently investigating the possibility of using JAX for scientific computing on TPUs. The excellent Wave Equation tutorial (https://github.com/google/jax/blob/master/cloud_tpu_colabs/Wave_Equation.ipynb) helped me quickly understand how to use JAX and what its advantages are. However, one question I have is why convolution-based ops perform so much worse than element-wise ops for stencil computations on Cloud TPU.

To take advantage of the compute power of the MXU on the TPU, I implemented the stencil computation with two 1D convolution ops as an alternative to the element-wise ops. The following snippets both compute a 5-point stencil for 2D problems.

Element-wise ops:

# shift() is the axis-shift helper defined in the Wave_Equation notebook
left = shift(array, +1, axis=0)
right = shift(array, -1, axis=0)
up = shift(array, +1, axis=1)
down = shift(array, -1, axis=1)
# 5-point Laplacian: sum of the four neighbors minus 4x the center
convolved = left + right + up + down - 4 * array
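For completeness, here is a minimal self-contained version of the element-wise variant. I use a jnp.roll-based shift here (the notebook's helper handles boundaries slightly differently), so treat it as a sketch rather than an exact copy:

import jax
import jax.numpy as jnp

def shift(array, offset, axis):
    # Periodic shift via jnp.roll; stands in for the notebook's helper.
    return jnp.roll(array, offset, axis=axis)

@jax.jit
def laplacian_elementwise(array):
    left = shift(array, +1, axis=0)
    right = shift(array, -1, axis=0)
    up = shift(array, +1, axis=1)
    down = shift(array, -1, axis=1)
    return left + right + up + down - 4 * array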

Convolution-based ops:

from jax import lax

# array is assumed to be in NHWC layout, i.e. shaped (1, H, W, 1)
col_F = make_kernel([[1., -2., 1.]])   # 1x3 kernel (column direction)
row_F = make_kernel([[1.],
                     [-2.],
                     [1.]])            # 3x1 kernel (row direction)
dn_col = lax.conv_dimension_numbers(array.shape, col_F.shape, ('NHWC', 'HWIO', 'NHWC'))
dn_row = lax.conv_dimension_numbers(array.shape, row_F.shape, ('NHWC', 'HWIO', 'NHWC'))
col_ops = lax.conv_general_dilated(array, col_F, (1, 1), 'SAME', (1, 1), (1, 1), dn_col)
row_ops = lax.conv_general_dilated(array, row_F, (1, 1), 'SAME', (1, 1), (1, 1), dn_row)
convolved = (col_ops + row_ops)[0, :, :, 0]
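For reference, make_kernel is a small helper (not part of JAX) that puts the weights into HWIO layout; a plausible definition consistent with the dimension numbers above:

def make_kernel(weights):
    # Reshape a 2D list of weights to HWIO: (height, width, in_channels=1, out_channels=1).
    kernel = jnp.array(weights, dtype=jnp.float32)
    return kernel[:, :, jnp.newaxis, jnp.newaxis]
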
jekbradbury commented 4 years ago

The MXU needs dot products of length at least 128 for full throughput; for a convolution, that's the product of all kernel size dimensions and the input feature count. In your case the dot product length is only 3, so you can use at best 3/128 of the MXU flops (and likely even less since you have low arithmetic intensity). Unrolled elementwise computations, as in your first snippet, are typically a better way to implement finite difference/stencil convolutions.
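To put numbers on that (my arithmetic, just applying the rule above to the 1D kernels in question):

kh, kw, c_in = 1, 3, 1          # kernel height, kernel width, input feature count
dot_len = kh * kw * c_in        # = 3, the dot-product length the MXU sees
peak_fraction = dot_len / 128   # ~0.023, so at most ~2.3% of peak MXU throughput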

I'm not sure how well it works with JAX on Cloud TPU right now, but the Cloud TPU Profiler can be useful in figuring out how XLA compiles operations for the hardware, and how well they utilize MXU flops and memory bandwidth.
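Even without the profiler, a quick wall-clock comparison is straightforward. A sketch, assuming laplacian_elementwise from above and a hypothetical jitted laplacian_conv that wraps the convolution snippet and takes the same 2D array:

import time

x = jnp.ones((4096, 4096), dtype=jnp.float32)
for name, fn in [("element-wise", laplacian_elementwise),
                 ("convolution", laplacian_conv)]:
    fn(x).block_until_ready()    # compile and warm up outside the timed loop
    t0 = time.perf_counter()
    for _ in range(100):
        out = fn(x)
    out.block_until_ready()      # JAX dispatch is asynchronous; sync before stopping the clock
    print(f"{name}: {(time.perf_counter() - t0) / 100 * 1e3:.3f} ms per step")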