jipolanco / PencilArrays.jl

Distributed Julia arrays using the MPI protocol
https://jipolanco.github.io/PencilArrays.jl/dev/
MIT License
60 stars, 8 forks

Broadcasting and scalar indexing in the case of CuArrays #42

Closed corentin-dev closed 2 years ago

corentin-dev commented 2 years ago

Hi, good progress has been made in terms of GPU support, but some problems still arise when using PencilArrays. For example, the following code fails:

## MPI
using MPI

using PencilArrays

MPI.Init()
comm = MPI.COMM_WORLD
rank_ = MPI.Comm_rank(comm)
size_ = MPI.Comm_size(comm)

topo_dims2 = [0]
MPI.Dims_create!(size_,topo_dims2)
topo2 = MPITopology(comm, Tuple(topo_dims2))
pen2_x = Pencil(topo2, (128,128))

xp2p = PencilArray(pen2_x, Array{Complex{Float32}}(undef, size_local(pen2_x)));

rx,ry = axes(global_view(xp2p));
nxl,nyl = size_local(xp2p);
xxp = Array(reshape(LinRange(0,1,128)[rx],nxl,1));
yyp = Array(reshape(LinRange(0,1,128)[ry],1,nyl));
@. xp2p = sin(2π * xxp)*sin(2π * yyp);

## CUDA
using CUDA

# create pencil
pen2g_x = Pencil(CuArray, topo2, (128,128))
# create array
xp2g = PencilArray{Complex{Float32}}(undef, pen2g_x);

# restriction
rx,ry = axes(global_view(xp2g));
nxl,nyl = size_local(xp2g);
xxg = CuArray(reshape(LinRange(0,1,128)[rx],nxl,1));
yyg = CuArray(reshape(LinRange(0,1,128)[ry],1,nyl));
@. xp2g = sin(2π * xxg)*sin(2π * yyg);

The error mentions materialize!:

 [1] error(s::String)
    @ Base ./error.jl:33
  [2] assertscalar(op::String)
    @ GPUArrays ~/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:53
  [3] getindex(::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64)
    @ GPUArrays ~/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:86
  [4] getindex
    @ ~/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:107 [inlined]
  [5] _broadcast_getindex
    @ ./broadcast.jl:636 [inlined]
  [6] _getindex
    @ ./broadcast.jl:667 [inlined]
  [7] _getindex
    @ ./broadcast.jl:666 [inlined]
  [8] _broadcast_getindex
    @ ./broadcast.jl:642 [inlined]
  [9] _getindex
    @ ./broadcast.jl:667 [inlined]
 [10] _broadcast_getindex
    @ ./broadcast.jl:642 [inlined]
 [11] _getindex
    @ ./broadcast.jl:666 [inlined]
 [12] _broadcast_getindex
    @ ./broadcast.jl:642 [inlined]
 [13] getindex
    @ ./broadcast.jl:597 [inlined]
 [14] macro expansion
    @ ./broadcast.jl:961 [inlined]
 [15] macro expansion
    @ ./simdloop.jl:77 [inlined]
 [16] copyto!
    @ ./broadcast.jl:960 [inlined]
 [17] copyto!
    @ ./broadcast.jl:913 [inlined]
 [18] materialize!
    @ ./broadcast.jl:871 [inlined]
 [19] materialize!(dest::PencilArray{ComplexF32, 2, CuArray{ComplexF32, 2, CUDA.Mem.DeviceBuffer}, 2, 0, Pencil{2, 1, NoPermutation, CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer}}}, bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(sin), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(*), Tuple{Int64, Irrational{:π}}}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(sin), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(*), Tuple{Int64, Irrational{:π}}}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}}}}}})
 [20] @ Base.Broadcast ./broadcast.jl:868

I am not sure why it fails. The same kind of line with a CuArray instead of a PencilArray works.

Thanks a lot! Corentin

jipolanco commented 2 years ago

Hi, indeed it seems to be a problem with broadcasting.

I've made some progress trying to fix this but it's quite tricky, especially when combined with dimension permutations! I'll try to push a fix soon.

corentin-dev commented 2 years ago

Thanks a lot!

corentin-dev commented 2 years ago

I don't know if you are aware, but this commit changed the order for broadcasting:

# size nx ny nz
plan = PencilFFTPlan(pen_array, Transforms.FFT())
x = LinRange(0,2*pi,nx)
y = LinRange(0,2*pi,ny)
z = LinRange(0,2*pi,nz)

# temporary arrays
pen_hat = allocate_output(plan)

# axes
pen_hat_glob = global_view(pen_hat)

# ranges
rx_hat, ry_hat, rz_hat = axes(pen_hat_glob)
nxl_hat, nyl_hat, nzl_hat = size_local(pen_hat)

# range compatible to all types of Array
rrx_hat = rx_hat[begin]:rx_hat[end]
rry_hat = ry_hat[begin]:ry_hat[end]
rrz_hat = rz_hat[begin]:rz_hat[end]

# reshape hat versions of positions
# what used to work
x_hat = reshape(x[rrx_hat],nxl_hat,1,1)
y_hat = reshape(y[rry_hat],1,nyl_hat,1)
z_hat = reshape(z[rrz_hat],1,1,nzl_hat)
@. pen_hat = x_hat * pen_hat

# what works now
x_hat = reshape(x[rrx_hat],1,1,nxl_hat)
y_hat = reshape(y[rry_hat],1,nyl_hat,1)
z_hat = reshape(z[rrz_hat],nzl_hat,1,1)
@. pen_hat = x_hat * pen_hat

The workaround works, but maybe you have some suggestions on how to do it better?

jipolanco commented 2 years ago

I guess on the last line of your example you meant something like

@. pen_hat = x_hat * y_hat * z_hat * pen_hat

which uses the three *_hat arrays?

I'm not that surprised by this change of behaviour, but I agree it's not ideal, and the original version of your example should be the correct one. Sorry about that; I'm not that familiar with numpy-style broadcasting using singleton dimensions, but I see that it can be very convenient and especially useful on GPUs. I'm reopening this issue and I'll try to have this fixed.

Note that the order changes only when working with the output of transforms in PencilFFTs. This is done to make sure that 1D FFTs are always done contiguously in memory. This can be disabled by passing permute_dims = Val(false) to PencilFFTPlan (docs are here). I'm not sure how this will impact performance, or whether it will actually work on CuArrays.
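Concretely, disabling the permutation would look something like this (an untested sketch; `pen_array` follows the earlier example, and as noted above the performance impact on CuArrays is unclear):

```julia
# Sketch: keep input/output index order identical by disabling
# dimension permutations in PencilFFTs. This may cost FFT performance,
# since 1D transforms are no longer guaranteed to be contiguous in memory.
plan = PencilFFTPlan(pen_array, Transforms.FFT(); permute_dims = Val(false))
```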

Another workaround (which will stop working if this issue is fixed) is to do something like:

x_hat = reshape(x[rrx_hat], permutation(pen_hat) * (nxl_hat,1,1))
y_hat = reshape(y[rry_hat], permutation(pen_hat) * (1,nyl_hat,1))
z_hat = reshape(z[rrz_hat], permutation(pen_hat) * (1,1,nzl_hat))

which will automatically permute the array dimensions according to the order of indices in pen_hat.


As a side note, instead of using global_view to get the range of local indices, it would be more convenient to use range_local, which directly returns basic ranges of the form (i1:i2, j1:j2, k1:k2). I'll try to mention that somewhere in the docs.
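As a minimal sketch based on the 3D example above (and leaving aside the permutation question), the `global_view` + `axes` detour could then be replaced by:

```julia
# Sketch: range_local directly returns the global index ranges owned
# by this MPI process, as basic ranges of the form (i1:i2, j1:j2, k1:k2).
rx_hat, ry_hat, rz_hat = range_local(pen_hat)
x_hat = reshape(x[rx_hat], length(rx_hat), 1, 1)
y_hat = reshape(y[ry_hat], 1, length(ry_hat), 1)
z_hat = reshape(z[rz_hat], 1, 1, length(rz_hat))
```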

jipolanco commented 2 years ago

Hi Corentin, I've been playing around with an alternative way of working with grids which may "solve" this issue. The idea is to add a localgrid function that lets you easily work with the local part of the grid (associated with the local MPI process). I've been wanting to do this for a while, and now it seemed like a good opportunity :)

The details are in #45. Basically, in your example, this would allow you to do:

grid = localgrid(pen_hat, (x, y, z))
@. pen_hat = (grid.x + grid.y + grid.z) * pen_hat

Note that one doesn't need to care about possible index permutations and things like that. Everything is taken care of in the grid object. I still want to finish some things before merging #45, but of course let me know if you have any comments or ideas!

corentin-dev commented 2 years ago

This would be really (really!) convenient! Broadcasting is really useful for writing "generic" code without explicit indexing. I guess grid.x will have the same shape as x? Will it work with custom names, like grid = localgrid(pen_hat, (ξx,ξy,ξz)) and then grid.ξx, etc.? It should work with the index notation grid[1] in any case.

corentin-dev commented 2 years ago

Just to be sure, will localgrid return an array of type typeof(x) (or interpretable as such)?

jipolanco commented 2 years ago

I guess grid.x will have the same shape as x?

Yes. The difference is that during broadcasting it would be reshaped similarly to what you did in your example.

Will it work with names like this grid = localgrid(pen_hat, (ξx,ξy,ξz)) and grid.ξx, etc...?

No, in that case it would still be grid.x, etc. It's the same convention used by StaticArrays (e.g. SVector(2, 3, 1).x == 2). I guess changing the names would be possible using macros, but for now I prefer to keep things simple...

Just to be sure, will localgrid return an array of type typeof(x) (or interpretable as such)?

Not sure if I understand the question, but localgrid returns a LocalRectilinearGrid (also defined in #45), which can be indexed as if it was an N-dimensional array. For example grid[2, 3, 5] returns a coordinate (x, y, z).
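Put together, the proposed interface from #45 would be used roughly like this (a sketch assuming the 3D example above; names follow the earlier comments):

```julia
# Sketch of the localgrid interface proposed in #45.
grid = localgrid(pen_hat, (x, y, z))

grid.x          # local x coordinates, reshaped appropriately for broadcasting
grid[2, 3, 5]   # coordinate tuple (x, y, z) at local indices (2, 3, 5)

# Broadcasting takes care of any index permutation internally:
@. pen_hat = (grid.x + grid.y + grid.z) * pen_hat
```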

corentin-dev commented 2 years ago

Just to be sure, will localgrid return an array of type typeof(x) (or interpretable as such)?

Not sure if I understand the question, but localgrid returns a LocalRectilinearGrid (also defined in https://github.com/jipolanco/PencilArrays.jl/pull/45), which can be indexed as if it was an N-dimensional array. For example grid[2, 3, 5] returns a coordinate (x, y, z).

My concern is that it stays a CuArray. I guess I'll just try to see what it gives me!

jipolanco commented 2 years ago

Right. It should stay a CuArray, since no new arrays are created. But let me know if that's not the case!

jipolanco commented 2 years ago

I just merged #45, which will be included in v0.15 (being registered right now). So I'm closing this for now, but let me know if there are still issues!

corentin-dev commented 2 years ago

Everything seems to work for broadcasting. I'll open another issue about CuArrays, related to CuPtr and Isend.