sdobber / FluxArchitectures.jl

Complex neural network examples for Flux.jl

TPA-LSTM trains slower on GPU than on CPU #15

Open · sdobber opened this issue 3 years ago

sdobber commented 3 years ago

Apparently, the construction using Flux.unstack and Flux.stack is much slower than the supposedly 'slow' Zygote.Buffer. The latter cannot be used on the GPU due to missing support for array mutation.
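
For context, a minimal sketch of the two constructions being compared. This assumes a per-time-step cell that maps a features × batch slice to a same-sized slice; collect_stack, collect_buffer and cell are illustrative names, not part of the package, and the positional dimension argument to stack/unstack matches the Flux versions of that era:

using Flux

# Approach 1: slice with Flux.unstack, apply the cell per time step,
# re-assemble with Flux.stack. GPU-compatible, but observed to be slow.
function collect_stack(cell, inp)         # inp: features × time × batch
    xs = Flux.unstack(inp, 2)             # vector of features × batch slices
    ys = [cell(x) for x in xs]
    return Flux.stack(ys, 2)              # back to features × time × batch
end

# Approach 2: write each time step into a Zygote.Buffer. Fast on the CPU,
# but the in-place writes are unsupported on the GPU.
function collect_buffer(cell, inp)
    out = Flux.Zygote.Buffer(inp, size(inp, 1), size(inp, 2), size(inp, 3))
    for t in 1:size(inp, 2)
        out[:, t, :] = cell(inp[:, t, :])
    end
    return copy(out)                      # materialise the Buffer as an array
end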

DhairyaLGandhi commented 3 years ago

Probably worth opening an issue on Flux.jl with an MWE?

sdobber commented 3 years ago

Which issue are you referring to? That stack and unstack are slower, or that Zygote.Buffer does not work? I will try to put an MWE together (though it might take a while due to another project I need to work on). I have to admit that the mistake could just as well be entirely on my side. I'm new to GPU programming, so I'm still learning a lot while slowly moving forward, and I'm probably still doing a lot of things the wrong way 😄

DhairyaLGandhi commented 3 years ago

Both? I'm happy to help with the network architectures as well. Btw, did you see that we added a reference to this repo on the Flux site: https://fluxml.ai/ecosystem.html#advanced-models

sdobber commented 3 years ago

Note to self: MWE for the Zygote.Buffer segfault on the GPU:

using Flux

inp = rand(Float32, 137, 10, 1000) |> gpu    # features × time × batch, moved to the GPU
B = Flux.Zygote.Buffer(inp, 137, 9, 1000)    # Buffer backed by a CuArray
t = 1
x = inp[:, t, :]                             # slice out a single time step
B[:, t, :] = x                               # in-place write into the Buffer; segfaults on the GPU
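
For comparison, the same sequence of operations is expected to run without problems when the data stays on the CPU; this sketch is only meant to illustrate that the crash is GPU-specific:

using Flux

inp_cpu = rand(Float32, 137, 10, 1000)             # same data, kept on the CPU
B_cpu = Flux.Zygote.Buffer(inp_cpu, 137, 9, 1000)  # Buffer backed by a plain Array
B_cpu[:, 1, :] = inp_cpu[:, 1, :]                  # in-place write succeeds here
out = copy(B_cpu)                                  # convert the Buffer to an Array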