toinsson / pysdtw

Torch implementation of Soft-DTW, supports CUDA.
MIT License

Parallelisation #5

Open stellarpower opened 4 weeks ago

stellarpower commented 4 weeks ago

Hi,

I was wondering: does the current algorithm automatically parallelise across the elements in the batch? I have an RTX 3090 (24 GiB) and I run out of memory instantly for any sequence longer than 512 samples.

If CUDA is trying to parallelise across all the sequences in the batch at once, I think it would be good to fall back to running them in series when there isn't enough memory, since they should be independent. DTW seems to have inherently high memory use, and I'd rather the loss take longer to compute than be limited in sequence length, if that's what is happening and a fallback is possible.

Cheers

toinsson commented 4 weeks ago

Hello @stellarpower, if you are running into memory issues, I would recommend reducing the batch size until your data fits into GPU memory. With a batch of one and a length of 512 (i.e. [1, 512]), you do not run into this issue, right?

stellarpower commented 4 weeks ago

I think I don't, although if the sequences in the batch are independent, would it be possible to avoid having them all consume resources at once by looping through them one by one (or several at a time) instead? I.e. if I have a batch size of 16 and there's only memory to compute 8 at a time, can we do two iterations and just sum the results?
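Roughly this sort of host-side loop is what I mean (just a sketch; sdtw stands for whatever callable returns the per-sequence losses, and chunk_size would be chosen to fit in memory):

import torch

def chunked_loss(sdtw, y_true, y_pred, chunk_size=8):
    # Evaluate the batch in independent chunks and concatenate the per-sequence
    # losses, so peak memory is governed by chunk_size rather than the batch size.
    losses = []
    for start in range(0, y_true.shape[0], chunk_size):
        losses.append(sdtw(y_true[start:start + chunk_size],
                           y_pred[start:start + chunk_size]))
    return torch.cat(losses).sum()

Gradients should still flow through each chunk as normal, since the chunks are just slices of the same tensors.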

I think the problem I hit after that, which was more of a concern, was running out of resources for longer sequences - the overall GPU memory was fine, but I assume the thread limit or per-kernel memory was exhausted, even with a batch size of 1.

toinsson commented 3 weeks ago

With regards to splitting batched input into chunks depending on the memory available on the host machine, that will not be implemented in pysdtw: it is too complex and out of scope. I would rather have the host make sure that calls are made with an adapted batch size.

For a longer sequence and a batch of one, it should also be the GPU memory limitation. It is obviously a limit of this algorithm, and I guess there are some workarounds, but implementing them would represent a sizeable amount of work. It is not planned for the moment, but if you want to investigate that, I'll be happy to review a PR.

Unless that is the case, I'll close this issue?

stellarpower commented 3 weeks ago

I would rather have the host make sure that calls are made with an adapted batch size.

I'm afraid I'm not sure what you mean by an adapted batch size; does Torch change the batch size dynamically at times? I have not seen this in TensorFlow.

But this is what I was wondering: is there a motivating reason for handling all the sequences in the batch inside the kernel? I would have thought that if the kernel only handles one sequence, the framework would be able to make sensible decisions about how to parallelise them and queue them up in pipelines, so this stops being an issue. I haven't been able to stress-test the Keras version yet, but since I handle each sequence separately, and TensorFlow works on the atomic unit of "ops", I'm hoping it will be smart enough to queue everything up in a way that avoids memory exhaustion when it knows the sequences can be parallelised. At the top level I have a map over the sequences in the batch, so it seems reasonably hopeful that will be the case.
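To make that concrete, the Keras side is roughly of this shape (a sketch; single_soft_dtw here is only a placeholder standing in for the real per-sequence loss op):

import tensorflow as tf

def single_soft_dtw(pair):
    # Placeholder for the real per-sequence soft-DTW computation on one
    # (true, pred) pair of shape [length, features].
    y_true_seq, y_pred_seq = pair
    return tf.reduce_sum(tf.square(y_true_seq - y_pred_seq))

def batched_loss(y_true, y_pred):
    # Map the per-sequence op over the batch dimension and let TensorFlow
    # decide how to schedule the independent ops.
    per_sequence = tf.map_fn(single_soft_dtw, (y_true, y_pred),
                             fn_output_signature=tf.float32)
    return tf.reduce_sum(per_sequence)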

Are longer sequences limited by the CUDA kernel/thread memory limits mentioned in Maghoumi's version, or just by total available memory? For my particular network, a sequence length of 512 will be quite limiting, so I may still look into writing a kernel in C++ if the model learns but needs further optimisation.

Unless that is the case, I'll close this issue?

Feel free. I don't know if you want to enable the Q&A feature on the repository, as this was more of a question than a request and could be moved over there. I think a limited sequence length could be an issue for my project, but as I'm mostly using TensorFlow I don't have as much of a reason to need it myself for the meantime.

BTW, whilst I am here, do you think it would be possible to add a quick example of obtaining the full gradients to the readme? I am getting a difficult bug where my implementation differs from yours and the other Torch versions in the gradients, but only when gamma != 1.0. The intermediate matrices all agree to within numerical precision, so I'm not sure if I am using the gradient tape properly. This is what I have; I am looking at the tests right now to see if it does the same thing:

# y_pred must have been created with requires_grad=True for .grad to be populated.
lossesPerSequenceGraph = sdtw(y_true, y_pred)

# Detach to get plain numbers (with .cpu() first if the tensors are on the GPU).
lossesPerSequence = lossesPerSequenceGraph.detach().numpy()

# Have to reduce to a scalar (sum) before running the backwards pass.
lossesPerSequenceGraph.sum().backward()

# We only want the gradient on y_pred; y_true is constant and thus has no gradient.
torchGradients = y_pred.grad
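
For context, a fuller, self-contained version of what I'm running looks roughly like this (the gamma and use_cuda constructor arguments are how I understand SoftDTW from the readme, so correct me if that's off):

import torch
import pysdtw

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Small batch of sequence pairs; y_pred needs requires_grad so .grad is populated.
y_true = torch.rand(4, 128, 1, device=device)
y_pred = torch.rand(4, 128, 1, device=device, requires_grad=True)

# Constructor arguments assumed from the readme.
sdtw = pysdtw.SoftDTW(gamma=0.1, use_cuda=(device.type == 'cuda'))

lossesPerSequenceGraph = sdtw(y_true, y_pred)
lossesPerSequenceGraph.sum().backward()

torchGradients = y_pred.grad  # same shape as y_pred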

Thanks

stellarpower commented 3 weeks ago

Edit: it looks like I have to call y_pred.grad.zero_() explicitly after calculating the gradients for each value of gamma (PyTorch accumulates gradients across backward passes); with that, the numbers match between the two implementations, so I had it right in the Keras version all along. That was a very annoying loss of a day :rofl:
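
In case anyone else hits this, the sweep over gamma now looks roughly like the following (again, the constructor arguments are my assumption from the readme):

import torch
import pysdtw

y_true = torch.rand(2, 64, 1)
y_pred = torch.rand(2, 64, 1, requires_grad=True)

gradients = {}
for gamma in (0.1, 1.0, 10.0):
    sdtw = pysdtw.SoftDTW(gamma=gamma, use_cuda=False)
    sdtw(y_true, y_pred).sum().backward()
    gradients[gamma] = y_pred.grad.detach().clone()
    # .grad accumulates across backward() calls, so clear it before the next gamma.
    y_pred.grad.zero_()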

stellarpower commented 3 weeks ago

Am I also missing something about the memory use?

Let's say I have data of shape (16, 512, 1). For each sequence pair, with float32, the 512 × 512 pairwise-distance matrix is 512 * 512 * 4 bytes ≈ 1 MiB, and the accumulated soft-DTW cost matrix should be about the same again.

So even if we compute everything at the same time, that's on the order of megabytes per batch, and yet I'm able to exhaust the 24 GiB of memory on my 3090 by increasing either the batch size or the sequence length. So presumably I've missed something in this model of what the algorithm is doing(?)
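
For reference, the back-of-the-envelope numbers I had in mind (which clearly don't account for whatever intermediates the implementation actually keeps):

batch, length = 16, 512
bytes_per_float32 = 4

# One pairwise-distance matrix plus one accumulated-cost matrix per sequence pair.
per_sequence = 2 * length * length * bytes_per_float32
per_batch = batch * per_sequence

print(per_sequence / 2**20, "MiB per sequence")  # ~2 MiB
print(per_batch / 2**20, "MiB per batch")        # ~32 MiB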