microsoft / HMNet

Official Implementation of "A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining"

CUDA out of memory #5

Open Arman-IMRSV opened 3 years ago

Arman-IMRSV commented 3 years ago

Hello. I am trying to reproduce the paper results. I am currently running the code on 2 Tesla V100 GPUs, each with 16GB of memory, but I am still getting an out-of-memory error. I also tried decreasing MAX_TRANSCRIPT_WORD to 1000, but it did not help. Could you please let me know what hardware and GPUs are required to run this?

Arman-IMRSV commented 3 years ago

@xrc10

ilyaivensky commented 3 years ago

Same story here: running with 4 Quadro RTX 6000 GPUs, each with 24GB of memory.

xrc10 commented 3 years ago

We used a V100 GPU with 32GB of memory. Unfortunately, I haven't tried it with other GPUs. Can you also try decreasing MAX_SENT_LEN and MAX_SENT_NUM to smaller values to see if the OOM error still occurs?
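For reference, a sketch of the kind of change being suggested, assuming the plain `KEY value` format of the repo's training config files; the values below are illustrative starting points, not the repo defaults:

```
MAX_TRANSCRIPT_WORD 1000
MAX_SENT_LEN 20
MAX_SENT_NUM 150
```

MAX_SENT_LEN caps tokens per utterance and MAX_SENT_NUM caps utterances per meeting, so activation memory grows with both; lowering them together has a compounding effect.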

Arman-IMRSV commented 3 years ago

Thanks @xrc10 for the response. I had tried decreasing those parameters, but it didn't help.

omelnikov commented 3 years ago

> Thanks @xrc10 for the response. I had tried decreasing those parameters, but it didn't help.

Hi @Arman-IMRSV, good observations! Could you clarify which parameter values you tried, and what you decreased them from and to? Judging by the difference in GPU memory sizes, the change in parameters needs to produce batches about half the byte size of those used by the authors. Note that the GPU reserves some memory for its own tasks, so not all 24GB is available for training batches.

Also, did you use the same training set? Sentence length varies across corpora. The byte size of a batch can also be estimated from its average character length.

I'm also curious whether you investigated the batch that caused the memory crash. Was it the first batch? What was its size in bytes? You might also try lower-precision tensors than those used in the paper. Try exploring the memory-crashing batch in greater detail. I hope it works out, but do report what you discover; it helps others reproduce the results with fewer glitches on different hardware.
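A minimal sketch of the kind of inspection suggested above, assuming a standard PyTorch training loop where the batch is a dict of tensors; the `model(**batch)` call is a placeholder, not HMNet's actual interface:

```python
import torch

def batch_nbytes(batch):
    """Sum the byte sizes of all tensors in a batch dict."""
    return sum(t.element_size() * t.nelement()
               for t in batch.values() if torch.is_tensor(t))

def debug_step(model, batch, step):
    try:
        loss = model(**batch)  # placeholder for the real forward pass
        loss.backward()
    except RuntimeError as err:
        if "out of memory" in str(err):
            # Log which batch crashed and how large it was, as suggested above.
            print(f"OOM at step {step}: batch ~{batch_nbytes(batch) / 1e6:.1f} MB, "
                  f"peak CUDA memory {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
        raise
```

For the lower-precision suggestion, PyTorch's automatic mixed precision (`torch.cuda.amp.autocast` with a `GradScaler`) is one way to roughly halve activation memory, though whether it works with the released code is untested here.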

shonaviso commented 2 years ago

Hi @Arman-IMRSV, I am facing the above issue during evaluation. Is that the case for you as well?
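For evaluation-time OOM specifically, one common cause (an assumption here, not something confirmed for HMNet) is running inference with autograd enabled, so activations are kept for a backward pass that never happens. A minimal sketch of decoding under `torch.no_grad()`:

```python
import torch

@torch.no_grad()  # skip storing activations for backprop during decoding
def evaluate(model, loader):
    model.eval()
    summaries = []
    for batch in loader:
        summaries.append(model(**batch))  # placeholder for HMNet's decode call
    return summaries
```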