sboehringer opened this issue 3 months ago
Can you please paste the error stack here to investigate? Do you run out of GPU memory or RAM?
It is GPU RAM. Please find the log below (wrapped as captured from a tmux session). I can also provide a self-contained example for reproduction, if helpful.
I should add that the suggested fix from the output (TF_GPU_ALLOCATOR=cuda_malloc_async) did not change anything.
Given that the error occurs only after 50 or 60 epochs, it looks as if memory accumulates somehow, though I don't see why or where this might be. Does the code run fast enough to test it without a GPU? If it does, could you run it on CPU only (using CUDA_VISIBLE_DEVICES='' ./bayesFlow.py --regress linearMVmi) and monitor RAM usage to see whether memory usage increases with epochs? Does it run out of memory in that case as well? This would help pinpoint whether the error is GPU-related or more general.
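One way to do that monitoring from inside the script is to log the process's resident set size at epoch boundaries. A minimal sketch (the epoch loop is hypothetical; the stdlib resource module is Unix-only, and ru_maxrss is in kilobytes on Linux but bytes on macOS):

```python
import resource

def peak_rss_gb():
    # Peak resident set size of this process so far.
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kb / 1e6  # KB -> GB, assuming the Linux convention

# Hypothetical training loop: log memory at each epoch boundary.
for epoch in range(1, 4):
    # ... one epoch of training would run here ...
    print(f"End Epoch {epoch}: {peak_rss_gb():.3f} GB")
```

If the logged value grows by a roughly constant amount each epoch, that would point at per-epoch state being retained rather than a one-time allocation.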
If this is not possible, a self-contained example for reproduction would be a great help.
Are you running the training from a script or from a Jupyter notebook?
The command is at the top of the log; it is a training script.
I see. I would need to run it on my GPU workstation to reproduce the problem.
@vpratz: here is memory usage using the CPU
End Epoch 4: 2.564g
Beginning Epoch 5: 2.682g
~ 100/1000 Iterations: 2.690g
End Epoch 5: 2.693g
Memory from the previous epoch seems not to be freed or reused. Then there is some further initial consumption of about 8 MB, after which memory usage becomes stable. This pattern repeats for subsequent epochs.
These numbers seem compatible with what I see on the GPU: 7.5 GB of usable memory divided by ~130 MB retained per epoch would equate to ~57 epochs until OOM.
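For reference, the back-of-the-envelope estimate (assuming ~7.5 GB of usable GPU memory on the 8 GB card and the ~130 MB per-epoch growth observed in the CPU run):

```python
usable_gpu_gb = 7.5        # ~8 GB card minus driver/framework overhead
leak_per_epoch_gb = 0.130  # ~130 MB retained per epoch, per the CPU run

epochs_until_oom = int(usable_gpu_gb / leak_per_epoch_gb)
print(epochs_until_oom)  # 57
```

This matches the observed failure window of 50-60 epochs reasonably well.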
I will put together a self-contained example for reproduction. Thank you.
Here is an example that has been tested to reproduce the OOM: bayesFlow-debug.txt
It can be run directly without arguments and shouldn't touch the disk (no reads or writes).
Thanks, I will investigate whether it's a problem that specifically affects the SetTransformer!
When running BayesFlow to analyze regression models, I get OOM errors after about 50-60 epochs. The simulations need a data set of covariates from which a one-dimensional outcome is simulated. Here are my batched Prior/Simulation classes:
The simulation is set up as:
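(The author's classes are not reproduced in this capture. As a rough illustration only, with all names and shapes hypothetical, a batched prior/simulator pair of the kind described — covariates plus a one-dimensional outcome — might look like this in plain NumPy:)

```python
import numpy as np

rng = np.random.default_rng(1)

def batched_prior(batch_size: int, n_covariates: int = 4) -> np.ndarray:
    # Draw regression coefficients: an intercept plus one slope per covariate.
    return rng.normal(size=(batch_size, n_covariates + 1))

def batched_simulator(theta: np.ndarray, n_obs: int = 300) -> dict:
    # Simulate covariates X and a one-dimensional outcome y per batch element.
    batch_size, n_params = theta.shape
    X = rng.normal(size=(batch_size, n_obs, n_params - 1))
    # Linear predictor: intercept + X @ slopes, computed batch-wise.
    mean = theta[:, None, 0] + np.einsum("bok,bk->bo", X, theta[:, 1:])
    y = mean + rng.normal(size=(batch_size, n_obs))
    return {"covariates": X, "outcome": y[..., None]}

theta = batched_prior(8)
sims = batched_simulator(theta)
print(sims["covariates"].shape, sims["outcome"].shape)  # (8, 300, 4) (8, 300, 1)
```

With ~300 observations and at most 4 covariates per data set, each simulated batch is indeed small, which makes the steady memory growth over epochs the more surprising.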
This is on an NVIDIA GeForce RTX 3070 with 8 GB of RAM. The sample size is ~300 with up to 4 covariates, so a small data set.
Thank you.