Closed: danchern97 closed this issue 3 years ago
I am struggling with this too. Has anyone fixed this / does anyone have a good workaround? (It seems the commit on Feb 12 (d518cfbc7bbb5f7270afc07570c591b0a32b65c9) has not completely fixed this issue.)
There is a standard way to handle running out of device or host memory in a single run of a program. Borrowing from TensorFlow, you can save "checkpoint files" to disk/persistent memory every once in a while: write all your necessary parameters/variables out to a file, then read the latest checkpoint back when you rerun the program after memory is exhausted for whatever reason. You can use whatever format you like for the checkpoint file; what matters most is that it lives in persistent storage.
You may also use the sync command to clear CPU memory, by the way.
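A minimal sketch of the checkpointing idea described above, in Python. The file name, the contents of `state`, and the per-matrix work are all placeholders; the point is only the save/resume pattern:

```python
import os
import pickle

CKPT_PATH = "checkpoint.pkl"  # placeholder file name

def save_checkpoint(state):
    # Write to a temp file first, then rename, so a crash mid-write
    # cannot leave a corrupted checkpoint behind.
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint():
    # Return the last saved state, or None on a fresh run.
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            return pickle.load(f)
    return None

# Resume from the latest checkpoint if one exists.
state = load_checkpoint() or {"next_index": 0, "results": []}
for i in range(state["next_index"], 10):
    state["results"].append(i * i)  # stand-in for the real per-matrix work
    state["next_index"] = i + 1
    if i % 3 == 0:                  # checkpoint every few iterations
        save_checkpoint(state)
save_checkpoint(state)
```

If the process dies partway through, rerunning the script picks up at `next_index` instead of starting over.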
Hi, thanks for responding! Are you referring to os.sync(), or some other sync command?
Your answer really helped, though I had already figured it out a few months ago. Thanks anyway!
Like last time, I am using the Python bindings to compute features for a large number of matrices. The GPU memory usage now looks OK, but the RAM usage increases slowly. It's fine for a small number of matrices, but for a large number it is a real pain.
I suspect some variables are not being freed in ripser++.cu, but I haven't been able to trace them yet.
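A common workaround while a leak like this is untraced (not a fix for ripser++ itself) is to run each batch of matrices in a short-lived worker process, so that any host memory the native code leaks is reclaimed by the OS when the worker exits. A sketch using `multiprocessing.Pool` with `maxtasksperchild=1`; `compute_features` is a hypothetical stand-in for the real call into the bindings:

```python
from multiprocessing import get_context

def compute_features(matrix):
    # Stand-in for the real per-matrix call into the ripser++ bindings;
    # any memory the native code leaks dies with the worker process.
    return sum(sum(row) for row in matrix)

def run_batches(matrices):
    # maxtasksperchild=1 recycles each worker after a single task, so
    # leaked host memory is returned to the OS instead of accumulating.
    # "fork" avoids re-importing the module; on Windows/macOS use
    # "spawn" and call run_batches() from under a __main__ guard.
    ctx = get_context("fork")
    with ctx.Pool(processes=2, maxtasksperchild=1) as pool:
        return pool.map(compute_features, matrices)

mats = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
features = run_batches(mats)  # each matrix handled in a fresh process
```

The trade-off is process-startup overhead per task, so batching several matrices per worker (via a larger `maxtasksperchild`) may be a better balance in practice.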