simonzhang00 / ripser-plusplus

Ripser++: GPU-accelerated computation of Vietoris–Rips persistence barcodes
MIT License

Another memory issue. #7

Closed by danchern97 3 years ago

danchern97 commented 3 years ago

Like last time, I am using the Python bindings to compute features for a large number of matrices. GPU memory usage now looks OK, but host RAM usage increases slowly. It's fine for a small number of matrices, but with a large number it is really a pain.

I guess some variables in ripser++.cu are not freed, but I haven't traced them yet.

timxtc commented 3 years ago

I am struggling with this too. Has anyone fixed this, or does anyone have a good workaround? It seems the Feb 12 commit (d518cfbc7bbb5f7270afc07570c591b0a32b65c9) has not completely fixed this issue.

simonzhang00 commented 3 years ago

There is a standard way to handle running out of device or host memory during a single run of a program. Borrowing from TensorFlow, on the programmer's end you may save "checkpoint files" to disk/persistent memory every once in a while. Save all your necessary parameters/variables to a file, then read the latest checkpoint file back when you run your program a second time after memory is used up for whatever reason. You may use whatever format you like for your checkpoint file; what is most important is that you use persistent memory.
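A minimal sketch of that checkpointing pattern in Python, under some assumptions: `compute_features` is a hypothetical callable standing in for whatever per-matrix work you do (for example, a call into the Python bindings), results are keyed by matrix index, and the checkpoint format is a pickle file. None of these names come from Ripser++ itself.

```python
import os
import pickle
import tempfile


def load_checkpoint(path):
    """Return the saved results dict, or an empty dict if no checkpoint exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


def save_checkpoint(path, results):
    """Write results to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(results, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def process_with_checkpoints(matrices, compute_features, path, every=100):
    """Compute features for each matrix, skipping ones already checkpointed.

    compute_features is a hypothetical stand-in for the real per-matrix call.
    """
    results = load_checkpoint(path)
    for index, matrix in enumerate(matrices):
        if index in results:
            continue  # already done in an earlier run
        results[index] = compute_features(matrix)
        if len(results) % every == 0:
            save_checkpoint(path, results)
    save_checkpoint(path, results)
    return results
```

If the process dies or is killed once RAM fills up, running the same script again resumes from the last checkpoint, and the restart itself returns the leaked memory to the operating system. A smaller `every` loses less work on a crash at the cost of more disk writes.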

You may use the sync command to clear CPU memory, btw.

timxtc commented 3 years ago

Hi, thanks for responding! Are you referring to os.sync(), or some other sync command?

danchern97 commented 3 years ago

Your answer really helped, though I figured it out a few months ago. Still, thanks!