microsoft / MaskFlownet

[CVPR 2020, Oral] MaskFlownet: Asymmetric Feature Matching with Learnable Occlusion Mask
https://arxiv.org/abs/2003.10955

RAM leak when training at batch_queue.get(): where are resources released once a batch is trained? #26

Open vivasvan1 opened 4 years ago

vivasvan1 commented 4 years ago

I have noticed that your training loop leaks small amounts of RAM. Any idea what may be causing this?

time taken= 9.865329265594482 | steps= 1 | cpu= 51.8 | ram= 34.50078675328186 | gpu= [3101] [5613]
time taken= 0.934636116027832 | steps= 2 | cpu= 27.0 | ram= 29.34866251942084 | gpu= [5613] [3045]
time taken= 0.8695635795593262 | steps= 3 | cpu= 29.4 | ram= 29.217970957706278 | gpu= [3045] [3021]
time taken= 0.8483304977416992 | steps= 4 | cpu= 29.8 | ram= 29.033316428574086 | gpu= [3021] [2997]
time taken= 0.8630681037902832 | steps= 5 | cpu= 30.2 | ram= 28.87988403913803 | gpu= [2997] [2997]
time taken= 0.8645083904266357 | steps= 6 | cpu= 29.4 | ram= 28.714746447210654 | gpu= [2997] [2997]
time taken= 0.864253044128418 | steps= 7 | cpu= 29.3 | ram= 28.573093657739385 | gpu= [2997] [2997]
time taken= 0.8693573474884033 | steps= 8 | cpu= 29.3 | ram= 28.389703885656044 | gpu= [2997] [2997]
time taken= 0.8704898357391357 | steps= 9 | cpu= 29.4 | ram= 28.298690976454438 | gpu= [2997] [2997]
time taken= 0.8670341968536377 | steps= 10 | cpu= 29.5 | ram= 28.13385097442091 | gpu= [2997] [2997]
time taken= 0.8750414848327637 | steps= 11 | cpu= 29.5 | ram= 27.959884882309396 | gpu= [2997] [2997]
time taken= 0.8624210357666016 | steps= 12 | cpu= 29.9 | ram= 27.784356443255188 | gpu= [2997] [2997]
time taken= 0.8561670780181885 | steps= 13 | cpu= 29.8 | ram= 27.644241201568796 | gpu= [2997] [2997]
time taken= 0.8609695434570312 | steps= 14 | cpu= 29.7 | ram= 27.51883186047002 | gpu= [2997] [2997]
time taken= 0.8462607860565186 | steps= 15 | cpu= 29.7 | ram= 27.36641623650461 | gpu= [2997] [2997]
time taken= 0.8624782562255859 | steps= 16 | cpu= 29.2 | ram= 27.23760941078441 | gpu= [2997] [2997]
time taken= 0.8649694919586182 | steps= 17 | cpu= 29.4 | ram= 27.113514425050127 | gpu= [2997] [2997]
time taken= 0.8661544322967529 | steps= 18 | cpu= 29.3 | ram= 27.004993310427178 | gpu= [2997] [2997]
time taken= 0.8687705993652344 | steps= 19 | cpu= 29.8 | ram= 26.82090916192486 | gpu= [2997] [2997]
time taken= 0.8823645114898682 | steps= 20 | cpu= 29.6 | ram= 26.688630454109777 | gpu= [2997] [2997]
time taken= 0.8795809745788574 | steps= 21 | cpu= 29.4 | ram= 26.517987449146226 | gpu= [2997] [2997]
time taken= 0.8857841491699219 | steps= 22 | cpu= 29.1 | ram= 26.40289455770082 | gpu= [2997] [2997]
time taken= 0.8605339527130127 | steps= 23 | cpu= 29.5 | ram= 26.274509317663572 | gpu= [2997] [2997]
time taken= 0.8524265289306641 | steps= 24 | cpu= 29.8 | ram= 26.16445065525575 | gpu= [2997]
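
For reference, here is a minimal, self-contained sketch of how per-step numbers like these can be produced with psutil; `batch_queue` and `train_step` below are stand-ins, not the repository's actual code. Note that the `ram` figure is the percentage of system RAM still *available*, so a steadily falling value is what indicates a leak.

```python
import queue
from timeit import default_timer

import psutil

# Stand-ins for the real pipeline (hypothetical; the real loop lives in main.py).
batch_queue = queue.Queue()
for _ in range(24):
    batch_queue.put(bytearray(10 * 1024 * 1024))  # dummy 10 MB "batch"

def train_step(batch):
    pass  # placeholder for the actual forward/backward pass

for step in range(1, 25):
    t0 = default_timer()
    batch = batch_queue.get()   # the suspected leak point
    train_step(batch)
    # "ram" = percentage of system RAM still available; a falling value means a leak.
    print('time taken=', default_timer() - t0, '| steps=', step,
          '| cpu=', psutil.cpu_percent(),
          '| ram=', psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
```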

vivasvan1 commented 4 years ago

Can you check whether this happens only on my PC, or with your setup as well?

Also, is there a way to train in MXNet without loading the full dataset into memory?
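
One common MXNet pattern for the second question (not specific to this repository's loaders) is a `gluon.data.Dataset` that keeps only file paths in memory and reads each sample from disk in `__getitem__`, so the `DataLoader` never holds the whole dataset in RAM. The directory layout, image size, and class name below are illustrative assumptions:

```python
import os

import mxnet as mx
from mxnet.gluon.data import DataLoader, Dataset

class LazyImageDataset(Dataset):
    """Holds only file paths in memory; decodes each image on demand."""
    def __init__(self, image_dir):
        self.paths = sorted(os.path.join(image_dir, f) for f in os.listdir(image_dir))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = mx.image.imread(self.paths[idx])   # loaded lazily, one sample at a time
        return mx.image.imresize(img, 512, 384)  # fixed size so batches can be stacked

# num_workers > 0 decodes and prefetches batches in background processes.
loader = DataLoader(LazyImageDataset('data/images'), batch_size=8,
                    shuffle=True, num_workers=2)
for batch in loader:
    pass  # feed the batch to the network here
```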

vivasvan1 commented 4 years ago

Using pdb, I have found that after every call to

batch = batch_queue.get()

an extra 0.10-0.15% of RAM is consumed, which never seems to be released.

(Pdb) print("| ram=",psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
**ram= 28.66921519345203**
(Pdb) n
> /home/mask/maskflownet/MaskFlownet/main.py(572)<module>()
-> loading_time.update(default_timer() - t0)
(Pdb) print("| ram=",psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
**ram= 28.542640291935687**

I cannot figure out why this is happening, but I am sure of it. Can you help me fix this, please?
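
One general way to narrow down where the memory goes (a standard-library diagnostic, not specific to this repo) is tracemalloc: take a snapshot, run a few training steps, take another, and diff by source line. A self-contained demo, with a stand-in allocation in place of the real `batch_queue.get()` loop:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Stand-in for a few real training steps around batch_queue.get().
retained = []
for _ in range(100):
    retained.append(bytearray(1024 * 1024))

snapshot = tracemalloc.take_snapshot()
# The source lines that accumulated the most memory since the baseline.
for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
    print(stat)
```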

simon1727 commented 4 years ago

Hi vivasvan1, thanks for pointing out this problem.

We import Queue from the Python queue package directly, without any modification. I searched on Google and found that other people have encountered the same problem, so this may not be a problem with our code but with the Python queue package.
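
For anyone else hitting this: `queue.Queue.get()` does remove the item from the queue, so if memory keeps growing, the remaining references are more likely held by the consuming loop or the producer threads. A hedged workaround sketch (a guess, not a confirmed fix) that drops the batch reference promptly and forces a periodic collection:

```python
import gc
import queue

# A bounded queue caps how many pending batches the producer can keep alive.
batch_queue = queue.Queue(maxsize=8)
for _ in range(8):
    batch_queue.put(bytearray(10 * 1024 * 1024))  # dummy batches

step = 0
while not batch_queue.empty():
    batch = batch_queue.get()
    # ... training step would run here ...
    del batch        # drop the last reference to the batch promptly
    step += 1
    if step % 100 == 0:
        gc.collect()  # break reference cycles that can keep batches alive
```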