piskvorky / bounter

Efficient Counter that uses a limited (bounded) amount of memory regardless of data size.
MIT License

Repeatedly saving object increases program memory #33

Open menshikh-iv opened 6 years ago

menshikh-iv commented 6 years ago

Description

"I'm using bounter to count the frequency of items in a large set. I was periodically pickling the bounter object. Doing this causes the memory to continually increase" (based on https://groups.google.com/forum/#!topic/gensim/LsReiXXOzKY thread)

Steps/Code/Corpus to Reproduce

import pickle as pkl
from bounter import bounter
import numpy as np
import psutil
import gc

def get_used_memory():
    """
    Return the current amount of used system memory, in GB
    """
    return '{:.3f}'.format(psutil.virtual_memory().used / 1024.0 / 1024.0 / 1024.0)

def log(msg):
    print(msg, ', memory =', get_used_memory())

def main():
    log('Starting with np array')
    a = np.random.randint(0, 512, (8, 33554432), dtype='int32')
    log('Initialized array')
    for i in range(6):
        with open('array.pkl', 'wb') as f:
            pkl.dump(a, f, protocol=pkl.HIGHEST_PROTOCOL)
            log('Finished saving the ' + str(i) + 'th copy of the array')
    del a
    gc.collect()
    log('deleted array and performed gc.collect() ')

    counter = bounter(size_mb=1024, need_iteration=False, log_counting=1024)
    log('Initialized counter')
    for i in range(6):
        with open('counter.pkl', 'wb') as f:
            pkl.dump(counter, f, protocol=pkl.HIGHEST_PROTOCOL)
            log('Finished saving the ' + str(i) + 'th copy of the bounter')

    del counter
    gc.collect()
    log('deleted array and performed gc.collect() ')
    log('Finished')

if __name__ == '__main__':
    main()
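
Note that psutil.virtual_memory().used reports memory used by the whole system, so other processes can influence the numbers. A minimal sketch of a per-process reading follows, assuming Linux (it reads the /proc filesystem); get_process_rss_gb is a hypothetical helper and is not part of the script above:

```python
import os

def get_process_rss_gb():
    """Return the resident set size of the current process, in GB (Linux only)."""
    # /proc/self/statm reports sizes in pages; the second field is resident pages.
    with open('/proc/self/statm') as f:
        resident_pages = int(f.read().split()[1])
    page_size = os.sysconf('SC_PAGE_SIZE')  # bytes per page
    return resident_pages * page_size / 1024.0 / 1024.0 / 1024.0

print('process RSS (GB) = {:.3f}'.format(get_process_rss_gb()))
```

Using this in log() instead of get_used_memory() would attribute any growth to this process alone. If psutil is preferred, psutil.Process().memory_info().rss should give a comparable figure.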

Expected Results

Memory shouldn't increase significantly after each dump

Actual Results

I get the following log output, along with two pkl files, each 1.1 GB in size:

('Starting with np array', ', memory =', '3.539')
('Initialized array', ', memory =', '4.540')
('Finished saving the 0th copy of the array', ', memory =', '4.540')
('Finished saving the 1th copy of the array', ', memory =', '4.544')
('Finished saving the 2th copy of the array', ', memory =', '4.549')
('Finished saving the 3th copy of the array', ', memory =', '4.549')
('Finished saving the 4th copy of the array', ', memory =', '4.553')
('Finished saving the 5th copy of the array', ', memory =', '4.562')
('deleted array and performed gc.collect() ', ', memory =', '3.561')
('Initialized counter', ', memory =', '3.561')
('Finished saving the 0th copy of the bounter', ', memory =', '4.567')
('Finished saving the 1th copy of the bounter', ', memory =', '5.573')
('Finished saving the 2th copy of the bounter', ', memory =', '6.577')
('Finished saving the 3th copy of the bounter', ', memory =', '7.576')
('Finished saving the 4th copy of the bounter', ', memory =', '8.579')
('Finished saving the 5th copy of the bounter', ', memory =', '9.582')
('deleted array and performed gc.collect() ', ', memory =', '9.580')
('Finished', ', memory =', '9.580')

I see two suspicious things here. First, memory grows by about 1 GB with each dump of the counter:

('Finished saving the 0th copy of the bounter', ', memory =', '4.567')
('Finished saving the 1th copy of the bounter', ', memory =', '5.573')
('Finished saving the 2th copy of the bounter', ', memory =', '6.577')
('Finished saving the 3th copy of the bounter', ', memory =', '7.576')
('Finished saving the 4th copy of the bounter', ', memory =', '8.579')
('Finished saving the 5th copy of the bounter', ', memory =', '9.582')

Second, the memory is not released after the counter is deleted and gc.collect() is called, which looks like a memory leak:

('Finished saving the 5th copy of the bounter', ', memory =', '9.582')
('deleted array and performed gc.collect() ', ', memory =', '9.580')
('Finished', ', memory =', '9.580')

Versions

CentOS 7
python = 3.6.1
bounter = 1.0.1
numpy = 1.14.2

jonsnowseven commented 5 years ago

Hello. I am having the same problem...

Any workaround or solution available?

Thank you in advance.
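
One possible stopgap, not verified against bounter: run the dump in a short-lived child process, so that any memory allocated during pickling is returned to the OS when the child exits. The sketch below assumes a Unix system where the 'fork' start method is available (the child inherits the object without it being copied up front); dump_in_child and _dump are hypothetical helper names, and whether this avoids the growth reported above is an assumption.

```python
import multiprocessing as mp
import pickle as pkl

def _dump(obj, path):
    # Runs in the child: pickle the inherited object to disk.
    with open(path, 'wb') as f:
        pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)

def dump_in_child(obj, path):
    """Pickle obj to path from a forked child process (Unix only)."""
    ctx = mp.get_context('fork')  # assumption: the fork start method is available
    proc = ctx.Process(target=_dump, args=(obj, path))
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        raise RuntimeError('child process failed with exit code %s' % proc.exitcode)
```

In the reproduction script, this would replace the pkl.dump call with dump_in_child(counter, 'counter.pkl'). It does not fix the underlying problem; it only keeps the extra allocations out of the long-running process.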