
np.loadtxt leaks memory (Trac #2198) #5988

Open numpy-gitbot opened 12 years ago

numpy-gitbot commented 12 years ago

Original ticket http://projects.scipy.org/numpy/ticket/2198 on 2012-08-09 by trac user tanriol, assigned to unknown.

The amount of memory leaked far exceeds the amount the loaded data takes and does not go away when the loaded array is deleted.

>>> import numpy as np
>>> # Python consuming 13M RAM
>>> arr = np.zeros((10000, 5000), dtype='<i8')
>>> # Python consuming 394M RAM
>>> np.savetxt('array.txt', arr, '%d')
>>> # Python consuming 395M RAM
>>> del arr
>>> # Python consuming 14M RAM
>>> arr = np.loadtxt('array.txt', dtype='<i8')
>>> # Python consuming 2245M(!)
>>> del arr
>>> # Python consuming 1863M(!)
>>> import gc
>>> gc.collect()
6
>>> # Python consuming 1863M(!)

CPython 2.7.3, Numpy 1.6.2
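
For reference, a minimal sketch that reproduces the measurement programmatically; it assumes the third-party psutil package purely for reading the process RSS (the figures above were presumably read from an OS monitor):

import os
import psutil  # third-party, assumed available just for the measurement
import numpy as np

def rss_mb():
    # current resident set size of this process, in MiB
    return psutil.Process(os.getpid()).memory_info().rss / (1024.0 * 1024.0)

print('start:          %5.0f MiB' % rss_mb())
arr = np.zeros((10000, 5000), dtype='<i8')
np.savetxt('array.txt', arr, '%d')
del arr
print('before loadtxt: %5.0f MiB' % rss_mb())
arr = np.loadtxt('array.txt', dtype='<i8')
print('after loadtxt:  %5.0f MiB' % rss_mb())
del arr
print('after del:      %5.0f MiB' % rss_mb())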

numpy-gitbot commented 12 years ago

trac user tanriol wrote on 2012-08-09

BTW, as 1.6.2 was not available in the 'Version' field, 1.6.1 was selected.

numpy-gitbot commented 12 years ago

@rgommers wrote on 2012-08-10

I can reproduce the memory usage, but executing gc.collect() before and after the loadtxt call shows there are no new unreachable objects being created. Also, after exiting the interpreter the memory is freed. Hence this isn't a memory leak.
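
A minimal version of that check, assuming the array.txt file from the session above:

import gc
import numpy as np

gc.collect()  # flush anything already unreachable
arr = np.loadtxt('array.txt', dtype='<i8')
print('unreachable after loadtxt:', gc.collect())  # collect() returns the count found
del arr
print('unreachable after del:', gc.collect())

If the counts come back zero, no new unreachable objects were created by the call, which matches the observation above.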

The gc.collect docs say that some objects, in particular ints and floats, aren't freed by it. So it looks like that is the case here.

numpy-gitbot commented 12 years ago

trac user tanriol wrote on 2012-08-11

Do you mean that this kind of excessive memory consumption is likely unfixable in numpy? If so, are there any efficient workarounds to load CSV files without such overconsumption?

numpy-gitbot commented 12 years ago

@rgommers wrote on 2012-08-11

The memory consumption shouldn't be a problem as long as the memory is freed before you start swapping to disk. This should be handled by the memory allocator of your OS.

If you do see swapping, it is a serious problem. IIRC pandas has a different loadtxt function (implemented in C) which you could try.
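
The function presumably meant here is pandas.read_csv; a minimal sketch of using it on the file from the session above (the parameter choices are assumptions, not from the original comment):

import pandas as pd

# array.txt was written by np.savetxt with the default single-space delimiter
df = pd.read_csv('array.txt', sep=' ', header=None, dtype='int64')
arr = df.values  # the plain numpy array backing the DataFrame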

numpy-gitbot commented 12 years ago

trac user tanriol wrote on 2012-08-12

Yes, swapping is observed with the real data files (which are larger). Pandas leaks less memory, but still too much. I'll probably have to use pandas with its lazy chunk-by-chunk loading, as neither loadtxt nor genfromtxt seems able to build numpy arrays chunk-by-chunk. A sketch of that approach is below.
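
One way to do the chunk-by-chunk loading is pandas' read_csv with its chunksize option; load_in_chunks here is a hypothetical helper, and it assumes the array shape is known in advance:

import numpy as np
import pandas as pd

def load_in_chunks(path, n_rows, n_cols, chunksize=100000):
    # preallocate the result and fill it chunk-by-chunk, so only one
    # chunk's worth of parsing overhead is alive at any time
    out = np.empty((n_rows, n_cols), dtype='int64')
    start = 0
    for chunk in pd.read_csv(path, sep=' ', header=None,
                             dtype='int64', chunksize=chunksize):
        out[start:start + len(chunk)] = chunk.values
        start += len(chunk)
    return out

arr = load_in_chunks('array.txt', 10000, 5000)

Since each chunk's temporaries are released before the next chunk is parsed, peak overhead stays bounded by the chunk size rather than the file size.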

numpy-gitbot commented 12 years ago

@rgommers wrote on 2012-08-12

Then it's probably a Python issue; see these links:

http://pushingtheweb.com/2010/06/python-and-tcmalloc/

http://hg.python.org/cpython/rev/f8a697bc3ca8 (the claim is that the issue is fixed, or at least improved, in Python 3.3)

http://mail.scipy.org/pipermail/numpy-discussion/2011-May/056427.html

Not sure if rebuilding Python against tcmalloc is possible and worth it for you. If that does solve the issue, that would be good to know.