turi-code / SFrame

SFrame: Scalable tabular and graph data-structures built for out-of-core data analysis and machine learning.
BSD 3-Clause "New" or "Revised" License

Memory Leak #386

Open basejn opened 7 years ago

basejn commented 7 years ago

SFrame is stated to handle large amounts of data without using much RAM, but it leaks memory on simple tasks.

This sample code keeps increasing RAM usage indefinitely and is strangely slow. The slowness can be explained by the disk storage that the library uses to handle large data sets.

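import sframe as sf
# 1000 rows with a string, an integer, and a dict column
data = sf.SFrame({'a':['string']*1000,'b':[1]*1000,'c':[{'key1':1}]*1000})
# Python 2 (xrange); each result is discarded, yet memory keeps growing
for i in xrange(10000):
    a = data.to_numpy()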

Another example is:

# data is the SFrame from the first sample
suma=0
for i in xrange(10000):
    for row in data:
        suma+=row['b']

The RAM usage steadily increases.
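A minimal sketch of how the growth could be put into numbers, assuming a Unix-like system where the standard `resource` module is available. The helper names `peak_rss` and `track_growth` are hypothetical and not part of SFrame. `ru_maxrss` reports the peak resident set size (kilobytes on Linux, bytes on macOS), so readings that keep climbing across rounds suggest memory is not being reused.

```python
import resource


def peak_rss():
    """Peak resident set size of this process, as reported by the OS.

    Units are platform dependent: kilobytes on Linux, bytes on macOS.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def track_growth(task, rounds=5, calls_per_round=100):
    """Call task() repeatedly and record the peak RSS after each round.

    Hypothetical helper: returns rounds + 1 readings, the first one
    taken before any call to task().
    """
    readings = [peak_rss()]
    for _ in range(rounds):
        for _ in range(calls_per_round):
            task()
        readings.append(peak_rss())
    return readings
```

For the first sample, something like `track_growth(lambda: data.to_numpy())` would give one reading per round to compare.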

These are just samples, not real usage.

What I am trying to accomplish with the library is to read the data from the SFrame row by row or batch by batch and aggregate it without loading it all into RAM. Specifically, I want to construct a sparse matrix that I will use for training with gradient descent.

If I iterate over it batch by batch, SFrame holds a lot of RAM after the iteration and doesn't release it. It uses no less memory than the real size of the data, so using it becomes pointless.
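A rough sketch of the batch-by-batch aggregation described above, written against any table that supports `len()`, slicing, and iteration over dict-like rows. It is an assumption here that an SFrame is used this way, and the helper names `iter_batches` and `collect_triplets` are hypothetical. The sketch only collects coordinate (row, column, value) triplets from a dict column such as `'c'` in the sample data; it does not address the memory growth itself.

```python
def iter_batches(table, batch_size):
    """Yield consecutive slices of `table`, each at most batch_size rows."""
    for start in range(0, len(table), batch_size):
        yield table[start:start + batch_size]


def collect_triplets(table, dict_column, batch_size=256):
    """Accumulate COO-style triplets from a column of {feature: value} dicts."""
    rows, cols, vals = [], [], []
    col_index = {}  # feature name -> column number, assigned on first sight
    row_id = 0
    for batch in iter_batches(table, batch_size):
        for record in batch:
            for key, value in record[dict_column].items():
                cols.append(col_index.setdefault(key, len(col_index)))
                rows.append(row_id)
                vals.append(value)
            row_id += 1  # advance even for rows with an empty dict
    return rows, cols, vals, col_index
```

The returned lists could then be handed to a sparse matrix constructor, for example `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n_rows, len(col_index)))` if SciPy is available.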