turi-code / SFrame

SFrame: Scalable tabular and graph data-structures built for out-of-core data analysis and machine learning.
BSD 3-Clause "New" or "Revised" License

Eats all RAM and crashes. #383

Open basejn opened 7 years ago

basejn commented 7 years ago

I have a CSV file with 15,000 text documents. I load them into an SFrame, then count the words with .apply, using a predefined vocabulary, and create a new column holding a word-count vector for each document.

The vocabulary size, and therefore the vector size, is 50,000, which means each .apply call generates a 50,000-element array. With a vocabulary of 10,000 there is no problem, but with larger sizes the problem shows up.
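Roughly, the code looks like this (a minimal sketch; the file names documents.csv / vocab.txt and the 'text' column name are just placeholders):

```python
# Rough sketch of the setup above; file names and the 'text' column are placeholders.
import sframe

# ~15,000 documents, one text column
sf = sframe.SFrame.read_csv('documents.csv')

# Predefined vocabulary of ~50,000 terms, each mapped to a vector slot
with open('vocab.txt') as f:
    vocab = {word: i for i, word in enumerate(f.read().split())}

def count_vector(text):
    # Build a dense 50,000-element count vector for one document
    vec = [0.0] * len(vocab)
    for token in text.split():
        idx = vocab.get(token)
        if idx is not None:
            vec[idx] += 1.0
    return vec

# One dense 50,000-float vector per row
sf['counts'] = sf['text'].apply(count_vector)
```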

Generating the vectors is fine; it takes some time (about a minute), but RAM stays within reasonable bounds: peaks of 2-3 GB for the main Python process and around 1.5 GB for each worker. The problem comes when I try to read a row back from the SFrame.

It does not matter whether I use indexing (mySframe[0]) or iteration (for row in mySframe: ...): RAM starts to grow until the process crashes (Windows even shows a message asking me to close programs to prevent data loss). The crash happens on the very first read of the first row (mySframe[0]), not later in the iteration.
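For reference, the two access patterns that trigger it look like this (mySframe being the SFrame with the 'counts' column from the sketch above):

```python
# Either of these blows up memory on the very first row:
row = mySframe[0]        # single-row indexing

for row in mySframe:     # or plain iteration
    break                # crashes before the first row comes back
```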

My final goal is to use these vectors to train a model with SGD. I will only need small batches of data at a time, so I will have to iterate over the dataset a couple of times.
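What I would like to end up with is something like this hypothetical mini-batch loop (batch_size, num_epochs, and the training step itself are placeholders):

```python
# Hypothetical mini-batch loop; the actual SGD update is elided.
batch_size = 128
num_epochs = 5

for epoch in range(num_epochs):
    for start in range(0, len(mySframe), batch_size):
        batch = mySframe[start:start + batch_size]    # SFrame range slice
        X = [row['counts'] for row in batch]          # list of 50,000-float vectors
        # ... feed X (and the corresponding labels) to one SGD step here
```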