I have a CSV file with 15,000 text documents, which I load into an SFrame.
Then I count the words with .apply, using a predefined vocabulary, and create a new column holding a word-count vector.
The vocabulary size, and therefore the vector size, is 50,000. This means that each apply call generates a 50,000-element array.
If the vocabulary is 10,000, for example, there is no problem, but with bigger sizes the problem shows up.
Generating the vectors is fine; it takes some time (about a minute), but RAM stays within reasonable bounds: peaks of 2-3 GB for the main Python process and peaks of 1.5 GB for each worker.
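For reference, the counting step looks roughly like this (a minimal pure-Python sketch of what my .apply function does; the tiny vocab dict and count_words name are just illustrative, the real vocabulary has 50,000 entries):

```python
# Illustrative sketch of the per-document word-count step.
# vocab maps each vocabulary word to its index in the dense vector.
vocab = {"the": 0, "cat": 1, "sat": 2}  # real vocabulary: 50,000 entries

def count_words(text):
    # One dense vector per document: len(vocab) floats, mostly zeros.
    counts = [0.0] * len(vocab)
    for token in text.lower().split():
        idx = vocab.get(token)
        if idx is not None:
            counts[idx] += 1.0
    return counts

row_vector = count_words("The cat sat on the mat")
```

With a 50,000-entry vocabulary, every row carries a 50,000-element list, which is where I suspect the memory pressure comes from.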
The problem comes when I try to get a row from the SFrame,
no matter whether by indexing (mySframe[0]) or by iteration (for row in mySframe: ...).
Then RAM usage starts to grow until the process crashes. (Windows even shows a message asking me to close programs to prevent data loss.)
The problem happens on the very first read of the first row (mySframe[0]), not later in the iteration.
My final goal is to use these vectors to train a model with SGD. I will only need small batches of data at a time, so I will have to iterate over the dataset a couple of times.
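For the training loop, the plan is roughly this kind of mini-batch iteration (a pure-Python sketch; iter_batches and batch_size are illustrative names, and in practice the rows would come from the SFrame):

```python
def iter_batches(rows, batch_size):
    # Yield small lists of rows so only one batch is held in memory at a time.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# Example with dummy data standing in for SFrame rows:
batches = list(iter_batches(range(10), 4))
```

So the question is essentially how to read the rows one (or a few) at a time without the whole materialized dataset blowing up RAM.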