quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

Swap/disk memory problems and runtime analysis #68

Open magsol opened 8 years ago

magsol commented 8 years ago

@quinngroup/bigneuron

I ran 4tasks03.txt last night, and it executed for an hour before crashing due to OSError: [Errno 28] No space left on device. The logs are still up; you can view them under "Jobs" in the BlueData web UI.

I looked through the logs and found a few figures that are informative. First, the memory usage:

[Screenshot (2016-03-12): cluster memory usage during the job]

Swap memory (purple) is, far and away, the biggest problem. This means our intermediate results are becoming so large as the job progresses that they completely saturate the available swap space on the hard disks and cause the job to crash. This is problematic for many reasons.

The next figures show the specific jobs that were executing and which led to the crash.

[Screenshots (2016-03-12): runtimes of the jobs that were executing leading up to the crash]

Note the enormous discrepancy in runtimes between the matrix_vector and vector_matrix operations: the latter is 260x slower than the former, and its flatMap operation is the likely cause of the swap space saturation.
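To make that cost concrete, here is a rough sketch (not our actual code, just an illustration of the pattern) of how u^T * S looks as a flatMap + reduceByKey. Every nonzero entry of u emits one (column, partial product) pair per column of S, so the shuffled intermediate data grows with the number of nonzeros in u times the width of S; the function and variable names here are assumptions for illustration only.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="vector_matrix_sketch")

def vector_matrix(S_rows, u_bc):
    """v = u^T * S, with S_rows an RDD of (row_index, row_as_ndarray)."""
    def emit(row):
        i, s_i = row
        u_i = u_bc.value[i]
        if u_i == 0.0:                      # sparse u => far fewer pairs emitted
            return []
        # These (j, u_i * s_ij) pairs are the shuffle intermediates that
        # spill to disk/swap when u is dense and S is wide.
        return [(j, u_i * s_ij) for j, s_ij in enumerate(s_i)]
    return S_rows.flatMap(emit).reduceByKey(lambda a, b: a + b)

# Tiny usage example with fake data (T rows, P columns are made up here):
T, P = 4, 6
S_local = np.random.rand(T, P)
u = np.zeros(T); u[0] = 1.0                 # a very sparse u
S_rows = sc.parallelize(list(enumerate(S_local)))
v = dict(vector_matrix(S_rows, sc.broadcast(u)).collect())
```

If u stays sparse, the flatMap emits very little; if u densifies, the emitted pairs (and therefore the shuffle and swap usage) blow up, which would match what the memory plot shows.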

Keep in mind: this is the largest dataset, so we expect it to be challenging. On the other hand, our framework still needs to work; it should scale gracefully.

I have a few ideas on how to mitigate these issues--change the orientation of S, optimize the multiplication operations, and others--but I also have a real worry: the amount of swap space we're using shouldn't continually increase. That suggests u is becoming less sparse over time, resulting in more multiplication operations in vector_matrix. Is that possible? My understanding was that it should be going in the opposite direction--becoming more sparse.
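One quick way to confirm or rule that out: log the sparsity of u at every iteration, and optionally hard-threshold it before the vector_matrix step. This is just a sketch of the diagnostic; the tolerance and the parameter R (how many entries to keep) are assumptions, not values from our code.

```python
import numpy as np

def sparsity_report(u, tol=1e-12):
    """Fraction of entries of u that are effectively zero."""
    nnz = int(np.count_nonzero(np.abs(u) > tol))
    return 1.0 - nnz / float(u.size)

def hard_threshold(u, R):
    """Keep the R largest-magnitude entries of u and zero out the rest."""
    v = np.zeros_like(u)
    keep = np.argsort(np.abs(u))[-R:]   # indices of the R largest |u_i|
    v[keep] = u[keep]
    return v
```

If sparsity_report(u) trends downward over iterations, that would explain the steadily growing swap usage.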

Please advise.

MOJTABAFA commented 8 years ago

@magsol As far as I know, the u vector reconstructs the dictionary, so the sparsity of u determines the sparsity of the dictionary, and I don't think the dictionary can be sparse. Is there any way to shrink the size of our u vector? Here the size of u is P, which is the number of observations (columns), and again the problem is that the number of observations is extremely large. So is there a programming solution for partitioning the u vector--for example, dividing u into 10 parts and then merging them, as in the sketch below? I know it seems theoretically impossible, but I just wanted to know for myself.
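Something like this rough sketch is what I mean (just an illustration, not a proposal for our actual code): split u into chunks, process each chunk, and stitch the results back together. I realize it does not reduce the total size of u or of the intermediate results, only the size handled at one time.

```python
import numpy as np

def apply_in_chunks(u, n_chunks, fn):
    """Apply fn to each chunk of u and concatenate the partial results."""
    chunks = np.array_split(u, n_chunks)    # e.g. n_chunks = 10
    return np.concatenate([fn(chunk) for chunk in chunks])
```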