quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

Swap/disk memory problems and runtime analysis #68

Open magsol opened 8 years ago

magsol commented 8 years ago

@quinngroup/bigneuron

I ran 4tasks03.txt last night, and it executed for an hour before crashing due to OSError: [Errno 28] No space left on device. The logs are still up; you can view them under "Jobs" in the BlueData web UI.

I looked through the logs and found a few figures that are informative. First, the memory usage:

[Screenshot (2016-03-12): cluster memory usage during the job]

Swap memory (purple) is, far and away, the biggest problem. This means our intermediate results are becoming so large as the job progresses that they completely saturate the available swap space on the hard disks and cause the job to crash. This is problematic for many reasons.

The next figures show the specific jobs that were executing and which led to the crash.

[Screenshots (2016-03-12): runtimes of the jobs that were executing leading up to the crash]

Note the enormous discrepancy in runtimes between the matrix_vector and vector_matrix operations: the latter is 260x slower than the former, and its flatMap operation is the likely cause of the swap space saturation.
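To make that cost concrete, here is a rough sketch (not our actual code, just an illustration of the pattern) of how u^T * S looks as a flatMap + reduceByKey. Every nonzero entry of u emits one (column, partial product) pair per column of S, so the shuffled intermediate data grows with the number of nonzeros in u times the width of S; the function and variable names here are assumptions for illustration only.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="vector_matrix_sketch")

def vector_matrix(S_rows, u_bc):
    """v = u^T * S, with S_rows an RDD of (row_index, row_as_ndarray)."""
    def emit(row):
        i, s_i = row
        u_i = u_bc.value[i]
        if u_i == 0.0:                      # sparse u => far fewer pairs emitted
            return []
        # These (j, u_i * s_ij) pairs are the shuffle intermediates that
        # spill to disk/swap when u is dense and S is wide.
        return [(j, u_i * s_ij) for j, s_ij in enumerate(s_i)]
    return S_rows.flatMap(emit).reduceByKey(lambda a, b: a + b)

# Tiny usage example with fake data (T rows, P columns are made up here):
T, P = 4, 6
S_local = np.random.rand(T, P)
u = np.zeros(T); u[0] = 1.0                 # a very sparse u
S_rows = sc.parallelize(list(enumerate(S_local)))
v = dict(vector_matrix(S_rows, sc.broadcast(u)).collect())
```

If u stays sparse, the flatMap emits very little; if u densifies, the emitted pairs (and therefore the shuffle and swap usage) blow up, which would match what the memory plot shows.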

Keep in mind: this is the largest dataset, so we expect it to be challenging. On the other hand, our framework still needs to work; it should scale gracefully.

I have a few ideas on how to mitigate these issues--change the orientation of S, optimize the multiplication operations, and others--but I also have a real worry: the amount of swap space we're using shouldn't continually increase. That suggests u is becoming less sparse over time, resulting in more multiplication operations in vector_matrix. Is that possible? My understanding was that it should be going in the opposite direction--becoming more sparse.
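One quick way to confirm or rule that out: log the sparsity of u at every iteration, and optionally hard-threshold it before the vector_matrix step. This is just a sketch of the diagnostic; the tolerance and the parameter R (how many entries to keep) are assumptions, not values from our code.

```python
import numpy as np

def sparsity_report(u, tol=1e-12):
    """Fraction of entries of u that are effectively zero."""
    nnz = int(np.count_nonzero(np.abs(u) > tol))
    return 1.0 - nnz / float(u.size)

def hard_threshold(u, R):
    """Keep the R largest-magnitude entries of u and zero out the rest."""
    v = np.zeros_like(u)
    keep = np.argsort(np.abs(u))[-R:]   # indices of the R largest |u_i|
    v[keep] = u[keep]
    return v
```

If sparsity_report(u) trends downward over iterations, that would explain the steadily growing swap usage.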

Please advise.

MOJTABAFA commented 8 years ago

@magsol As far as I know, the u vector reconstructs the dictionary, so the sparsity of u determines the sparsity of the dictionary, and I don't think the dictionary can be sparse. Is there any way to shrink the size of our u vector? Here the size of u is P, which is the number of observations (columns), and again the problem is that the number of observations is extremely large. So is there a programming solution for partitioning the u vector--for example, dividing u into 10 parts and then merging them, as in the sketch below? I know it seems theoretically impossible, but I just wanted to know for myself.
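Something like this rough sketch is what I mean (just an illustration, not a proposal for our actual code): split u into chunks, process each chunk, and stitch the results back together. I realize it does not reduce the total size of u or of the intermediate results, only the size handled at one time.

```python
import numpy as np

def apply_in_chunks(u, n_chunks, fn):
    """Apply fn to each chunk of u and concatenate the partial results."""
    chunks = np.array_split(u, n_chunks)    # e.g. n_chunks = 10
    return np.concatenate([fn(chunk) for chunk in chunks])
```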