Closed jacobrh91 closed 6 years ago
Update: during my testing on distributed Spark on iForge, each call to the lookup function appears to take 10 to 15 seconds (!). This has slowed things down tremendously.
This was on Aug. 27th. Another user of the cluster was running a program that tends to overwhelm the file system (MAKER), and this may have caused the problem.
The next day, Aug. 28th, it appeared to run more smoothly at first, but the lookup function soon slowed down to 10 to 15 seconds again. The same MAKER program that was running yesterday is still running. I should wait until it's finished before testing again.
This was fixed in PR #13
While testing, Jacob noticed that most tasks listed in the Spark GUI were calls to the lookup function at SEAMSDense.scala line 149.
This is a lookup in an RDD to gather information about the SNPs that were previously included in the model. It searches a key-value RDD and then takes the first value, because we have additional knowledge that each key is present only once.
This lookup is invoked over and over again during execution, and takes 0.1 s every time.
These calls add up, and probably have a substantial impact on the final walltime.
Investigate whether a more efficient alternative exists, where we don't have to repeatedly search the full RDD for this information.
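For reference, here is a minimal sketch of two alternatives that might be worth trying. It is not the code from SEAMSDense.scala or from the eventual fix: the object name, the `snpInfo` RDD, its keys, and its values are all hypothetical stand-ins, and local mode is assumed only so the example runs on its own. Option 1 gives the RDD a known partitioner and caches it, so that `lookup` only has to search the partition the key maps to. Option 2 assumes the SNP table is small enough to fit in memory, collects it once, and broadcasts it as a plain map.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object LookupAlternatives {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, purely for illustration.
    val conf = new SparkConf().setAppName("lookup-alternatives").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical stand-in for the key-value RDD of SNP information.
    val snpInfo = sc.parallelize(Seq(
      "rs1" -> Vector(0.0, 1.0, 2.0),
      "rs2" -> Vector(1.0, 1.0, 0.0),
      "rs3" -> Vector(2.0, 0.0, 1.0)
    ))

    // Pattern described in the issue: without a known partitioner,
    // every lookup runs a job that filters all partitions.
    val viaLookup = snpInfo.lookup("rs2").head

    // Option 1: hash-partition and cache the RDD once, so each lookup
    // only searches the single partition that the key maps to.
    val partitioned = snpInfo.partitionBy(new HashPartitioner(8)).cache()
    val viaPartitioned = partitioned.lookup("rs2").head

    // Option 2 (assumes the table fits in memory): collect once,
    // broadcast the map, and do ordinary in-memory lookups afterwards.
    val snpMap = sc.broadcast(snpInfo.collectAsMap())
    val viaBroadcast = snpMap.value("rs2")

    // All three approaches should return the same value for the key.
    assert(viaLookup == viaPartitioned && viaLookup == viaBroadcast)

    sc.stop()
  }
}
```

Which option is appropriate depends on how large the SNP table gets. The broadcast map avoids launching a Spark job per lookup entirely, while the partitioned RDD still launches a job but scans far less data.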