ncsa / Genomics_EpiQuant


Repeated lookup task in SEAMSDense #6

Closed jacobrh91 closed 6 years ago

jacobrh91 commented 6 years ago

While testing, Jacob noticed that most tasks listed in the Spark GUI were calls to the lookup function at SEAMSDense.scala line 149.

This is a lookup in an RDD to gather information about the SNPs that were previously included in the model. It searches a key-value RDD and then takes the first value, because we know that each key is present only once.

This call is made over and over again during execution, and takes about 0.1 s every time.

These calls add up, and probably have a substantial impact on the final walltime.


Investigate whether a more efficient alternative exists, where we don't have to repeatedly search the full RDD for this information.
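One possible direction, sketched below under stated assumptions: collect the key-value RDD to the driver once and broadcast it as a plain map, so that later lookups become in-memory map accesses instead of Spark jobs. The names `LookupSketch`, `SnpValues`, `viaLookup`, `broadcastTable`, and `demo` are hypothetical and do not come from SEAMSDense.scala. This only works if the table fits in driver and executor memory, and it is not necessarily the approach that was eventually adopted.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object LookupSketch {
  // Hypothetical stand-in for the per-SNP information held in the key-value RDD.
  type SnpValues = Vector[Double]

  // The pattern described above: each call launches a Spark job,
  // then takes the first value because the key is known to be unique.
  def viaLookup(snpTable: RDD[(String, SnpValues)], name: String): SnpValues =
    snpTable.lookup(name).head

  // Alternative: pay the cost of one job up front, then share the result.
  // Assumes the whole table fits in memory on the driver and on each executor.
  def broadcastTable(
      sc: SparkContext,
      snpTable: RDD[(String, SnpValues)]
  ): Broadcast[scala.collection.Map[String, SnpValues]] =
    sc.broadcast(snpTable.collectAsMap())

  // Runs both approaches on a tiny made-up table and returns (lookup result, map result).
  def demo(): (SnpValues, SnpValues) = {
    val conf = new SparkConf().setAppName("lookup-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val snpTable = sc.parallelize(Seq(
        "snp1" -> Vector(0.0, 1.0, 2.0),
        "snp2" -> Vector(1.0, 1.0, 0.0)
      ))
      val slow = viaLookup(snpTable, "snp2")  // one Spark job per call
      val table = broadcastTable(sc, snpTable) // one Spark job in total
      val fast = table.value("snp2")           // plain map access, no job
      (slow, fast)
    } finally {
      sc.stop()
    }
  }
}
```

If the table is too large to collect, another option is to hash-partition and cache the RDD (`partitionBy` with a `HashPartitioner`, then `cache()`), since `lookup` only searches the partition a key maps to when the RDD has a known partitioner. Each call would still launch a job, so the per-call overhead would shrink rather than disappear.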

jacobrh91 commented 6 years ago

Update: during my testing on distributed Spark on iForge, this call to the lookup function appears to take 10 to 15 seconds every time (!!!!!). This has slowed things down tremendously.

jacobrh91 commented 6 years ago

[Screenshot: lookupfunction] This was on Aug. 27th. Another user of the cluster was running a program (MAKER) that tends to overwhelm the file system, and this may have caused the problem.

jacobrh91 commented 6 years ago

[Screenshot: lookupfunction_smooth] The next day, Aug. 28th, it appeared to run more smoothly. But soon the lookup function once again slowed down to 10 to 15 seconds. The same MAKER program that was running yesterday is still running. I should wait until it is not before testing again.

jacobrh91 commented 6 years ago

This was fixed in PR #13