molgenis / molgenis-compute

MOLGENIS Compute is a framework for bioinformatics which enables large scale data and computational workflow management in a distributed execution environment.
http://wiki.gcc.rug.nl/wiki/ComputeStart
GNU Lesser General Public License v3.0

Performance issue generating jobs #267

Open npklein opened 7 years ago

npklein commented 7 years ago

With a dataset where I have multiple parameter files, parsing one of the parameter files takes very long. The issue can be reproduced by running generateJobs_phasing.sh from https://github.com/molgenis/molgenis-compute/pull/266.

fdlk commented 7 years ago

First-look comment: the parameter files chromosomes_X_Y.csv and chromosome_chunks.csv contain related data. Is there a reason why the CHR column from chromosomes_X_Y.csv cannot be merged into chromosome_chunks.csv? For example:

CHR, chromosomeChunk
1, 1:1-5500000
1, 1:4500001-10500000
1, 1:9500001-15500000
1, 1:14500001-20500000
[...]
2, 2:1-5500000
2, 2:4500001-10500000
2, 2:9500001-15500000
2, 2:14500001-20500000
[...]

I'd expect that to speed things up by a factor of 25 or so.
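
A rough sketch of where such a factor could come from, assuming that parameters kept in separate files are combined as a full cross product (an assumption about the generator, not something confirmed in this thread). The chromosome and chunk values below are hypothetical toy data, not the contents of the real files:

```python
from itertools import product

# Hypothetical toy data: 3 chromosomes with 2 chunks each (not the real files).
chromosomes = ["1", "2", "X"]
chunks = [
    "1:1-5500000", "1:4500001-10500000",
    "2:1-5500000", "2:4500001-10500000",
    "X:1-5500000", "X:4500001-10500000",
]

# Separate files: assume every CHR value is paired with every chunk.
separate = list(product(chromosomes, chunks))

# Merged file: each chunk carries its own CHR, read from the prefix before ':'.
merged = [(chunk.split(":")[0], chunk) for chunk in chunks]

print(len(separate), len(merged))  # 18 6
```

In this toy model the separate files produce as many times more combinations as there are chromosomes, so a file listing roughly 25 chromosomes would be in line with the estimate above.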

fdlk commented 7 years ago

Talked with Niek and Freerk. Two questions:

  1. Is parameter solving slower than it should be because it fails to collapse the problem sufficiently, i.e. does it solve the same parameter value again for every combination of the independent parameters?
  2. Why does the above example give such a long #list of chromosomeChunks, and why does it stop doing so if you add #list CHR?

fdlk commented 7 years ago

Answer to number 2:

The behaviour of #list parameters is not completely specified, but what specification exists can be found here: http://molgenis.github.io/pipelines/mc-parameters#3listsofparameters

Behaviour depends on which file the parameters are defined in(!). I find this odd and impractical: I'd think you should be able to specify the parameter space any way you like, and it should then be collapsed for each step script depending on the parameters defined in that script.

I suspect the reason for this dependence is that the implementation of #list filters the parameters in the original file.
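
A minimal sketch of the per-step collapsing described above, independent of which file a parameter comes from. This is not how MOLGENIS Compute is implemented; the `collapse` helper, the `sample` column name and the rows are hypothetical and only illustrate the idea:

```python
def collapse(rows, step_params):
    """Project each row onto step_params and drop duplicate combinations,
    keeping first-seen order."""
    seen = set()
    collapsed = []
    for row in rows:
        key = tuple(row[name] for name in step_params)
        if key not in seen:
            seen.add(key)
            collapsed.append(dict(zip(step_params, key)))
    return collapsed


# Hypothetical fully expanded parameter space (2 chunks x 2 samples).
rows = [
    {"CHR": "1", "chromosomeChunk": "1:1-5500000", "sample": "s1"},
    {"CHR": "1", "chromosomeChunk": "1:1-5500000", "sample": "s2"},
    {"CHR": "1", "chromosomeChunk": "1:4500001-10500000", "sample": "s1"},
    {"CHR": "1", "chromosomeChunk": "1:4500001-10500000", "sample": "s2"},
]

# A step that only uses CHR and chromosomeChunk needs one job per chunk.
per_chunk = collapse(rows, ["CHR", "chromosomeChunk"])
# A step that only uses sample needs one job per sample.
per_sample = collapse(rows, ["sample"])

print(len(per_chunk), len(per_sample))  # 2 2
```

With this approach the number of jobs for a step depends only on the distinct value combinations of the parameters that step uses, not on how the parameters are spread over files.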

Things to do:

For now:

fdlk commented 7 years ago

Answer to 1: Parameter solving is no longer done using FreeMarker and looks reasonably efficient to me. The slowness comes from generating far too many jobs for this combination of chromosome, chunk and sample.