Open npklein opened 7 years ago
First look comment:
Parameter files chromosomes_X_Y.csv and chromosome_chunks.csv contain related data. Is there a reason why the CHR column from chromosomes_X_Y.csv cannot be merged into chromosome_chunks.csv, i.e.:
CHR, chromosomeChunk
1, 1:1-5500000
1, 1:4500001-10500000
1, 1:9500001-15500000
1, 1:14500001-20500000
[...]
2, 2:1-5500000
2, 2:4500001-10500000
2, 2:9500001-15500000
2, 2:14500001-20500000
[...]
I'd expect that to speed things up by a factor of 25 or so.
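To illustrate where such a gain could come from, here is a minimal Python sketch. It assumes the two separate parameter files are currently combined as a full cross product; the chromosome names and chunk strings are made-up sample values, and none of this is Molgenis Compute code.

```python
from itertools import product

# Hypothetical contents of the two separate parameter files.
chromosomes = ["1", "2", "3"]
chunks = [
    "1:1-5500000", "1:4500001-10500000",
    "2:1-5500000", "2:4500001-10500000",
    "3:1-5500000", "3:4500001-10500000",
]

# Separate files (assumed behaviour): every chromosome is paired with every chunk.
crossed = list(product(chromosomes, chunks))

# Merged file: one row per chunk, with CHR taken from the chunk's own prefix.
merged = [(chunk.split(":", 1)[0], chunk) for chunk in chunks]

# Of the crossed pairs, only those whose chromosome matches the chunk are meaningful.
valid = [(chrom, chunk) for chrom, chunk in crossed if chunk.startswith(chrom + ":")]

print(len(crossed), len(merged), valid == merged)
```

In this toy case the cross product has three times as many rows as the merged table, one factor per chromosome value. With roughly 25 chromosome values, that would presumably account for the estimated factor of 25.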
Talked with Niek and Freerk. Two questions:
Answer to number 2:
Behaviour of #list parameters is not completely specified, but what specifications exist can be found here: http://molgenis.github.io/pipelines/mc-parameters#3listsofparameters
Behaviour depends on which file the parameters are defined in(!). I find this odd and impractical: I'd think you should be able to specify the parameter space any way you like, and it should then be collapsed for each step script depending on the parameters defined in that script.
I suspect the reason for this dependence is that the implementation of #list filters the parameters in the original file.
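The file-independent behaviour I have in mind could look roughly like the Python sketch below. This is only an illustration of collapsing a parameter space per step, not how #list is implemented in Molgenis Compute; the function name collapse and the sample rows are hypothetical.

```python
def collapse(rows, used_params):
    """Project each row onto used_params and drop duplicate combinations.

    The first-seen order of the combinations is preserved.
    """
    seen = set()
    collapsed = []
    for row in rows:
        key = tuple(row[name] for name in used_params)
        if key not in seen:
            seen.add(key)
            collapsed.append(dict(zip(used_params, key)))
    return collapsed


# Hypothetical fully expanded parameter space, regardless of source file.
rows = [
    {"sample": "s1", "CHR": "1", "chromosomeChunk": "1:1-5500000"},
    {"sample": "s1", "CHR": "1", "chromosomeChunk": "1:4500001-10500000"},
    {"sample": "s2", "CHR": "1", "chromosomeChunk": "1:1-5500000"},
    {"sample": "s2", "CHR": "1", "chromosomeChunk": "1:4500001-10500000"},
]

# A step that only uses "sample" gets one job per sample.
per_sample = collapse(rows, ["sample"])

# A step that only uses CHR and chromosomeChunk gets one job per chunk.
per_chunk = collapse(rows, ["CHR", "chromosomeChunk"])

print(len(per_sample), len(per_chunk))
```

The point of the design is that the number of jobs for a step depends only on the parameters that step uses, not on which file happened to define them.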
Things to do:
Specify how #list parameters should work and implement it.
For now:
Answer to 1: Parameter solving is no longer done using FreeMarker and looks reasonably efficient to me. The slowness comes from generating far too many jobs for this combination of chromosome, chunk and sample.
With a dataset that has multiple parameter files, parsing one of the parameter files takes very long. The issue can be reproduced by running generateJobs_phasing.sh from https://github.com/molgenis/molgenis-compute/pull/266.