Optimize (pre)processing

szilvajuhos commented 5 years ago

Using the "old" 2.3 release version of Sarek on a 48-core node with 700+ GB of memory, the preprocessing of a 45x/45x tumour/normal WGS pair took 2d 1h 11m 24s, which is actually pretty good. OTOH, there are pretty long stretches that use only a single CPU:

[Figure: Grafana CPU utilisation graph for Munin]

I know @MaxUlysse already managed to speed up recalibration; it would be nice to:

[x] split fastq files with https://www.nextflow.io/docs/latest/operator.html#splitfastq
[x] have a look at gains after recalibration speedup
[ ] optimise some of the QC steps
[ ] once we are happy with preprocessing, have a look at the variant calling/annotation parts as well

EDIT: add splitting fastq files
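
To make the idea behind the fastq splitting item concrete, here is a minimal Python sketch of chunking a FASTQ stream so that each chunk could be aligned as a separate job. It is only an illustration, not how Sarek does it (the pipeline would rely on Nextflow's splitFastq operator linked above). The function name `split_fastq` and the chunk size are hypothetical, and the sketch assumes plain 4-line FASTQ records with no wrapped sequence lines.

```python
import io
from itertools import islice


def split_fastq(handle, reads_per_chunk):
    """Yield lists of FASTQ records, at most reads_per_chunk records per list.

    Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    while True:
        # Read the lines for one chunk without loading the whole file.
        lines = list(islice(handle, 4 * reads_per_chunk))
        if not lines:
            return
        if len(lines) % 4:
            raise ValueError("truncated FASTQ record at end of input")
        # Group every 4 lines back into one record string.
        yield ["".join(lines[i:i + 4]) for i in range(0, len(lines), 4)]


# Tiny in-memory example with 5 made-up reads, split into chunks of 2 reads.
demo = io.StringIO("".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5)))
chunks = list(split_fastq(demo, 2))
print([len(chunk) for chunk in chunks])  # [2, 2, 1]
```

Each chunk can then be mapped independently and the resulting BAMs merged afterwards, which is where the extra parallelism comes from.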

maxulysse commented 5 years ago

I added splitting fastq to the list, which would be a good improvement, as discussed on the Slack channel with RationalTangle.

szilvajuhos commented 4 years ago

Hi, this is the latest Grafana plot of CPU usage that I have managed to get, from the most recent full-blown run of a 90x/90x test set. The run takes 4 days and 8 hours:

nextflow run nf-core/sarek -r dev -profile munin --custom_config_base 'https://raw.githubusercontent.com/MaxUlysse/nf-core_configs/MUNIN' --tools Manta,Strelka,HaplotypeCaller,Mutect2,ControlFREEC,ASCAT,snpEff,VEP,merge --monochrome_logs --genome GRCh38 --noGVCF --annotation_cache --snpEff_cache /data1/cache/snpEff --vep_cache /data1/cache/VEP --species homo_sapiens --max_cpus 48 --input ../fastq/swid.tsv

The dips are due to

BamQC unparallelized (from Dec 12, 14:00...): we can run more of them in parallel
Deduplication (Dec 14: 0.00 ... takes 8 hours): Maxime is already playing with Spark in #64
Mpileup for ControlFREEC (Dec 15: 0.00 ...): it looks like we can run more mpileup processes in parallel, as not all the CPUs are used
BamQC again (Dec 16: 9:30...)

Other short dips are due to some gather steps (merging BAMs, stats, whatever). These are short and cannot be parallelized. But by resolving the ones listed above, we should get down to 3 days or so for a complete run.
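
As a rough sanity check on the "run more in parallel" points for BamQC and mpileup, the following Python sketch computes how many tasks fit on the node and how many cores sit idle for a given per-task CPU request. The 48 comes from `--max_cpus 48` in the command above; the per-task CPU value is a made-up assumption for illustration, not Sarek's actual setting, and memory limits are ignored.

```python
def max_concurrent_tasks(total_cpus, cpus_per_task):
    """Number of tasks that fit on the node if CPUs are the only limit."""
    if total_cpus < 1 or cpus_per_task < 1:
        raise ValueError("CPU counts must be positive")
    return total_cpus // cpus_per_task


def idle_cpus(total_cpus, cpus_per_task, running_tasks):
    """Cores left unused while running_tasks tasks are active."""
    used = cpus_per_task * running_tasks
    if used > total_cpus:
        raise ValueError("more CPUs requested than available")
    return total_cpus - used


TOTAL_CPUS = 48                  # from --max_cpus 48 in the run command
HYPOTHETICAL_CPUS_PER_TASK = 16  # assumption for illustration only

print(max_concurrent_tasks(TOTAL_CPUS, HYPOTHETICAL_CPUS_PER_TASK))  # 3
print(idle_cpus(TOTAL_CPUS, HYPOTHETICAL_CPUS_PER_TASK, 1))          # 32
```

Under these assumed numbers, a single running task would leave 32 cores idle, which is the kind of dip visible in the plot; lowering the per-task request or raising the number of concurrent tasks would fill it.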

nf-core / sarek

Optimize (pre)processing #15