nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

Optimize (pre)processing #15

Closed szilvajuhos closed 3 years ago

szilvajuhos commented 5 years ago

Using the "old" 2.3 release version of Sarek on a 48-core node with 700G+ memory, the preprocessing of a 45x/45x tumour/normal WGS pair took 2d 1h 11m 24s, which is actually pretty good. OTOH, there are pretty long parts using only a single CPU: [image: Grafana CPU utilisation graph on Munin]. I know @MaxUlysse already managed to speed up recalibration, would be nice to

EDIT: add splitting fastq files

maxulysse commented 5 years ago

I added splitting fastq to the list, which would be a good improvement, as discussed on the Slack channel with RationalTangle.
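To make the idea concrete, here is a minimal sketch of what splitting a FASTQ file could look like, so that each chunk can be mapped in parallel instead of one long single-file job. It is not how Sarek does it: the function name `split_fastq`, the chunk naming scheme and the `reads_per_chunk` parameter are all hypothetical, and it assumes uncompressed input with strict 4-line records.

```python
import itertools
from pathlib import Path


def split_fastq(src, out_dir, reads_per_chunk):
    """Split a FASTQ file into chunks of at most reads_per_chunk reads.

    Assumptions (illustration only): the input is uncompressed and every
    record is exactly 4 lines (header, sequence, '+', qualities).
    Returns the list of chunk paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * reads_per_chunk
    chunks = []
    with open(src) as handle:
        for index in itertools.count():
            # Read the next block of whole records; stop at end of file.
            block = list(itertools.islice(handle, lines_per_chunk))
            if not block:
                break
            path = out_dir / f"chunk_{index:04d}.fastq"  # hypothetical naming
            path.write_text("".join(block))
            chunks.append(path)
    return chunks
```

For paired-end data, R1 and R2 would need to be split with the same `reads_per_chunk` so that chunk N of R1 still pairs with chunk N of R2.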

szilvajuhos commented 4 years ago

[image: Grafana CPU usage plot]

Hi, this is the latest Grafana plot of CPU usage I have managed to get, from the latest full-blown run of a 90x/90x test set. It takes 4 days and 8 hours:

nextflow run nf-core/sarek -r dev -profile munin --custom_config_base 'https://raw.githubusercontent.com/MaxUlysse/nf-core_configs/MUNIN' --tools Manta,Strelka,HaplotypeCaller,Mutect2,ControlFREEC,ASCAT,snpEff,VEP,merge --monochrome_logs --genome GRCh38 --noGVCF --annotation_cache --snpEff_cache /data1/cache/snpEff --vep_cache /data1/cache/VEP --species homo_sapiens --max_cpus 48 --input ../fastq/swid.tsv

The dips are due to

Other short dips are due to some gather steps (merging BAMs, stats, whatever). These are short and cannot be parallelized. But by resolving the ones mentioned, a complete run should come down to 3 days or so.