Add example doc with CPU/Mem resources for very large samples

Faizal-Eeman commented 6 months ago

When running very large sample BAMs through call-sSNV, it is likely that the pipeline will fail because of the default resource configurations.

Although the base_resource_update function in template.config is a great utility for updating resources on a case-by-case basis, it is often unclear how much a given resource needs to be changed for a successful run. It would be nice to provide examples of resource configurations that worked for large BAMs, perhaps in a doc/ dir of the repo.

Here are the resources I set in my pipeline run's config:

base_resource_update {
    cpus = [
        ['call_sIndel_Manta', 0.1]
    ]
    memory = [
        [['run_validate_PipeVal', 'call_sSNV_Strelka2', 'call_sSNV_Mutect2', 'call_sIndel_Manta', 'concat_VCFs_BCFtools', 'plot_VennDiagram_R', 'run_LearnReadOrientationModel_GATK', 'convert_BAM2Pileup_SAMtools'], 10],
        [['call_sSNV_MuSE', 'run_sump_MuSE'], 2]
    ]
}

Nextflow trace files

Case 1:

Normal - 2.5TB

Tumor - 1.1TB - /hot/data/unregistered/Zook-Mootor-BNCH-GIAB/analysis/GIAB/AshkenazimParents/somatic-variants/call-sSNV-7.0.0/HG002-T/log-call-sSNV-7.0.0-20231201T000959Z/nextflow-log/trace.txt

Case 2:

Normal - 369GB

Tumor 1 - 312GB - /hot/data/unregistered/Zook-Mootor-BNCH-GIAB/analysis/GIAB/CancerGIAB/metapipeline/metapipeline-DNA-5.3.1/BNCH000122/main_workflow/output/call-sSNV-8.0.0/ZMBNGIAB000008-T001-C01-F/log-call-sSNV-8.0.0-20240421T231004Z/nextflow-log/trace.txt
Tumor 2 - 328GB - /hot/data/unregistered/Zook-Mootor-BNCH-GIAB/analysis/GIAB/CancerGIAB/metapipeline/metapipeline-DNA-5.3.1/BNCH000122/main_workflow/output/call-sSNV-8.0.0/ZMBNGIAB000008-T001-C02-F/log-call-sSNV-8.0.0-20240421T231102Z/nextflow-log/trace.txt

sorelfitzgibbon commented 6 months ago

@Faizal-Eeman, congrats on getting 2.5/1.1 TB samples through the pipeline! To help future users with large input files, I summarized the maximum resources actually used for each process:

Maximum values:

| Tool | realtime | %cpu* | peak_rss |
|---|---|---|---|
| run_validate_PipeVal | 16h 24m | 64% | 20 MB |
| call_sSNV_SomaticSniper | 5d 5h 4m | 85% | 18 GB |
| convert_BAM2Pileup_SAMtools | 9d 1h 46m | 90% | 25 GB |
| call_sIndel_Manta | 4d 4h 24m | 239% | 5 GB |
| call_sSNV_Strelka2 | 23h 33m | 2530% | 20 GB |
| call_sSNV_Mutect2 | 1d 7h 32m | 125% | 23 GB |
| run_LearnReadOrientationModel_GATK | 22m | 103% | 28 GB |
| call_sSNV_MuSE | 1d 9h 27m | 1589% | 119 GB |
| run_sump_MuSE | 3m | 172% | 3 GB |

*don't trust the %cpu numbers
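
For anyone who wants to pull the same per-process maxima out of their own trace.txt, here is a rough Python sketch. It assumes the default tab-separated Nextflow trace layout with `name`, `realtime` and `peak_rss` columns and human-readable values such as `1d 7h 32m` and `23 GB`. The function names are made up for illustration, and the sample rows are invented, not taken from the runs above.

```python
import csv
import io
import re

_MEM_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
_TIME_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_memory(text):
    """Convert a string such as '23 GB' to bytes; return 0 if unparseable (e.g. '-')."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMGT]?B)\s*", text)
    if not match:
        return 0
    return int(float(match.group(1)) * _MEM_UNITS[match.group(2)])


def parse_duration(text):
    """Convert a string such as '1d 7h 32m' to seconds."""
    parts = re.findall(r"([\d.]+)\s*(ms|s|m|h|d)", text)
    return sum(float(number) * _TIME_UNITS[unit] for number, unit in parts)


def summarize_trace(handle):
    """Return {process: {'realtime_s': max seconds, 'peak_rss': max bytes}}."""
    summary = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: the task name looks like 'workflow:process (tag)';
        # keep only the bare process name.
        process = row["name"].split(" (")[0].split(":")[-1]
        entry = summary.setdefault(process, {"realtime_s": 0.0, "peak_rss": 0})
        entry["realtime_s"] = max(entry["realtime_s"], parse_duration(row["realtime"]))
        entry["peak_rss"] = max(entry["peak_rss"], parse_memory(row["peak_rss"]))
    return summary


# Invented rows, only to show the expected input shape.
SAMPLE = "\n".join([
    "\t".join(["task_id", "name", "status", "exit", "realtime", "%cpu", "peak_rss"]),
    "\t".join(["1", "call_sSNV_Mutect2 (chr1)", "COMPLETED", "0", "5h 10m", "100%", "2.1 GB"]),
    "\t".join(["2", "call_sSNV_Mutect2 (chr2)", "COMPLETED", "0", "6h 40m", "100%", "2.8 GB"]),
    "\t".join(["3", "run_sump_MuSE (1)", "COMPLETED", "0", "45s", "170%", "-"]),
])

if __name__ == "__main__":
    for process, stats in summarize_trace(io.StringIO(SAMPLE)).items():
        print(process, stats["realtime_s"], stats["peak_rss"])
```

To run it on a real file, pass an open handle instead of the in-memory sample, e.g. `summarize_trace(open("trace.txt", newline=""))`.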

sorelfitzgibbon commented 6 months ago

Maximum values for `Case 2` tumors 1 and 2:

| Tumor | Tool | realtime | %cpu* | peak_rss |
|---|---|---|---|---|
| tumor1 | call_sSNV_SomaticSniper | 13h 30m | 93% | 3.2 GB |
| tumor2 | call_sSNV_SomaticSniper | 12h 47m | 96% | 2.6 GB |
| tumor1 | convert_BAM2Pileup_SAMtools | 12h 4m | 95% | 13.7 GB |
| tumor2 | convert_BAM2Pileup_SAMtools | 11h 23m | 96% | 13.5 GB |
| tumor1 | call_sIndel_Manta | 1d 12h 52m | 79% | < 1 GB |
| tumor2 | call_sIndel_Manta | 1d 11h 34m | 79% | < 1 GB |
| tumor1 | call_sSNV_Strelka2 | 7h 3m | 703% | 3.3 GB |
| tumor2 | call_sSNV_Strelka2 | 6h 27m | 750% | 2.8 GB |
| tumor1 | call_sSNV_Mutect2 | 6h 40m | 100% | 2.8 GB |
| tumor2 | call_sSNV_Mutect2 | 6h 13m | 100% | 2.5 GB |
| tumor1 | run_LearnReadOrientationModel_GATK | 6m | 98% | 2.7 GB |
| tumor2 | run_LearnReadOrientationModel_GATK | 8m | 98% | 3.2 GB |
| tumor1 | call_sSNV_MuSE | 5h 22m | 1196% | 62.5 GB |
| tumor2 | call_sSNV_MuSE | 6h 10m | 1198% | 67.5 GB |
| tumor1 | run_sump_MuSE | < 1m | 1427% | 9.1 GB |
| tumor2 | run_sump_MuSE | 6m | 1126% | 32.4 GB |

*don't trust the %cpu numbers
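
On the "how much should I update" question from the description, one possible way to turn an observed peak_rss into a memory multiplier is sketched below. This assumes the numbers in base_resource_update act as multipliers on each process's default allocation, which the thread does not state explicitly. The 48 GB base allocation in the usage line is a placeholder, not the pipeline's real default, and the helper name and the 25% headroom are my own choices.

```python
import math


def suggest_multiplier(peak_rss_gb, base_allocation_gb, headroom=1.25):
    """Smallest whole-number multiplier that covers the observed peak plus headroom.

    peak_rss_gb:        observed peak memory of the process, in GB
    base_allocation_gb: the process's default memory allocation, in GB
    headroom:           safety factor applied to the observed peak
    """
    if base_allocation_gb <= 0:
        raise ValueError("base_allocation_gb must be positive")
    needed = peak_rss_gb * headroom / base_allocation_gb
    # Never suggest going below the default allocation.
    return max(1, math.ceil(needed))


if __name__ == "__main__":
    # 67.5 GB is the call_sSNV_MuSE peak for tumor2 in the table above;
    # 48 GB is a HYPOTHETICAL base allocation used only for illustration.
    print(suggest_multiplier(67.5, 48))
```

Check the result against the actual defaults in the config for your node type before using it.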

Faizal-Eeman commented 6 months ago

Here are the failed logs for the 2TB sample that led me to the CPU/memory update in the description. These failed runs used the default memory allocation; whenever a process failed with error code 137, I updated that process's allocation accordingly. I also updated the allocation for processes where I anticipated a memory error 137.

/hot/data/unregistered/Zook-Mootor-BNCH-GIAB/analysis/GIAB/AshkenazimParents/somatic-variants/test/failed/call-sSNV-7.0.0/HG002-T/log-call-sSNV-7.0.0-20231102T215524Z
/hot/data/unregistered/Zook-Mootor-BNCH-GIAB/analysis/GIAB/AshkenazimParents/somatic-variants/test/failed/call-sSNV-7.0.0/HG002-T/log-call-sSNV-7.0.0-20231112T041924Z
/hot/data/unregistered/Zook-Mootor-BNCH-GIAB/analysis/GIAB/AshkenazimParents/somatic-variants/test/failed/call-sSNV-7.0.0/HG002-T/log-call-sSNV-7.0.0-20231114T003236Z
/hot/data/unregistered/Zook-Mootor-BNCH-GIAB/analysis/GIAB/AshkenazimParents/somatic-variants/test/failed/call-sSNV-7.0.0/HG002-T/log-call-sSNV-7.0.0-20231115T032828Z
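
For context on error code 137: shell-style exit statuses above 128 conventionally mean "128 + signal number", so 137 corresponds to signal 9 (SIGKILL). On Linux that is commonly, though not always, the kernel or container runtime killing a task that exceeded its memory limit. A small Python sketch of that decoding follows; the function name is made up for illustration.

```python
import signal


def describe_exit_status(status):
    """Explain a shell-style exit status; values above 128 encode 128 + signal number."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if status == 0:
        return "exited normally"
    return f"failed with exit code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))  # on Linux: killed by SIGKILL
```

A SIGKILL can have other causes, such as a scheduler timeout or a manual kill, so it is worth confirming against the task's log before raising memory.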

tyamaguchi-ucla commented 5 months ago

@sorelfitzgibbon @Faizal-Eeman do you guys think it's worth updating the M64 config based on these results? https://github.com/uclahs-cds/pipeline-call-sSNV/blob/main/config/M64.config

sorelfitzgibbon commented 5 months ago

> @sorelfitzgibbon @Faizal-Eeman do you guys think it's worth updating the M64 config based on these results? https://github.com/uclahs-cds/pipeline-call-sSNV/blob/main/config/M64.config

Yes, it looks like several values can be substantially lowered. I'll work on this.

sorelfitzgibbon commented 5 months ago

@Faizal-Eeman it looks like these files have moved, are they still easily accessible? I'd like to check a couple little things, but not urgent.

Faizal-Eeman commented 5 months ago

> @Faizal-Eeman it looks like these files have moved, are they still easily accessible? I'd like to check a couple little things, but not urgent.

Yes. I've updated the file paths here now.

tyamaguchi-ucla commented 5 months ago

It looks like our configs miss a few processes in SomaticSniper (e.g. generate_ReadCount_bam_readcount) @sorelfitzgibbon
