nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License
404 stars 405 forks source link

Sarek stops after alignment and does not perform variant calling #1313

Closed sci-kai closed 11 months ago

sci-kai commented 11 months ago

Description of the bug

Hi, I recently run sarek with the UMI pipeline for WES data, it successfully finishs but only performs alignment and does not perform variant calling. For the UMI pipeline I skipped markduplicates and baserecalibration and extract read names from FASTQ file according to issue #746 . Also, this workflow is running in offline mode. I do not know what configuration to change so the workflow also perform variant calling.

Command used and terminal output

nextflow run nf-core-sarek_3.3.2/3_3_2 \
-profile singularity \
-c config/run.config \
-params-file config/nf-params.json \
-plugins nf-validation@1.0.0 \
-resume

The workflow performs software dump and multiQC after finishing the following process for each sample: NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM2_MEM


### Relevant files

`run.config` file:

process {
    // extract UMIS from read names, also see https://github.com/nf-core/sarek/issues/746
    withName: 'FASTQTOBAM' {
        ext.args         = '--extract-umis-from-read-names'
    }

    // Increase memory for fgbio processes
    withName: 'GROUPREADSBYUMI' {
        publishDir       = [
            [   path: { "${params.outdir}/reports/umi/" },
                mode: params.publish_dir_mode,
                pattern: "*.{txt}"
            ]
        ]
        memory           = 90.GB
        ext.args         = '-Xmx80g'
    }

    withName: 'CALLUMICONSENSUS' {
        memory           = 90.GB
        ext.args         = '-Xmx80g -M 1 -S Coordinate'
        ext.prefix       = {"${meta.id}_umi-consensus"}
        publishDir       = [
            path: { "${params.outdir}/preprocessing/umi/${meta.sample}" },
            mode: params.publish_dir_mode,
            pattern: "*.{bam}"
        ]
    }
}

`nf-params.json file:

{ "input": "config/samplesheet.csv", "step": "mapping", "outdir": "results/", "fasta": "db/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta", "split_fastq": 50000000, "wes": true, "intervals": "data/Target_Region_Hg38_V8_chr.bed", "nucleotides_per_second": 200000.0, "no_intervals": false, "tools": "freebayes,mutect2,strelka", "skip_tools": "markduplicates,baserecalibrator", "trim_fastq": false, "clip_r1": 0, "clip_r2": 0, "three_prime_clip_r1": 0, "three_prime_clip_r2": 0, "trim_nextseq": 0, "save_trimmed": false, "umi_read_structure": "NA", "group_by_umi_strategy": "Adjacency", "save_split_fastqs": false, "aligner": "bwa-mem2", "save_mapped": true, "save_output_as_bam": true, "concatenate_vcfs": false, "only_paired_variant_calling": true, "joint_germline": false, "joint_mutect2": false, "ascat_min_base_qual": 20.0, "ascat_min_counts": 10.0, "ascat_min_map_qual": 35.0, "cf_coeff": 0.05, "cf_contamination_adjustment": false, "cf_contamination": 0.0, "cf_minqual": 0.0, "cf_mincov": 0.0, "cf_ploidy": "2", "ignore_soft_clipped_bases": false, "sentieon_haplotyper_emit_mode": "variant", "vep_cache": "db/vep/", "vep_include_fasta": false, "vep_dbnsfp": false, "dbnsfp_fields": "rs_dbSNP,HGVSc_VEP,HGVSp_VEP,1000Gp3_EAS_AF,1000Gp3_AMR_AF,LRT_score,GERP++_RS,gnomAD_exomes_AF", "vep_loftee": false, "vep_spliceai": false, "vep_spliceregion": false, "vep_custom_args": "--everything --filter_common --per_gene --total_length --offline --format vcf", "use_annotation_cache_keys": false, "vep_out_format": "vcf", "genome": "GATK.GRCh38", "save_reference": true, "build_only_index": false, "download_cache": false, "igenomes_base": "db/references", "igenomes_ignore": false, "custom_config_version": "master", "custom_config_base": "nf-core-sarek_3.3.2/configs", "test_data_base": null, "seq_platform": "ILLUMINA", "max_cpus": 16, "max_memory": "128.GB", "max_time": "240.h", "help": false, "version": false, "publish_dir_mode": "copy", "plaintext_email": false, "max_multiqc_email_size": "25.MB", "monochrome_logs": false, "validate_params": true, "validationShowHiddenParams": false, "validationFailUnrecognisedParams": false, "validationLenientMode": false, "snpeff_db": null, }



### System information

Nextflow version: 23.04.3
Hardware: HPC
Executor: local
Container engine: singularity
version of sarek: 3.3.2
asp8200 commented 11 months ago

I see you have "umi_read_structure": "NA" in your params-file. I wonder if that is valid.

I was able to run

nextflow run nf-core/sarek -r 3.3.2 -profile singularity,test_cache,umi -plugins nf-validation@1.0.0 --outdir results --skip_tools markduplicates,baserecalibrator --tools freebayes,strelka

but when I added --umi_read_structure NA to that cmd, FastqToBam crashed with

  Argument 'read-structures' could not be constructed from string
    ...Could not build a 'ReadStructure' from a string: NA: null
FriederikeHanssen commented 11 months ago

Can you please provide the .nextflow.log?

sci-kai commented 11 months ago

nextflow.log

Yes, I did not run the software version dump in this run since this produced another error probably unrelated to this.

sci-kai commented 11 months ago

@asp8200 I already discussed this setting in issue #802 and it worked in previous versions (but with other data and slightly different configuration). I added the flag --extract-umis-from-read-names to fgbio in my run.config file which extracts the UMIs from the read names. I need to give the 'NA' value to --umi_read_structure to trigger the run of the UMI workflow, otherwise it just skips this process.

FriederikeHanssen commented 11 months ago

do you also have the pipeline_info/execution_trace.html?

sci-kai commented 11 months ago

execution_trace_2023-11-06_09-55-29.txt Here you go

FriederikeHanssen commented 11 months ago

Hey! I am unable to reproduce. I don't have any UMI "real" data at the moment. Any chance you can extract a few reads and send them as a minimal reproducible example?

sci-kai commented 11 months ago

I cannot share from the current data, but I try to reproduce the error with another minimal datasets which I can share. I tried myself finding the problem and run Sarek without the UMI workflow, which still produces the same behaviour. Changing the following parameters:

"umi_read_structure": null,
"group_by_umi_strategy": null,

and removing the FASTQTOBAM modification in run.config (and running for only one sample). It still stops at the exact same timepoint: nextflow.log execution_trace_2023-11-06_11-50-07.txt

FriederikeHanssen commented 11 months ago

very odd. I have non-umi data. I'll check if I see the same when I run it with that. I have a hunch actually, can you check what happens if you run with --split_fastq 0 ?

sci-kai commented 11 months ago

I tested the --split_fastq 0 option, still has the same result. For the non-UMI workflow I also set --skip_tools null. I prepared some reads from the newest GIAB HG008 tumor-normal Illumina WGS data which also reproduced the error, but did not run the UMI workflow. For clarity I will provide all config files again for this minimal example:

sampledata (HG008, Illumina WGS, first 1000 lines (250 reads) extracted from here): HG008-Illumina-250reads.tar.gz .nextflow.log: nextflow.log execution_trace: execution_trace_2023-11-06_18-03-49.txt

nextflow run nf-core-sarek_3.3.2/3_3_2 \
-profile singularity \
-params-file config/nf-params-GIAB.json \
-plugins nf-validation@1.0.0 \
-resume

samplesheet.csv:

patient,status,lane,sample,sex,fastq_1,fastq_2
HG008,0,HG008-normal,1,XX,data/GIAB_HG008/HG008-N_Illumina_R1.fastq.gz,data/GIAB_HG008/HG008-N_Illumina_R2.fastq.gz
HG008,1,HG008-tumor,1,XX,data/GIAB_HG008/HG008-T_Illumina_R1.fastq.gz,data/GIAB_HG008/HG008-T_Illumina_R2.fastq.gz

nf-params-GIAB.json:

{
    "input": "config/samplesheet-GIAB.csv",
    "step": "mapping",
    "outdir": "results/",
    "fasta": "db/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta",
    "split_fastq": 50000000,
    "wes": true,
    "intervals": "data/Target_Region_Hg38_V8_chr.bed",
    "nucleotides_per_second": 200000.0,
    "no_intervals": false,
    "tools": "freebayes,mutect2,strelka",
    "skip_tools": null,
    "trim_fastq": false,
    "clip_r1": 0,
    "clip_r2": 0,
    "three_prime_clip_r1": 0,
    "three_prime_clip_r2": 0,
    "trim_nextseq": 0,
    "save_trimmed": false,
    "umi_read_structure": null,
    "group_by_umi_strategy": null,
    "save_split_fastqs": false,
    "aligner": "bwa-mem2",
    "save_mapped": true,
    "save_output_as_bam": true,
    "concatenate_vcfs": false,
    "only_paired_variant_calling": true,
    "joint_germline": false,
    "joint_mutect2": false,
    "ascat_min_base_qual": 20.0,
    "ascat_min_counts": 10.0,
    "ascat_min_map_qual": 35.0,
    "cf_coeff": 0.05,
    "cf_contamination_adjustment": false,
    "cf_contamination": 0.0,
    "cf_minqual": 0.0,
    "cf_mincov": 0.0,
    "cf_ploidy": "2",
    "ignore_soft_clipped_bases": false,
    "sentieon_haplotyper_emit_mode": "variant",
    "vep_cache": "db/vep/",
    "vep_include_fasta": false,
    "vep_dbnsfp": false,
    "dbnsfp_fields": "rs_dbSNP,HGVSc_VEP,HGVSp_VEP,1000Gp3_EAS_AF,1000Gp3_AMR_AF,LRT_score,GERP++_RS,gnomAD_exomes_AF",
    "vep_loftee": false,
    "vep_spliceai": false,
    "vep_spliceregion": false,
    "vep_custom_args": "--everything --filter_common --per_gene --total_length --offline --format vcf",
    "use_annotation_cache_keys": false,
    "vep_out_format": "vcf",
    "genome": "GATK.GRCh38",
    "save_reference": true,
    "build_only_index": false,
    "download_cache": false,
    "igenomes_base": "db/references",
    "igenomes_ignore": false,
    "custom_config_version": "master",
    "custom_config_base": "nf-core-sarek_3.3.2/configs",
    "test_data_base": null,
    "seq_platform": "ILLUMINA",
    "max_cpus": 16,
    "max_memory": "128.GB",
    "max_time": "240.h",
    "help": false,
    "version": false,
    "publish_dir_mode": "copy",
    "plaintext_email": false,
    "max_multiqc_email_size": "25.MB",
    "monochrome_logs": false,
    "validate_params": true,
    "validationShowHiddenParams": false,
    "validationFailUnrecognisedParams": false,
    "validationLenientMode": false,
    "snpeff_db": null,
}
itszhengan commented 11 months ago

Same thing happened here.

FriederikeHanssen commented 11 months ago

Can you test with the test_full profile to see if that works you?

sci-kai commented 11 months ago

I now also tried running it locally and a colleague at another institute tried the minimal example, both giving the same error as described. The test profile is working as expected also running the variant calling, while the test_full is getting stucked and does not finish at all. I am not sure if it just needs a lot of time and this is expected? Here is the nextflow.log and execution trace from the manually interrupted test_full run: nextflow.log execution_trace_2023-11-10_13-12-01.txt

sci-kai commented 11 months ago

I identified the problem: Within the samplesheet, I put the wrong header for lane and sample. Hence, I specified two different lanes ("HG008-normal" and "HG008-tumor") for one sample ("1") which then does not performs somatic variant calling. Fixing this with switching the header labels solved this issue.

maxulysse commented 11 months ago

Is the issue solved for you then?

FriederikeHanssen commented 11 months ago

Fantastic solve 🚀 . I am not sure we can lint for this, but we should keep it in mind