nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

joint germline vcftools error #918

Open TonyKess opened 1 year ago

TonyKess commented 1 year ago

Description of the bug

Joint germline genotyping completes the genotyping itself, but the pipeline then fails at the vcftools Ts/Tv count step (VCFTOOLS_TSTV_COUNT).

Command used and terminal output

Command:

nextflow run nf-core/sarek --skip_tools baserecalibrator --genome null --igenomes_ignore --joint_germline --intervals Ssal_v3.1_genomic.chroms.bed --fasta Ssal_v3.1_genomic.chroms.fna --input salmo5samp.csv -profile docker --tools haplotypecaller,manta -resume

Error:

Error executing process > 'NFCORE_SAREK:SAREK:VCF_QC_BCFTOOLS_VCFTOOLS:VCFTOOLS_TSTV_COUNT (joint_variant_calling)'

Caused by: Process NFCORE_SAREK:SAREK:VCF_QC_BCFTOOLS_VCFTOOLS:VCFTOOLS_TSTV_COUNT (joint_variant_calling) terminated with an error exit status (139)

Command executed:

vcftools \
    --gzvcf joint_germline.vcf.gz \
    --out joint_germline \
    --TsTv-by-count \
     \

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:VCF_QC_BCFTOOLS_VCFTOOLS:VCFTOOLS_TSTV_COUNT":
    vcftools: $(echo $(vcftools --version 2>&1) | sed 's/^.*VCFtools (//;s/).*//')
END_VERSIONS

Command exit status: 139

Command output: (empty)

Command error:

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
        --gzvcf joint_germline.vcf.gz
        --out joint_germline
        --TsTv-by-count

Using zlib version: 1.2.11
Warning: Expected at least 2 parts in FORMAT entry: ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
Warning: Expected at least 2 parts in FORMAT entry: ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
Warning: Expected at least 2 parts in FORMAT entry: ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
Warning: Expected at least 2 parts in FORMAT entry: ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
After filtering, kept 5 out of 5 Individuals
Outputting Ts/Tv by Alternative Allele Count
After filtering, kept 9896941 out of a possible 9896941 Sites
Run Time = 49.00 seconds
.command.sh: line 7: 27 Segmentation fault (core dumped) vcftools --gzvcf joint_germline.vcf.gz --out joint_germline --TsTv-by-count

Work dir: /genomics/Tony/Atlantic_Salmon/work/8d/0c7967703f8969b8e8948f2bf3fd38

Tip: when you have fixed the problem you can continue the execution adding the option -resume to the run command line

maxulysse commented 1 year ago

Segmentation fault sounds bad. Any chance it works again if you resume?
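
For context, exit status 139 is how a shell reports a process that was killed by signal 11 (SIGSEGV on Linux): statuses above 128 encode 128 plus the signal number. A minimal sketch of that arithmetic follows; the helper name `signal_number` is made up for illustration and is not part of Nextflow or vcftools.

```shell
#!/bin/sh
# Convert a shell exit status above 128 into the number of the terminating signal.
# signal_number is a hypothetical helper, used here only to show the arithmetic.
signal_number() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0   # the process exited on its own; no signal was involved
    fi
}

sig=$(signal_number 139)
echo "exit status 139 -> signal $sig"

# Ask the shell for the signal's name
kill -l "$sig"
```

So the 139 in the Nextflow error and the "Segmentation fault" line in the vcftools stderr describe the same event.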

TonyKess commented 1 year ago

Same message, unfortunately. This is using the joint-germline workflow with a species that has no prior genomic resources (known indels, dbSNP, etc.). The joint-germline.vcf itself actually seems fine, but it looks like vcftools is looking for something in it that isn't there?

TonyKess commented 1 year ago

Quick update: disabling vcftools seems to let the pipeline run to completion.
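
One way to disable the step, assuming the Sarek release in use accepts `vcftools` as a `--skip_tools` value (worth checking against the parameter docs for your version), is to add it to the existing list. The sketch below only assembles and prints the rerun command from the original report; it does not launch anything.

```shell
#!/bin/sh
# Dry run: build the rerun command as a string and print it.
# Assumption: "vcftools" is a valid --skip_tools entry in your Sarek version.
skip_tools="baserecalibrator,vcftools"

# Backslash-newline inside double quotes joins the lines into one string.
rerun_cmd="nextflow run nf-core/sarek \
--skip_tools ${skip_tools} \
--genome null --igenomes_ignore --joint_germline \
--intervals Ssal_v3.1_genomic.chroms.bed \
--fasta Ssal_v3.1_genomic.chroms.fna \
--input salmo5samp.csv \
-profile docker --tools haplotypecaller,manta -resume"

echo "$rerun_cmd"
```

Skipping the tool drops the vcftools QC reports entirely, so it trades the crash for missing Ts/Tv statistics.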

maxulysse commented 1 year ago

Good idea, that would make sense

TonyKess commented 1 year ago

I'm going to try to use some high depth samples to build a SNP/indel reference, and will see if including that info in subsequent runs changes the performance here.

RaphaellaJackson commented 1 year ago

I have encountered the same error using different data. There is a workaround originally proposed by @FriederikeHanssen, which is to set errorStrategy = 'ignore' for this process in the config file. This works because the tool appears to produce good output before the segfault.

Relevant Config Line

// within process {}
withName:VCFTOOLS_TSTV_COUNT {
    errorStrategy = 'ignore'
}

Command Run:

#!/bin/bash
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=12:00:00

module load anaconda3/personal

cd ${PBS_O_WORKDIR}

nextflow run nf-core/sarek  \
-c Good_Imperial.config \
--input Ecoli_Samples.Sarek.csv \
--fasta WT_S295.fna \
--save_reference \
--outdir /rds/general/user/rjackso1/home/Projects/2023_Julian_Ecoli/Sarek_Results \
--igenomes_ignore \
--tools haplotypecaller \
--skip_tools baserecalibrator \
--joint_germline \
-resume

Error Message in Main Outfile

[5b/447b3f] NOTE: Process `NFCORE_SAREK:SAREK:VCF_QC_BCFTOOLS_VCFTOOLS:VCFTOOLS_TSTV_COUNT (joint_variant_calling)` terminated with an error exit status (139) -- Error is ignored

Relevant .command.sh

#!/bin/bash -euo pipefail
vcftools \
    --gzvcf joint_germline.vcf.gz \
    --out joint_germline \
    --TsTv-by-count \
     \

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:VCF_QC_BCFTOOLS_VCFTOOLS:VCFTOOLS_TSTV_COUNT":
    vcftools: $(echo $(vcftools --version 2>&1) | sed 's/^.*VCFtools (//;s/).*//')
END_VERSIONS

Relevant .command.err

WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
        --gzvcf joint_germline.vcf.gz
        --out joint_germline
        --TsTv-by-count

Using zlib version: 1.2.11
Warning: Expected at least 2 parts in FORMAT entry: ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
Warning: Expected at least 2 parts in FORMAT entry: ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
Warning: Expected at least 2 parts in FORMAT entry: ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
Warning: Expected at least 2 parts in FORMAT entry: ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
After filtering, kept 11 out of 11 Individuals
Outputting Ts/Tv by Alternative Allele Count
/rds/general/user/rjackso1/home/Projects/2023_Julian_Ecoli/work/5b/447b3f6899f78519c3a6080f6469b9/.command.sh: line 7:    38 Segmentation fault      vcftools --gzvcf joint_germline.vcf.gz --out joint_germline --TsTv-by-count

Custom Config File Used:

//Profile config names for nf-core/configs
// you must create /tmp and /var/tmp in /rds/general/user/$USER/ephemeral/

params {
    // Config Params
    config_profile_description = 'Imperial College London - HPC Profile'

    // Resources
    max_memory = 920.GB
    max_cpus = 256
    max_time = 1000.h
}

process {
  // base params
  executor = 'pbspro'
  maxRetries = 3
  // resource specific params - modified for imperial queues
  withLabel:process_low {
    cpus = { 1 }
    memory = { 12.GB * task.attempt }
    time = { 4.h * task.attempt }
    errorStrategy = { task.attempt <= 4 ? 'retry' : 'finish' }
  }
  withLabel:process_medium {
    cpus = { 4 * task.attempt }
    memory = { 30.GB * task.attempt }
    time = { 16.h * task.attempt }
    errorStrategy = { task.attempt <= 4 ? 'retry' : 'finish' }
  }
  withLabel:process_high {
    cpus = { 8 * task.attempt }
    memory = { 92.GB * task.attempt }
    time = { 16.h * task.attempt }
    errorStrategy = { task.attempt <= 4 ? 'retry' : 'finish' }
  }
  withName:FASTQC {  // seems to fail when using lower numbers of cores
    cpus = { 8 * task.attempt }
    memory = { 30.GB * task.attempt }
    time = { 4.h * task.attempt }
    errorStrategy = { task.attempt <= 4 ? 'retry' : 'finish' }
  }
  withName:VCFTOOLS_TSTV_COUNT {
    cpus = { 8 * task.attempt }
    memory = { 30.GB * task.attempt }
    time = { 4.h * task.attempt }
    errorStrategy = 'ignore'
  }
}

executor {
    $pbspro {
        queueSize = 49
        submitRateLimit = '10 sec'
    }

    $local {
        cpus = 2
        queueSize = 1
        memory = '6 GB'
    }
}

singularity {
    enabled = true
    autoMounts = true
    runOptions = "-B /rds:/rds,/etc:/etc,/rds/general/user/$USER/ephemeral/tmp:/tmp,/var/tmp:/var/tmp"
}

amizeranschi commented 1 year ago

I'm running into a similar issue with VCFTOOLS_TSTV_COUNT on joint_germline.vcf.gz. Can confirm that adding errorStrategy = 'ignore' in my config got the pipeline to finish successfully.
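
For anyone applying the same workaround, here is a minimal sketch that writes the selector to a standalone file and prints how it would be passed with `-c`. The file name `ignore_vcftools_tstv.config` is arbitrary, and `<other options>` stands in for your usual Sarek arguments.

```shell
#!/bin/sh
# Write the workaround to its own config file (the file name is an arbitrary choice).
config_file="ignore_vcftools_tstv.config"

cat > "$config_file" <<'EOF'
process {
    // Let the run continue when this one QC step crashes
    withName:VCFTOOLS_TSTV_COUNT {
        errorStrategy = 'ignore'
    }
}
EOF

# Print (not run) the launch command; replace the placeholder with your own options
echo "nextflow run nf-core/sarek -c $config_file <other options>"
```

Because the error is only ignored, the segfault still shows up in the log as "Error is ignored", and any report that depends on this step may be incomplete.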

C2i-PeterChung commented 6 months ago

I am new to Nextflow and have the same problem. I followed the steps above and added a config file like this:

touch nextflow.config
nano nextflow.config

process {
    withName:VCFTOOLS_TSTV_COUNT {
        errorStrategy = 'ignore'
    }
}

nextflow -bg run nf-core/sarek -r 3.4.0 -params-file params.json -profile docker -c nextflow.config

I placed the config file locally and ran the command, but the pipeline does not seem to read my config file:

Core Nextflow options
  revision       : 3.4.0
  runName        : tender_mccarthy
  containerEngine: docker
  launchDir      : /data/run3
  workDir        : /data/run3/work
  projectDir     : /root/.nextflow/assets/nf-core/sarek
  userName       : root
  profile        : docker
  configFiles    :

The configFiles entry is empty.
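
The thread does not establish why configFiles is blank here. One thing that can be ruled out cheaply is a path problem, for example the file having been saved somewhere other than the launch directory. The sketch below is a pre-flight check under that assumption; `check_config` is a hypothetical helper, and the command is printed rather than executed.

```shell
#!/bin/sh
# Pre-flight check: fail early if the config file is missing or empty.
# check_config is a hypothetical helper, not a Nextflow command.
check_config() {
    cfg=$1
    if [ ! -s "$cfg" ]; then
        echo "config not found or empty: $cfg" >&2
        return 1
    fi
    # Print the absolute path so it can be passed to -c unambiguously
    echo "$(cd "$(dirname "$cfg")" && pwd)/$(basename "$cfg")"
}

# Example: nextflow.config is expected in the current (launch) directory
if cfg_path=$(check_config nextflow.config); then
    echo "would run: nextflow -bg run nf-core/sarek -r 3.4.0 -params-file params.json -profile docker -c $cfg_path"
else
    echo "fix the config path before launching"
fi
```

Passing an absolute path to `-c` removes any dependence on which directory the command is started from.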

tavareshugo commented 5 months ago

Just to report that I got the same error doing joint germline calling on human samples (3 individuals from public "genome in a bottle" data). The hack of ignoring the error is a workaround, but the error is still there.

I tried to troubleshoot a bit:

Possibly this is related to this open issue on the vcftools repo.