peterk87 / nf-flu

Influenza genome analysis Nextflow workflow
MIT License

[BUG]: Error in VADR_IRMA process #22

Closed EricFournier3 closed 4 days ago

EricFournier3 commented 3 weeks ago

Is there an existing issue for this?

Description of the Bug/Issue

Hi,

would it be possible to add errorStrategy 'ignore' to the VADR process and still produce the per-sample outputs as in the previous version (consensus, reports, images, etc.)? When this step fails for one sample, the whole pipeline stops. We use nf-flu in our in-house global influenza pipeline, and when it fails this way, none of the other samples are processed.

For your info, this problematic sample (L00955952004) was perfectly processed with the previous version of nf-flu (without the VADR process)

Thanks

Nextflow command-line

nextflow run CFIA-NCFAD/nf-flu -c slurm.config --input $samplesheet.csv -profile singularity,slurm --platform illumina --outdir myoutdir

Error Message

Pipeline execution summary
  ---------------------------
  Completed at : 2024-10-11T11:41:08.780879272-04:00
  Duration     : 1m 38s
  Success      : false
  Results Dir  : /data/devel/nf-flu/results/run_nextseq_14
  Work Dir     : /data/devel/nf-flu/work
executor >  slurm (9), local (1)
[7b/0cb464] process > NF_FLU:ILLUMINA:CHECK_SAMPLE_SHEET (1)                            [100%] 1 of 1 ✔
[-        ] process > NF_FLU:ILLUMINA:READ_COUNT_FAIL_TSV                               -
[b9/09699c] process > NF_FLU:ILLUMINA:READ_COUNT_PASS_TSV                               [100%] 1 of 1 ✔
[78/10c246] process > NF_FLU:ILLUMINA:ZSTD_DECOMPRESS_FASTA                             [100%] 1 of 1 ✔
[9c/8fa54f] process > NF_FLU:ILLUMINA:ZSTD_DECOMPRESS_CSV                               [100%] 1 of 1 ✔
[da/31272a] process > NF_FLU:ILLUMINA:BLAST_MAKEBLASTDB_NCBI (41415330-influenza.fasta) [100%] 1 of 1 ✔
[62/03a47d] process > NF_FLU:ILLUMINA:SETUP_FLU_VADR_MODEL                              [100%] 1 of 1 ✔
[70/e1ed01] process > NF_FLU:ILLUMINA:CAT_ILLUMINA_FASTQ (L00955952004)                 [100%] 1 of 1 ✔
[f1/3aac6a] process > NF_FLU:ILLUMINA:IRMA (L00955952004)                               [100%] 1 of 1 ✔
[22/dd0595] process > NF_FLU:ILLUMINA:VADR_IRMA (L00955952004)                          [100%] 1 of 1, failed: 1 ✘
[-        ] process > NF_FLU:ILLUMINA:VADR_SUMMARIZE_ISSUES_IRMA                        -
[-        ] process > NF_FLU:ILLUMINA:PRE_TABLE2ASN_IRMA                                -
[-        ] process > NF_FLU:ILLUMINA:TABLE2ASN_IRMA                                    -
[-        ] process > NF_FLU:ILLUMINA:POST_TABLE2ASN_IRMA                               -
[14/df54fa] process > NF_FLU:ILLUMINA:BLAST_BLASTN_IRMA (L00955952004)                  [100%] 1 of 1, failed: 1 ✘
[-        ] process > NF_FLU:ILLUMINA:SUBTYPING_REPORT_IRMA_CONSENSUS                   -
[-        ] process > NF_FLU:ILLUMINA:PULL_TOP_REF_ID                                   -
[-        ] process > NF_FLU:ILLUMINA:SEQTK_SEQ                                         -
[-        ] process > NF_FLU:ILLUMINA:MINIMAP2                                          -
[-        ] process > NF_FLU:ILLUMINA:MOSDEPTH_GENOME                                   -
[-        ] process > NF_FLU:ILLUMINA:FREEBAYES                                         -
[-        ] process > NF_FLU:ILLUMINA:BCF_FILTER_FREEBAYES                              -
[-        ] process > NF_FLU:ILLUMINA:VCF_FILTER_FRAMESHIFT                             -
[-        ] process > NF_FLU:ILLUMINA:BCFTOOLS_STATS                                    -
[-        ] process > NF_FLU:ILLUMINA:COVERAGE_PLOT                                     -
[-        ] process > NF_FLU:ILLUMINA:BCF_CONSENSUS                                     -
[-        ] process > NF_FLU:ILLUMINA:CAT_CONSENSUS                                     -
[-        ] process > NF_FLU:ILLUMINA:VADR_BCFTOOLS                                     -
[-        ] process > NF_FLU:ILLUMINA:VADR_SUMMARIZE_ISSUES_BCFTOOLS                    -
[-        ] process > NF_FLU:ILLUMINA:PRE_TABLE2ASN_BCFTOOLS                            -
[-        ] process > NF_FLU:ILLUMINA:TABLE2ASN_BCFTOOLS                                -
[-        ] process > NF_FLU:ILLUMINA:POST_TABLE2ASN_BCFTOOLS                           -
[-        ] process > NF_FLU:ILLUMINA:BLAST_BLASTN_CONSENSUS                            -
[-        ] process > NF_FLU:ILLUMINA:SUBTYPING_REPORT_BCF_CONSENSUS                    -
[-        ] process > NF_FLU:ILLUMINA:MQC_VERSIONS_TABLE                                -
[-        ] process > NF_FLU:ILLUMINA:MULTIQC                         

Failed to fetch subseq: Requested start 1 isn't in the sequence L00955952004_1 at /opt/vadr/Bio-Easel/blib/lib/Bio/Easel/SqFile.pm line 715.
Error executing process > 'NF_FLU:ILLUMINA:VADR_IRMA (L00955952004)'

Caused by:
  Process `NF_FLU:ILLUMINA:VADR_IRMA (L00955952004)` terminated with an error exit status (2)

Command executed:

  v-annotate.pl \
    --mkey flu -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --noseqnamemax \
    --mdir vadr-model \
    L00955952004.irma.consensus.fasta \
    L00955952004

  cat <<-END_VERSIONS > versions.yml
  "NF_FLU:ILLUMINA:VADR_IRMA":
      vadr: $(v-annotate.pl -h | perl -ne 'print "$1\n" if /^# VADR (\d+\.\d+\.\d+)/')
  END_VERSIONS

Command exit status:
  2

Command output:
  # v-annotate.pl :: classify and annotate sequences using a model library
  # VADR 1.6.3 (Dec 2023)
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  # date:              Thu Oct 10 21:45:29 2024
  # $VADRBIOEASELDIR:  /opt/vadr/Bio-Easel
  # $VADRBLASTDIR:     /opt/vadr/ncbi-blast/bin
  # $VADREASELDIR:     /opt/vadr/infernal/binaries
  # $VADRINFERNALDIR:  /opt/vadr/infernal/binaries
  # $VADRMODELDIR:     /opt/vadr/vadr-models
  # $VADRSCRIPTSDIR:   /opt/vadr/vadr
  #
  # sequence file:                                                                  L00955952004.irma.consensus.fasta
  # output directory:                                                               L00955952004
  # only consider ATG a valid start codon:                                          yes [--atgonly]
  # specify that alert codes in <s> cause FAILure:                                  extrant5,extrant3 [--alt_fail]
  # .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  flu [--mkey]
  # model files are in directory <s>, not in $VADRMODELDIR:                         vadr-model [--mdir]
  # in feature table for failed seqs, never change feature type to misc_feature:    yes [--nomisc]
  # turn off composition-based for blastx statistics with -comp_based_stats 0:      yes [--xnocomp]
  # replace stretches of Ns with expected nts, where possible:                      yes [-r]
  # do not enforce a maximum length of 50 for sequence names (GenBank max):         yes [--noseqnamemax]
  # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  # Validating input

Workflow Version

3.5.1 revision: f18f8ce53d [master]

Nextflow Executor

slurm

Nextflow Version

version 22.10.7

Java Version

java version "17.0.8" 2023-07-18 LTS Java(TM) SE Runtime Environment (build 17.0.8+9-LTS-211) Java HotSpot(TM) 64-Bit Server VM (build 17.0.8+9-LTS-211, mixed mode, sharing)

Hardware

cluster

Operating System (OS)

CentOS Linux release 7.9.2009 (Core)

Conda/Container Engine

None

Additional context

nextflow.log

peterk87 commented 3 weeks ago

Hi @EricFournier3, thanks for the bug report! You're totally right that this shouldn't cause the pipeline to error out. The error should be ignored, since the VADR output is optional in most cases.

In the meantime, you might be able to use a custom config to set errorStrategy = 'ignore':

process {
  withName: "VADR_.*" {
    errorStrategy = 'ignore'
  }
}

Hope that helps while the issue is fixed in the workflow.
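
One way to apply the override above, as a rough sketch: save it to its own file and pass that file to Nextflow with `-c`. The file name `vadr_ignore.config` is an assumption chosen for illustration; the pipeline name and `slurm.config` are taken from the command in the original report.

```shell
# Write the override to a separate config file.
# The name vadr_ignore.config is arbitrary (an assumption for this example).
cat > vadr_ignore.config <<'EOF'
process {
  withName: "VADR_.*" {
    errorStrategy = 'ignore'
  }
}
EOF

# Then add it to the run, for example:
#   nextflow run CFIA-NCFAD/nf-flu -c slurm.config -c vadr_ignore.config ...
# If repeated -c options are not accepted by your Nextflow version,
# append the block to the existing slurm.config instead.
```

With this in place, a failing VADR task should be reported as "Error is ignored" rather than stopping the run, although processes that consume the same empty input can still fail.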

EricFournier3 commented 2 weeks ago

Hi @peterk87 , this time the pipeline failed on NF_FLU:ILLUMINA:BLAST_BLASTN_IRMA

BLAST engine error: Warning: Sequence contains no data Warning: Sequence contains no data Warning: Sequence contains no data Warning: Sequence contains no data Warning: Sequence contains no data Warning: Sequence contains no data Warning: Sequence contains no data Warning: Sequence contains no data
[b5/bf4bde] NOTE: Process `NF_FLU:ILLUMINA:VADR_IRMA (L00955952004)` terminated with an error exit status (2) -- Error is ignored
Error executing process > 'NF_FLU:ILLUMINA:BLAST_BLASTN_IRMA (L00955952004)'

Caused by:
  Process `NF_FLU:ILLUMINA:BLAST_BLASTN_IRMA (L00955952004)` terminated with an error exit status (3)

Command executed:

  DB=`find -L ./ -name "*.ndb" | sed 's/.ndb//'`
  blastn \
      -num_threads 8 \
      -db $DB \
      -query L00955952004.irma.consensus.fasta \
      -outfmt "6 qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs stitle" -num_alignments 1000000 -evalue 1e-6 \
      -out L00955952004.blastn.txt
  cat <<-END_VERSIONS > versions.yml
  "NF_FLU:ILLUMINA:BLAST_BLASTN_IRMA":
      blast: $(blastn -version 2>&1 | sed 's/^.*blastn: //; s/ .*$//')
  END_VERSIONS

Is there a way to revert to a previous version without VADR?

I tried with nextflow pull nextflow pull CFIA-NCFAD/nf-flu -r b0d6575b6d

but it didn't work

Checking nextflow ...
WARN: Cannot read project manifest -- Cause: Remote resource not found: https://api.github.com/repos/nextflow-io/nextflow/contents/nextflow.config?ref=b0d6575b6d
Remote resource not found: https://api.github.com/repos/nextflow-io/nextflow/contents/main.nf?ref=b0d6575b6d

nextflow.log

Thanks

EricFournier3 commented 2 weeks ago

[screenshot: Capture9]

I also noticed on the GitHub front page that the revisions are for nf-iav-illumina instead of nf-flu. I am a little confused about this.

peterk87 commented 2 weeks ago

Hi @EricFournier3 is L00955952004_1 an empty sequence? I am trying to reproduce the issue and come up with a fix. If I provide an empty sequence (just a header), e.g.

>empty_seq

I get a similar issue.

$ v-annotate.pl --mkey flu -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --noseqnamemax --mdir vadr-model empty.fa empty
# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.6.4 (Jun 2024)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Wed Oct 16 14:51:05 2024
# $VADRBIOEASELDIR:  /home/pkruczkiewicz/miniforge3/envs/vadr/bin
# $VADRBLASTDIR:     /home/pkruczkiewicz/miniforge3/envs/vadr/bin
# $VADREASELDIR:     /home/pkruczkiewicz/miniforge3/envs/vadr/bin
# $VADRINFERNALDIR:  /home/pkruczkiewicz/miniforge3/envs/vadr/bin
# $VADRMODELDIR:     /home/pkruczkiewicz/miniforge3/envs/vadr/share/vadr-1.6.4/vadr-models
# $VADRSCRIPTSDIR:   /home/pkruczkiewicz/miniforge3/envs/vadr/share/vadr-1.6.4/vadr
#
# sequence file:                                                                  empty.fa
# output directory:                                                               empty
# only consider ATG a valid start codon:                                          yes [--atgonly]
# specify that alert codes in <s> cause FAILure:                                  extrant5,extrant3 [--alt_fail]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  flu [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         vadr-model [--mdir]
# in feature table for failed seqs, never change feature type to misc_feature:    yes [--nomisc]
# turn off composition-based for blastx statistics with -comp_based_stats 0:      yes [--xnocomp]
# replace stretches of Ns with expected nts, where possible:                      yes [-r]
# do not enforce a maximum length of 50 for sequence names (GenBank max):         yes [--noseqnamemax]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Validating input                                                                        ... Failed to fetch subseq: Requested start 1 isn't in the sequence empty at /home/pkruczkiewicz/miniforge3/envs/vadr/lib/perl5/5.32/site_perl/Bio/Easel/SqFile.pm line 713.

IRMA may be producing empty consensus FASTA files and those should be ignored for downstream processing.

I'm working on a patch release to try to address this issue.
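
For anyone who wants to check their own IRMA output for header-only records like the `>empty_seq` example above, here is a minimal sketch. It is not the fix that went into the workflow, just one way to drop empty records before passing a FASTA file to other tools. The function name `drop_empty_fasta_records` is hypothetical.

```shell
# Print only FASTA records that contain at least one sequence character.
# Records that are just a header (optionally followed by blank lines) are dropped.
drop_empty_fasta_records() {
  awk '
    /^>/ {
      # New record: flush the previous one only if it had sequence data.
      if (header != "" && seq != "") printf "%s\n%s", header, seq
      header = $0
      seq = ""
      next
    }
    {
      line = $0
      gsub(/[ \t\r]/, "", line)   # strip whitespace so blank lines do not count
      if (line != "") seq = seq line "\n"
    }
    END {
      if (header != "" && seq != "") printf "%s\n%s", header, seq
    }
  ' "$1"
}
```

Usage would look like `drop_empty_fasta_records sample.irma.consensus.fasta > sample.nonempty.fasta` (file names are placeholders). If the output file is empty, every record in the input was header-only.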


To revert to a previous version, you can specify a release tag from https://github.com/CFIA-NCFAD/nf-flu/releases

e.g. to run nf-flu version 3.3.10, which predates the addition of VADR:

nextflow run CFIA-NCFAD/nf-flu -r 3.3.10 ...

The "official" repo is under the CFIA-NCFAD org at https://github.com/CFIA-NCFAD/nf-flu/

I forked when I should have transferred the repo, and I'm afraid of breaking things now by trying to transfer.

Sorry about the confusion! I'll try to make it more clear in the README.

EricFournier3 commented 2 weeks ago

great, thank you @peterk87 . I will revert to version 3.3.10 in the meantime

peterk87 commented 2 weeks ago

Hi @EricFournier3, this issue should be fixed in 3.5.2

nextflow pull CFIA-NCFAD/nf-flu
nextflow run CFIA-NCFAD/nf-flu -r 3.5.2 \
  -c slurm.config \
  --input $samplesheet.csv \
  -profile singularity,slurm \
  --platform illumina \
  --outdir myoutdir

Let me know if the new release fixes your issue!

EricFournier3 commented 1 week ago

Hi @peterk87 ,

we now have this error with 3.5.2
N E X T F L O W  ~  version 22.10.7
Launching `https://github.com/CFIA-NCFAD/nf-flu` [gloomy_bartik] DSL2 - revision: 10bb2e19cb [3.5.2]
Core Nextflow options
    revision                  : 3.5.2
    runName                   : gloomy_bartik
    containerEngine           : singularity
    launchDir                 : /data/devel/nf-flu
    workDir                   : /data/devel/nf-flu/work
    projectDir                : /data/devel/nf-flu/nextflow-home/.nextflow/assets/CFIA-NCFAD/nf-flu
    userName                  : foueri01@inspq.qc.ca
    profile                   : singularity,slurm
    configFiles               : /data/devel/nf-flu/nextflow-home/.nextflow/assets/CFIA-NCFAD/nf-flu/nextflow.config, /data/devel/nf-flu/scripts/slurm.config

Input/output options
    input                     : /data/devel/nf-flu/samplesheet/samplesheet_run_nextseq_14.csv
    platform                  : illumina
    outdir                    : /data/devel/nf-flu/results/run_nextseq_14

IRMA assembly options
    keep_ref_deletions        : true
    skip_irma_subtyping_report: true

Annotation options
    vadr_model_targz          : https://ftp.ncbi.nlm.nih.gov/pub/nawrocki/vadr-models/flu/1.6.3-2/vadr-models-flu-1.6.3-2.tar.gz

Max job request options
    max_memory                : 32 GB

[Only displaying parameters that differ from pipeline default]
------------------------------------------------------

------------------------------------------------------
executor >  slurm (13), local (1)
[c2/3b1e9d] process > NF_FLU:ILLUMINA:CHECK_SAMPLE_SHEET (1)                            [100%] 1 of 1 ✔
[-        ] process > NF_FLU:ILLUMINA:READ_COUNT_FAIL_TSV                               -
[ac/7ca4db] process > NF_FLU:ILLUMINA:READ_COUNT_PASS_TSV                               [100%] 1 of 1 ✔
[42/be1b92] process > NF_FLU:ILLUMINA:ZSTD_DECOMPRESS_FASTA                             [100%] 1 of 1 ✔
[75/ccc80b] process > NF_FLU:ILLUMINA:ZSTD_DECOMPRESS_CSV                               [100%] 1 of 1 ✔
[05/565036] process > NF_FLU:ILLUMINA:BLAST_MAKEBLASTDB_NCBI (41415330-influenza.fasta) [100%] 1 of 1 ✔
[1d/ba4bca] process > NF_FLU:ILLUMINA:SETUP_FLU_VADR_MODEL                              [100%] 1 of 1 ✔
[1d/c5914d] process > NF_FLU:ILLUMINA:CAT_ILLUMINA_FASTQ (L00955952004)                 [100%] 1 of 1 ✔
[65/c0d7d6] process > NF_FLU:ILLUMINA:IRMA (L00955952004)                               [100%] 1 of 1 ✔
[b9/90b024] process > NF_FLU:ILLUMINA:VADR_IRMA (L00955952004)                          [100%] 1 of 1, failed: 1 ✔
[-        ] process > NF_FLU:ILLUMINA:VADR_SUMMARIZE_ISSUES_IRMA                        -
[-        ] process > NF_FLU:ILLUMINA:PRE_TABLE2ASN_IRMA                                -
[-        ] process > NF_FLU:ILLUMINA:TABLE2ASN_IRMA                                    -
[-        ] process > NF_FLU:ILLUMINA:POST_TABLE2ASN_IRMA                               -
[0a/9ac84e] process > NF_FLU:ILLUMINA:BLAST_BLASTN_IRMA (L00955952004)                  [100%] 1 of 1 ✔
[67/e16ac4] process > NF_FLU:ILLUMINA:SUBTYPING_REPORT_IRMA_CONSENSUS (1)               [ 50%] 1 of 2, failed: 1, retries: 1
[5e/bfd8a0] process > NF_FLU:ILLUMINA:PULL_TOP_REF_ID (L00955952004)                    [ 50%] 1 of 2, failed: 1, retries: 1
[-        ] process > NF_FLU:ILLUMINA:SEQTK_SEQ                                         -
[-        ] process > NF_FLU:ILLUMINA:MINIMAP2                                          -
[-        ] process > NF_FLU:ILLUMINA:MOSDEPTH_GENOME                                   -
[-        ] process > NF_FLU:ILLUMINA:FREEBAYES                                         -
[-        ] process > NF_FLU:ILLUMINA:BCF_FILTER_FREEBAYES                              -
[-        ] process > NF_FLU:ILLUMINA:VCF_FILTER_FRAMESHIFT                             -
[-        ] process > NF_FLU:ILLUMINA:BCFTOOLS_STATS                                    -
[-        ] process > NF_FLU:ILLUMINA:COVERAGE_PLOT                                     -
[-        ] process > NF_FLU:ILLUMINA:BCF_CONSENSUS                                     -
[-        ] process > NF_FLU:ILLUMINA:CAT_CONSENSUS                                     -
[-        ] process > NF_FLU:ILLUMINA:VADR_BCFTOOLS                                     -
[-        ] process > NF_FLU:ILLUMINA:VADR_SUMMARIZE_ISSUES_BCFTOOLS                    -
[-        ] process > NF_FLU:ILLUMINA:PRE_TABLE2ASN_BCFTOOLS                            -
[-        ] process > NF_FLU:ILLUMINA:TABLE2ASN_BCFTOOLS                                -
[-        ] process > NF_FLU:ILLUMINA:POST_TABLE2ASN_BCFTOOLS                           -
[-        ] process > NF_FLU:ILLUMINA:BLAST_BLASTN_CONSENSUS                            -
[-        ] process > NF_FLU:ILLUMINA:SUBTYPING_REPORT_BCF_CONSENSUS                    -
[-        ] process > NF_FLU:ILLUMINA:MQC_VERSIONS_TABLE                                -
[-        ] process > NF_FLU:ILLUMINA:MULTIQC                                           -
[b9/90b024] NOTE: Process `NF_FLU:ILLUMINA:VADR_IRMA (L00955952004)` terminated with an error exit status (1) -- Error is ignored
[6c/b7fa83] NOTE: Process `NF_FLU:ILLUMINA:PULL_TOP_REF_ID (L00955952004)` terminated with an error exit status (1) -- Execution is retried (1)
[cc/9b43a8] NOTE: Process `NF_FLU:ILLUMINA:SUBTYPING_REPORT_IRMA_CONSENSUS (1)` terminated with an error exit status (1) -- Execution is retried (1)
Error executing process > 'NF_FLU:ILLUMINA:PULL_TOP_REF_ID (L00955952004)'

Caused by:
  Process `NF_FLU:ILLUMINA:PULL_TOP_REF_ID (L00955952004)` terminated with an error exit status (1)

Command executed:

  parse_influenza_blast_results.py \
    --flu-metadata 41415333-influenza.csv \
    --get-top-ref True \
    --top 1 \
    --pident-threshold 0.85 \
    --sample-name L00955952004 \
    L00955952004.blastn.txt

  cat <<-END_VERSIONS > versions.yml
  "NF_FLU:ILLUMINA:PULL_TOP_REF_ID":
     python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

.nextflow.log

peterk87 commented 1 week ago

That's really strange. Are the IRMA consensus sequences completely empty? Do they look anomalous to you?

Would it be possible to share the input files in some way so I could do some more in-depth debugging?

EricFournier3 commented 1 week ago

I sent you a OneDrive link at Peter.Kruczkiewicz@inspection.gc.ca

peterk87 commented 5 days ago

Hi @EricFournier3 would you be able to try sending the link again? Your email might have been blocked by spam filters.

EricFournier3 commented 5 days ago

Hi @peterk87, yes, I just sent it to you again.

[screenshot: Capture10]

peterk87 commented 4 days ago

Hi @EricFournier3 thanks for sending over the Illumina reads. The issue should be fixed in 3.5.3. You can try it out with

nextflow pull CFIA-NCFAD/nf-flu -r master
nextflow run CFIA-NCFAD/nf-flu -r 3.5.3 ....