nf-core / viralrecon

Assembly and intrahost/low-frequency variant calling for viral samples
https://nf-co.re/viralrecon
MIT License

Error at the PlasmidID step : gff_to_bed.sh #238

Closed stevin-wilson closed 2 years ago

stevin-wilson commented 2 years ago

Hi

I am trying to run the viralrecon pipeline with assembly enabled, and have been encountering an error at the PlasmidID step. It would be very helpful if I could get a solution to the error, which is described in detail below. Thank you!

viralrecon was run using the following command:

nextflow run nf-core/viralrecon \
    --input samplesheet.csv \
    --platform illumina \
    --protocol amplicon \
    --genome 'MN908947.3' \
    --primer_set artic \
    --primer_set_version 3 \
    --min_mapped_reads 15000 \
    --kraken2_assembly_host_filter \
    --blast_db betacoronavirus_BLAST \
    -profile singularity

The contents of the nextflow.config file in the working directory are as follows:

params {
    // Max resource options
    // Defaults only, expecting to be overwritten
    max_memory                   = '764.GB'
    max_cpus                     = 36
    max_time                     = '335.h'
}

I get the following error message (seems to be from prokka)


  Command wrapper:

  CHECKING DEPENDENCIES AND MANDATORY FILES

  DEPENDENCY                  STATUS
  ----------                  ------
  blastn                       INSTALLED 
  prokka                       INSTALLED 
  circos                       INSTALLED 
  mash                         INSTALLED 
  gawk                         INSTALLED 

  SCREENING READS WITH KMERS (Tue Oct 26 10:45:30 EDT 2021)
   Reads will be screened against database supplied for further filtering and mapping,
   this will reduce the input sequences to map against SAMPLE_IDENTIFIER

  CLUSTERING SEQUENCES BY KMER DISTANCE (Tue Oct 26 10:45:30 EDT 2021)
   Sequences obtained after screen will be clustered to reduce redundancy,
   one representative, the largest, will be considered for further analysis SAMPLE_IDENTIFIER

  -------------------------
  #Pipeline reconstruction#
  -------------------------
  Contigs                                SAMPLE_IDENTIFIER.scaffolds.fa
  Will be aligned to                     database.filtered_0.80_term.0.5.representative.fasta
  That contains                          1 plasmids
  And each contig aligned more than      20 %
  and have at least                      60 % identity
  Will be represented and annotated       

  ANNOTATING CONTIGS (Tue Oct 26 10:45:33 EDT 2021)
   A file including all automatic annotations on contigs will be generated.

  ALIGNING CONTIGS TO FILTERED PLASMIDS (Tue Oct 26 10:45:37 EDT 2021)
   Contigs are aligned to filtered plasmids and those are selected by alignment identity and alignment percentage in order to create links, full length and annotation tracks

  ---------------------------------------

  ERROR in Script plasmidID on or near line 868; exiting with status 1
  MESSAGE:

  See ./logs/plasmidID.log for more information.
  command:
  gff_to_bed.sh -i ./NO_GROUP/SAMPLE_IDENTIFIER/data/SAMPLE_IDENTIFIER".gff" -L

  ---------------------------------------

Work dir:
  /2021_10_25_run/work/1b/34f24b04239cd078bfd503161f3f62

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

The contents of the log file at /2021_10_25_run/work/1b/34f24b04239cd078bfd503161f3f62/NO_GROUP/SAMPLE_IDENTIFIER/logs/plasmidID.log are as follows:

LOG FILE PLASMIDID
Tue Oct 26 10:45:30 EDT 2021

#Executing /usr/local/bin/mash_screener.sh 

DEPENDENCY                STATUS
----------                ------
bash                         INSTALLED 
mash                         INSTALLED 
Output directory is ./NO_GROUP/SAMPLE_IDENTIFIER/kmer
creating sketch of  nCoV-2019.reference.fasta
Sketching nCoV-2019.reference.fasta...
Writing to ./NO_GROUP/SAMPLE_IDENTIFIER/kmer/database.msh...
Tue Oct 26 10:45:30 EDT 2021
screening SAMPLE_IDENTIFIER.scaffolds.fa
Loading ./NO_GROUP/SAMPLE_IDENTIFIER/kmer/database.msh...
   1000 distinct hashes.
Streaming from SAMPLE_IDENTIFIER.scaffolds.fa...
   Estimated distinct k-mers in mixture: 31081
Summing shared...
Reallocating to winners...
Computing coverage medians...
Writing output...
Tue Oct 26 10:45:30 EDT 2021
DONE Screening SAMPLE_IDENTIFIER of NO_GROUP Group 

Retrieving sequences matching more than 0.80 identity

#Executing /usr/local/bin/filter_fasta.sh 

Output directory is ./NO_GROUP/SAMPLE_IDENTIFIER/kmer
Tue Oct 26 10:45:30 EDT 2021
Filtering terms on file nCoV-2019.reference.fasta
Tue Oct 26 10:45:30 EDT 2021
DONE Filtering terms on file nCoV-2019.reference.fasta
File with filtered sequences can be found in ./NO_GROUP/SAMPLE_IDENTIFIER/kmer/database.filtered_0.80_term.fasta
Previous number of sequences= 1
Post number of sequences= 1

Namespace(distance=0.5, input_file='./NO_GROUP/SAMPLE_IDENTIFIER/kmer/database.filtered_0.80_term.fasta', output=False, output_grouped=False)
Obtaining mash distance
Obtaining cluster from distance
Calculating length
Filtering representative fasta
1 sequences clustered into 1
DONE

#Executing /usr/local/bin/calculate_seqlen.sh 

Tue Oct 26 10:45:33 EDT 2021
Done seqlen calculation
Files can be found at ./NO_GROUP/SAMPLE_IDENTIFIER/data

#Executing /usr/local/bin/build_karyotype.sh 

(standard_in) 1: syntax error
(standard_in) 1: syntax error
FILE NAME SAMPLE_IDENTIFIER
Tue Oct 26 10:45:33 EDT 2021
Obtain list of cromosomes (idiogram) for CIRCOS karyotype file
Generating summary karyotype file with plasmids that mapped more than %
Tue Oct 26 10:45:33 EDT 2021
Done Obtain list of cromosomes (idiogram) for CIRCOS karyotype file
Files can be found at ./NO_GROUP/SAMPLE_IDENTIFIER/data
1 sequences will be displayed on summary image
1 images will be created individually 

#Executing /usr/local/bin/prokka_annotation.sh 

DEPENDENCY                STATUS
----------                ------
prokka                       INSTALLED 
PREFIX SAMPLE_IDENTIFIER
Output directory is ./NO_GROUP/SAMPLE_IDENTIFIER/data
Tue Oct 26 10:45:33 EDT 2021
Annotating SAMPLE_IDENTIFIER.scaffolds.fa with prokka
[10:45:33] This is prokka 1.14.6
[10:45:33] Written by Torsten Seemann <torsten.seemann@gmail.com>
[10:45:33] Homepage is https://github.com/tseemann/prokka
[10:45:33] Local time is Tue Oct 26 10:45:33 2021
[10:45:33] You are not telling me who you are!
[10:45:33] Operating system is linux
[10:45:33] You have BioPerl 1.007002
[10:45:33] System has 80 cores.
[10:45:33] Will use maximum of 1 cores.
[10:45:33] Annotating as >>> Bacteria <<<
[10:45:33] Re-using existing --outdir ./NO_GROUP/SAMPLE_IDENTIFIER/data
[10:45:33] Using filename prefix: SAMPLE_IDENTIFIER.XXX
[10:45:33] Setting HMMER_NCPU=1
[10:45:33] Writing log to: ./NO_GROUP/SAMPLE_IDENTIFIER/data/SAMPLE_IDENTIFIER.log
[10:45:33] Command: /usr/local/bin/prokka --force --outdir ./NO_GROUP/SAMPLE_IDENTIFIER/data --prefix SAMPLE_IDENTIFIER --addgenes --kingdom Bacteria --genus --species --usegenus --centre BU-ISCIII --locustag SAMPLE_IDENTIFIER --addgenes --cpus 1 SAMPLE_IDENTIFIER.scaffolds.fa
[10:45:33] Appending to PATH: /usr/local/bin
[10:45:33] Looking for 'aragorn' - found /usr/local/bin/aragorn
[10:45:33] Determined aragorn version is 001002 from 'ARAGORN v1.2.38 Dean Laslett'
[10:45:33] Looking for 'barrnap' - found /usr/local/bin/barrnap
[10:45:33] Determined barrnap version is 000009 from 'barrnap 0.9'
[10:45:33] Looking for 'blastp' - found /usr/local/bin/blastp
[10:45:35] Determined blastp version is 002011 from 'blastp: 2.11.0+'
[10:45:35] Looking for 'cmpress' - found /usr/local/bin/cmpress
[10:45:35] Determined cmpress version is 001001 from '# INFERNAL 1.1.4 (Dec 2020)'
[10:45:35] Looking for 'cmscan' - found /usr/local/bin/cmscan
[10:45:35] Determined cmscan version is 001001 from '# INFERNAL 1.1.4 (Dec 2020)'
[10:45:35] Looking for 'egrep' - found /bin/egrep
[10:45:35] Looking for 'find' - found /usr/bin/find
[10:45:35] Looking for 'grep' - found /bin/grep
[10:45:35] Looking for 'hmmpress' - found /usr/local/bin/hmmpress
[10:45:35] Determined hmmpress version is 003003 from '# HMMER 3.3.2 (Nov 2020); http://hmmer.org/'
[10:45:35] Looking for 'hmmscan' - found /usr/local/bin/hmmscan
[10:45:35] Determined hmmscan version is 003003 from '# HMMER 3.3.2 (Nov 2020); http://hmmer.org/'
[10:45:35] Looking for 'java' - found /usr/local/bin/java
[10:45:35] Looking for 'makeblastdb' - found /usr/local/bin/makeblastdb
[10:45:35] Determined makeblastdb version is 002011 from 'makeblastdb: 2.11.0+'
[10:45:35] Looking for 'minced' - found /usr/local/bin/minced
[10:45:36] Determined minced version is 004002 from 'minced 0.4.2'
[10:45:36] Looking for 'parallel' - found /usr/local/bin/parallel
[10:45:36] Determined parallel version is 20210222 from 'GNU parallel 20210222'
[10:45:36] Looking for 'prodigal' - found /usr/local/bin/prodigal
[10:45:36] Determined prodigal version is 002006 from 'Prodigal V2.6.3: February, 2016'
[10:45:36] Looking for 'prokka-genbank_to_fasta_db' - found /usr/local/bin/prokka-genbank_to_fasta_db
[10:45:36] Looking for 'sed' - found /bin/sed
[10:45:36] Looking for 'tbl2asn' - found /usr/local/bin/tbl2asn
[10:45:37] Determined tbl2asn version is 025007 from 'tbl2asn 25.7   arguments:'
[10:45:37] Using genetic code table 11.
[10:45:37] Loading and checking input file: SAMPLE_IDENTIFIER.scaffolds.fa
[10:45:37] Wrote 9 contigs totalling 30319 bp.
[10:45:37] Predicting tRNAs and tmRNAs
[10:45:37] Running: aragorn -l -gc11  -w \.\/NO_GROUP\/SAMPLE_IDENTIFIER\/data\/SAMPLE_IDENTIFIER\.fna
[10:45:37] Found 0 tRNAs
[10:45:37] Predicting Ribosomal RNAs
[10:45:37] Running Barrnap with 1 threads
[10:45:37] Found 0 rRNAs
[10:45:37] Skipping ncRNA search, enable with --rfam if desired.
[10:45:37] Total of 0 tRNA + rRNA features
[10:45:37] Searching for CRISPR repeats
[10:45:37] Found 0 CRISPRs
[10:45:37] Predicting coding sequences
[10:45:37] Contigs total 30319 bp, so using meta mode
[10:45:37] Running: prodigal -i \.\/NO_GROUP\/SAMPLE_IDENTIFIER\/data\/SAMPLE_IDENTIFIER\.fna -c -m -g 11 -p meta -f sco -q
[10:45:37] Found 12 CDS
[10:45:37] Connecting features back to sequences
[10:45:37] Skipping genus-specific proteins as can't see /usr/local/db/--species
[10:45:37] Annotating CDS, please be patient.
[10:45:37] Will use 1 CPUs for similarity searching.
[10:45:37] There are still 12 unannotated CDS left (started with 12)
[10:45:37] Will use blast to search against /usr/local/db/kingdom/Bacteria/IS with 1 CPUs
[10:45:37] Running: cat \.\/NO_GROUP\/SAMPLE_IDENTIFIER\/data\/SAMPLE_IDENTIFIER\.IS\.tmp\.2470808\.faa | parallel --gnu --plain -j 1 --block 3028 --recstart '>' --pipe blastp -query - -db /usr/local/db/kingdom/Bacteria/IS -evalue 1e-30 -qcov_hsp_perc 90 -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no > \.\/NO_GROUP\/SAMPLE_IDENTIFIER\/data\/SAMPLE_IDENTIFIER\.IS\.tmp\.2470808\.blast 2> /dev/null
[10:45:37] Could not run command: cat \.\/NO_GROUP\/SAMPLE_IDENTIFIER\/data\/SAMPLE_IDENTIFIER\.IS\.tmp\.2470808\.faa | parallel --gnu --plain -j 1 --block 3028 --recstart '>' --pipe blastp -query - -db /usr/local/db/kingdom/Bacteria/IS -evalue 1e-30 -qcov_hsp_perc 90 -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no > \.\/NO_GROUP\/SAMPLE_IDENTIFIER\/data\/SAMPLE_IDENTIFIER\.IS\.tmp\.2470808\.blast 2> /dev/null
Tue Oct 26 10:45:37 EDT 2021
done annotating SAMPLE_IDENTIFIER.scaffolds.fa with prokka
Removing unwanted files

#Executing /usr/local/bin/blast_align.sh 

query type selected as nucl
Output directory is ./NO_GROUP/SAMPLE_IDENTIFIER/data
filename is SAMPLE_IDENTIFIER
Tue Oct 26 10:45:37 EDT 2021
Blasting SAMPLE_IDENTIFIER agaist database.filtered_0.80_term.0.5.representative.fasta

Building a new DB, current time: 10/26/2021 10:45:37
New DB name:   /2021_10_25_run/work/1b/34f24b04239cd078bfd503161f3f62/NO_GROUP/SAMPLE_IDENTIFIER/kmer/database.filtered_0.80_term.0.5.representative.fasta.blast.tmp
New DB title:  ./NO_GROUP/SAMPLE_IDENTIFIER/kmer/database.filtered_0.80_term.0.5.representative.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.000415802 seconds.

BLAST command is blastn
Tue Oct 26 10:45:38 EDT 2021
Done blasting SAMPLE_IDENTIFIER agaist database.filtered_0.80_term.0.5.representative.fasta
blasted file can be found in ./NO_GROUP/SAMPLE_IDENTIFIER/data/SAMPLE_IDENTIFIER.plasmids.blast 

#Executing /usr/local/bin/blast_to_bed.sh 

Tue Oct 26 10:45:38 EDT 2021
Adapting blast to bed using SAMPLE_IDENTIFIER.plasmids.blast with:
Blast identity= 60
Min length aligned= 500
Min len percentage= 0
database_delimiter= -
database_field)= length(database_name)
query_delimiter= _
query_field= length(query_name)
Tue Oct 26 10:45:38 EDT 2021
DONE adapting blast to bed
File can be found at ./NO_GROUP/SAMPLE_IDENTIFIER/data/SAMPLE_IDENTIFIER.plasmids.bed 

#Executing /usr/local/bin/blast_to_complete.sh 

Tue Oct 26 10:45:38 EDT 2021
Adapting blast to complete using SAMPLE_IDENTIFIER.plasmids.blast with:
Blast identity= 60
Min len percentage= 20
Tue Oct 26 10:45:38 EDT 2021
DONE adapting blast to complete
File can be found at ./NO_GROUP/SAMPLE_IDENTIFIER/data/SAMPLE_IDENTIFIER.plasmids.complete /n

#Executing /usr/local/bin/blast_to_link.sh 

Tue Oct 26 10:45:38 EDT 2021
Adapting blast to links using SAMPLE_IDENTIFIER.plasmids.blast with:
Blast identity= 60
Min len percentage= 20
Tue Oct 26 10:45:38 EDT 2021
DONE adapting blast to link
File can be found at ./NO_GROUP/SAMPLE_IDENTIFIER/data/SAMPLE_IDENTIFIER.plasmids.links 

#Executing /usr/local/bin/gff_to_bed.sh 

SAMPLE_IDENTIFIER.gff not supplied, please, introduce a valid file
ERROR: 1 missing files, aborting execution
Tue Oct 26 10:45:38 EDT 2021
Getting bed file from GFF in SAMPLE_IDENTIFIER.gff
awk: cmd. line:2: fatal: cannot open file `./NO_GROUP/SAMPLE_IDENTIFIER/data/SAMPLE_IDENTIFIER.gff' for reading: No such file or directory

---------------------------------------

ERROR in Script gff_to_bed.sh on or near line 213; exiting with status 1
MESSAGE:

Awk command in SAMPLE_IDENTIFIER.gff".reverse.bed" creation. See ./NO_GROUP/SAMPLE_IDENTIFIER/data/logs for more information.

---------------------------------------

Check Documentation

I have checked the following places for your error:

Log files

Have you provided the following extra information/files:

System

Nextflow Installation

Container engine

drpatelh commented 2 years ago

Hi @stevin-wilson ! The simplest solution to bypass this error is to add the --skip_plasmidid parameter. Pinging @saramonzon who is one of the developers of the tool.
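
For reference, the workaround amounts to appending `--skip_plasmidid` to the command from the original report. The sketch below assembles that command as a bash array and prints it for review before launching. Every option except `--skip_plasmidid` is copied from the report; the array name `cmd` is only illustrative.

```shell
#!/usr/bin/env bash
# Same invocation as in the original report, plus --skip_plasmidid so the
# assembly steps run without PlasmidID. The array name "cmd" is illustrative.
cmd=(
    nextflow run nf-core/viralrecon
    --input samplesheet.csv
    --platform illumina
    --protocol amplicon
    --genome 'MN908947.3'
    --primer_set artic
    --primer_set_version 3
    --min_mapped_reads 15000
    --kraken2_assembly_host_filter
    --blast_db betacoronavirus_BLAST
    --skip_plasmidid
    -profile singularity
)

# Print the full command, shell-quoted, so it can be checked first.
printf '%q ' "${cmd[@]}"
echo

# Uncomment to actually launch the pipeline:
# "${cmd[@]}"
```

This only bypasses PlasmidID; it does not address why the step fails.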

stevin-wilson commented 2 years ago

Thank you @drpatelh 👍 . The pipeline ran successfully with --skip_plasmidid enabled. However, it would be very helpful if I could get PlasmidID to work in the pipeline.

saramonzon commented 2 years ago

Hi @stevin-wilson , sorry for the delay! It seems like the annotation step with prokka is not outputting the needed files. Could you send me the SAMPLE_IDENTIFIER.scaffolds.fa file so I can try to reproduce the error? Thank you very much!
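
One way to confirm this from the process work directory is to check whether the `.gff` file that `gff_to_bed.sh` expects exists, and to pull the failed prokka command out of `plasmidID.log`. This is a rough sketch: the function name `check_prokka_gff` is hypothetical, and the directory layout (`<workdir>/NO_GROUP/<sample>/data` and `.../logs`) is taken from the paths shown in the log above.

```shell
#!/usr/bin/env bash
# check_prokka_gff WORKDIR SAMPLE [GROUP]
# Hypothetical helper: reports whether the annotation file that
# gff_to_bed.sh expects is present, and if not, prints the first prokka
# "Could not run command" line found in plasmidID.log.
check_prokka_gff() {
    local workdir=$1 sample=$2 group=${3:-NO_GROUP}
    local base="$workdir/$group/$sample"
    local gff="$base/data/$sample.gff"
    local log="$base/logs/plasmidID.log"

    # -s is true only if the file exists and is not empty.
    if [ -s "$gff" ]; then
        echo "OK: $gff exists and is not empty"
        return 0
    fi

    echo "MISSING: $gff"
    if [ -f "$log" ]; then
        # Show the first failed command with its line number, truncated
        # to 200 characters for readability.
        grep -m 1 -n 'Could not run command' "$log" | cut -c 1-200
    fi
    return 1
}
```

With the work directory from the report, the call would be `check_prokka_gff /2021_10_25_run/work/1b/34f24b04239cd078bfd503161f3f62 SAMPLE_IDENTIFIER`.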

saramonzon commented 2 years ago

I've tried to reproduce the error, but it seems to work fine for me when running plasmidID independently. I've sent the results to @stevin-wilson by email and suggested running everything again in case it was a one-time problem with prokka. If not, we'll keep looking :)

stevin-wilson commented 2 years ago

Thank you, @saramonzon. Very much appreciate your help. I will rerun the pipeline with plasmidID enabled.

drpatelh commented 2 years ago

Hi @stevin-wilson did you manage to get everything working so we can close this issue?

stevin-wilson commented 2 years ago

Hi @drpatelh, I am still getting an error message while running plasmidID.

ERROR in Script plasmidID on or near line 868; exiting with status 1
  MESSAGE:

  See ./logs/plasmidID.log for more information.
  command:
  gff_to_bed.sh -i ./NO_GROUP/sample_id/data/sample_id".gff" -

drpatelh commented 2 years ago

Thanks @stevin-wilson ! Let's pull @saramonzon back in here then.

saramonzon commented 2 years ago

This is weird! I would need the raw fastq files to try to reproduce the error. Could you send them to me, or is there any problem with sharing those? Sorry for the inconvenience!!

saramonzon commented 2 years ago

Hi @stevin-wilson , I've tried to run viralrecon with your sample, but it is a sample with only 17k reads and spades does not output the scaffolds.fa file, so viralrecon does not run plasmidID. Is this the same sample that produces the error in viralrecon? Can you send me the viralrecon log file so we can see which sample you are getting the error with? I would need the scaffolds.fasta and fastq files from the sample that gives the error; the sample you sent me is skipped from plasmidID processing. Unless we are doing something different :S
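
A quick way to see which samples never reach PlasmidID is to check, for each sample, whether a scaffolds file exists and contains at least one sequence. The sketch below assumes uncompressed files named `<sample>.scaffolds.fa` (the naming seen in the log above) collected in a single directory that you pass in. The function name `missing_scaffolds` and that directory are hypothetical, not part of viralrecon.

```shell
#!/usr/bin/env bash
# missing_scaffolds DIR SAMPLE [SAMPLE...]
# Hypothetical helper: prints the name of every sample whose
# DIR/<sample>.scaffolds.fa is absent or holds no FASTA record.
missing_scaffolds() {
    local dir=$1
    shift
    local sample fa
    for sample in "$@"; do
        fa="$dir/$sample.scaffolds.fa"
        # grep -q '^>' succeeds only if at least one FASTA header exists.
        if [ ! -f "$fa" ] || ! grep -q '^>' "$fa"; then
            echo "$sample"
        fi
    done
}
```

Any sample it lists produced no usable assembly, so an error reported by PlasmidID must come from one of the other samples.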

stevin-wilson commented 2 years ago

Hi @saramonzon , Thank you so much for your time and help (and sorry for the late response; missed the notification). I was unaware of scaffolds.fa not being made for that sample.