nf-core / pathogensurveillance

Surveillance of pathogens using population genomics and sequencing
https://nf-co.re/pathogensurveillance
MIT License

When only SRA accessions are provided in the metadata input file, what happens if both SE and PE FASTQ reads are downloaded for a particular accession? #65

Closed masudermann closed 1 month ago

masudermann commented 2 months ago

Description of the bug

The pipeline may already be able to handle this, but what happens if both SE and PE reads are available for a particular strain? In Camilo's mixed euk dataset (mixed.csv), the SRA accession SRR10432277 fits this scenario.

I am getting an error when the pipeline reaches the read alignment step. It tries to use both the PE and SE reads as inputs, and I think this is what causes the failure.

Command used and terminal output

# An example of the error I see:

[83/984875] NOTE: Process `PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:BWA_MEM (GCA_031834405_1_PHW726_fox_matthiolae)` terminated with an error exit status (1) -- Execution is retried (1)
ERROR ~ Error executing process > 'PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:BWA_MEM (GCA_031834405_1_PHW726_fox_matthiolae)'

Caused by:
  Process `PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:BWA_MEM (GCA_031834405_1_PHW726_fox_matthiolae)` terminated with an error exit status (1)

Command executed:

  INDEX=`find -L ./ -name "*.amb" | sed 's/\.amb$//'`

  bwa mem \
      -M \
      -t 16 \
      $INDEX \
      SRR10432277_1_subset.fastq.gz SRR10432277_2_subset.fastq.gz SRR10432277_subset.fastq.gz \
      | samtools view  --threads 16 -o GCA_031834405_1_PHW726_fox_matthiolae.bam -

  cat <<-END_VERSIONS > versions.yml
  "PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:BWA_MEM":
      bwa: $(echo $(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*$//')
      samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
         -y INT        seed occurrence for the 3rd round seeding [20]
         -c INT        skip seeds with more than INT occurrences [500]
         -D FLOAT      drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
         -W INT        discard a chain if seeded bases shorter than INT [0]
         -m INT        perform at most INT rounds of mate rescues for each read [50]
         -S            skip mate rescue
         -P            skip pairing; mate rescue performed unless -S also in use

  Scoring options:

         -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
         -B INT        penalty for a mismatch [4]
         -O INT[,INT]  gap open penalties for deletions and insertions [6,6]
         -E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
         -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5]
         -U INT        penalty for an unpaired read pair [17]

         -x STR        read type. Setting -x changes multiple parameters unless overridden [null]
                       pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0  (PacBio reads to ref)
                       ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0  (Oxford Nanopore 2D-reads to ref)
                       intractg: -B9 -O16 -L5  (intra-species contigs to ref)

  Input/output options:

         -p            smart pairing (ignoring in2.fq)
         -R STR        read group header line such as '@RG\tID:foo\tSM:bar' [null]
         -H STR/FILE   insert STR to header if it starts with @; or insert lines in FILE [null]
         -o FILE       sam file to output results to [stdout]
         -j            treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
         -5            for split alignment, take the alignment with the smallest coordinate as primary
         -q            don't modify mapQ of supplementary alignments
         -K INT        process INT input bases in each batch regardless of nThreads (for reproducibility) []

         -v INT        verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [3]
         -T INT        minimum score to output [30]
         -h INT[,INT]  if there are <INT hits with score >80% of the max score, output all in XA [5,200]
         -a            output all alignments for SE or unpaired PE
         -C            append FASTA/FASTQ comment to SAM output
         -V            output the reference FASTA header in the XR tag
         -Y            use soft clipping for supplementary alignments
         -M            mark shorter split hits as secondary

         -I FLOAT[,FLOAT[,INT[,INT]]]
                       specify the mean, standard deviation (10% of the mean if absent), max
                       (4 sigma from the mean if absent) and min of the insert size distribution.
                       FR orientation only. [inferred]

  Note: Please read the man page for detailed description of the command line and options.

  [main_samview] fail to read the header from "-".

Work dir:
  /home/marthasudermann/pathogensurveillance/work/a0/44cd71b6fd13a59faa6d3786d5301e

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

Relevant files

No response

System information

No response

zachary-foster commented 1 month ago

I ran into this error as well and fixed it by using only the paired-end reads when this happens.
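
For context, `bwa mem` takes at most two read files (`<in1.fq> [in2.fq]`), so the three files in the command above make it print its usage text and exit, which is why `samtools view` then fails to read a header. A minimal shell sketch of the "prefer paired-end" fallback is shown below. It is not the pipeline's actual code: the `select_reads` helper is hypothetical, and it assumes bare file names that follow the pattern seen in the log (`_1`/`_2` for paired-end mates, neither suffix for single-end reads).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of nf-core/pathogensurveillance.
# Given the FASTQ files downloaded for one accession, print the files to align:
# both mates if a complete pair exists, otherwise the single-end file.
# Assumes bare file names such as SRR10432277_1_subset.fastq.gz (mate 1),
# SRR10432277_2_subset.fastq.gz (mate 2), SRR10432277_subset.fastq.gz (single-end).
select_reads() {
    local r1="" r2="" se="" f
    for f in "$@"; do
        case "$f" in
            *_1.fastq.gz|*_1_*.fastq.gz) r1="$f" ;;
            *_2.fastq.gz|*_2_*.fastq.gz) r2="$f" ;;
            *) se="$f" ;;
        esac
    done
    if [ -n "$r1" ] && [ -n "$r2" ]; then
        # Complete pair found: drop the single-end file.
        echo "$r1 $r2"
    elif [ -n "$se" ]; then
        # No complete pair: fall back to the single-end file.
        echo "$se"
    else
        # Nothing usable was passed in.
        return 1
    fi
}

# Example (prints the two paired-end files only):
select_reads SRR10432277_1_subset.fastq.gz \
             SRR10432277_2_subset.fastq.gz \
             SRR10432277_subset.fastq.gz
```

The output could then be spliced into the alignment command with an unquoted `$(select_reads ...)`, relying on word splitting, so this only works for file names without spaces.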