nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License
868 stars 695 forks source link

Error during NFCORE_RNASEQ:RNASEQ:FASTQ_SUBSAMPLE_FQ_SALMON:SALMON_QUANT #1104

Closed alexmascension closed 10 months ago

alexmascension commented 10 months ago

Description of the bug

I am running the nfcore/rnaseq pipeline with human samples to do a next step in the recognition of bacterial reads. In this analysis I include several human samples, and two control samples with only bacterial reads. The aim of this step is (in part) to use the unmapped reads for further processing. The problem comes with the controls with bacterial reads that, unexpectedly, have only a few human reads, and I think this step conflicts with quantification by salmon. The error I receive is the following:

ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:FASTQ_SUBSAMPLE_FQ_SALMON:SALMON_QUANT (BLACTIS)'

Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:FASTQ_SUBSAMPLE_FQ_SALMON:SALMON_QUANT (BLACTIS)` terminated with an error exit status (1)

Command executed:

  salmon quant \
      --geneMap Homo_sapiens.GRCh38.109.gtf \
      --threads 6 \
      --libType=A \
      --index salmon \
      -1 BLACTIS.subsampled_R1.fastq.gz -2 BLACTIS.subsampled_R2.fastq.gz \
      --skipQuant \
      -o BLACTIS

  if [ -f BLACTIS/aux_info/meta_info.json ]; then
      cp BLACTIS/aux_info/meta_info.json "BLACTIS_meta_info.json"
  fi

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:FASTQ_SUBSAMPLE_FQ_SALMON:SALMON_QUANT":
      salmon: $(echo $(salmon --version) | sed -e "s/salmon //g")
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  -----------------------------------------
  | Loading contig table | Time = 7.4997 s
  -----------------------------------------
  size = 37303049
  -----------------------------------------
  | Loading contig offsets | Time = 68.5 ms
  -----------------------------------------
  -----------------------------------------
  | Loading reference lengths | Time = 408.96 us
  -----------------------------------------
  -----------------------------------------
  | Loading mphf table | Time = 644.68 ms
  -----------------------------------------
  size = 3784757355
  Number of ones: 37303048
  Number of ones per inventory item: 512
  Inventory entries filled: 72858
  -----------------------------------------
  | Loading contig boundaries | Time = 2.8887 s
  -----------------------------------------
  size = 3784757355
  -----------------------------------------
  | Loading sequence | Time = 459.83 ms
  -----------------------------------------
  size = 2665665915
  -----------------------------------------
  | Loading positions | Time = 4.9999 s
  -----------------------------------------
  size = 3536480116
  -----------------------------------------
  | Loading reference sequence | Time = 549.67 ms
  -----------------------------------------
  -----------------------------------------
  | Loading reference accumulative lengths | Time = 1.0233 ms
  -----------------------------------------
  [2023-10-25 14:40:38.647] [jointLog] [info] done
  [2023-10-25 14:40:38.700] [jointLog] [info] Index contained 252334 targets
  [2023-10-25 14:40:38.745] [jointLog] [info] Number of decoys : 194
  [2023-10-25 14:40:38.745] [jointLog] [info] First decoy index : 252099 

Additionally, it states

hits: 27, hits per frag:  2.74969e-05[2023-10-25 14:40:46.365] [jointLog] [warning] salmon was only able to assign 8 fragments to transcripts in the index, but the minimum number of required assigned fragments (--minAssignedFrags) was 10. This could be indicative of a mismatch between the reference and sample, or a very bad sample.  You can change the --minAssignedFrags parameter to force salmon to quantify with fewer assigned fragments (must have at least 1).

Work dir:
  /data/Proyectos/EVs/work/93/093cbb3b93b8a604fe15259850ed46

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

I don't really know what can I do, aside from removing the bacterial controls. I tried --aligner star_rsem, --skip_qualimap and --skip_pseudo_aligment and the error is the same. Can I run the pipeline without running salmon, or something like that?

Command used and terminal output

nextflow run \
    nf-core/rnaseq \
    -r 3.12.0 \
    -profile docker \
    --input $CWD/samples_rnaseq.csv \
    --outdir $RESULTS_RNASEQ \
    --aligner star_salmon \
    --skip_bbsplit \
    --star_index $DATABASE_DIR/genome/index/star \
    --salmon_index $DATABASE_DIR/genome/index/salmon \
    --extra_salmon_quant_args "--minAssignedFrags 1" \
    --rsem_index $DATABASE_DIR/genome/rsem \
    --save_unaligned \
    --skip_qualimap \
    --skip_pseudo_alignment \
    --max_cpus $NUM_CPUS \
    --max_memory $NUM_RAM \
    --max_time $MAX_TIME \
    --fasta $DATABASE_DIR/genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
    --gtf $DATABASE_DIR/genome/Homo_sapiens.GRCh38.$VERSION_ENSEMBLE_GENOME.gtf \
    --save_reference

Relevant files

nextflow.log

System information

Nextflow version: 23.04.1.5866 Hardware: Desktop Executor: local Container engine: Docker OS: Ubuntu 22.04 Version of nf-core/rnaseq: v3.12.0

MatthiasZepper commented 10 months ago

The reason why your attempts to override --minAssignedFrags 1 via extra_salmon_quant_args fail is, that the pipeline stops much earlier, namely during the infer strandedness step.

To skip that step, you just have to provide the strandedness of your samples upfront in the samplesheet (forward or reverse instead of auto).

If the strandedness is unknown to you and you must run it, then provide a custom config with a process selector

    withName: '.*:FASTQ_SUBSAMPLE_FQ_SALMON:SALMON_QUANT' {
        ext.args   = '--skipQuant --minAssignedFrags 1'
        }

Apart from this problem specifically, I wonder if the dualrnaseq pipeline wouldn't be more applicable to the research question you are seemingly working on?

alexmascension commented 10 months ago

Hi @MatthiasZepper , so far adding strandness seems to be doing the trick. I will keep you informed if anything happens. Regarding the dualrnaseq, seems like a good option, although I need to do a more specific analysis and does not quite fit.

Thanks!