nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License

Inconsistent sequence and quality lengths in FASTQ files created by SortMeRNA #1456

Open cihanerkut opened 1 day ago

cihanerkut commented 1 day ago

Description of the bug

This issue has already been discussed in the SortMeRNA repository (https://github.com/sortmerna/sortmerna/issues/407).

STAR failed for one sample because the sequence and quality strings of one read have different lengths. After TrimGalore, the read looks like this:

@ST-K00265:389:HMJW3BBXY:1:2223:9039:32244 2:N:0:ATCCACTG+ACGCACCT
ATAAAGTTGAAGGCTACAAGAAGACCAAGGAAGCTGTTTTGCTCCTTAAGAAACTTAAAGCCTGGAATGATATCAAAAAGGTCTATGCCTCTCAGCGAATG
+
<A-AFF<FJJFFF<AAFJJ<FAJFF<JFFF-JJJJJJJJJJJ7JJJJJFJJFJJJJJJJF<JJJF<JFJJJFAJJFFFFJFJJJJAAJJJJJJJJJJJFAA

which becomes this after SortMeRNA:

@ST-K00265:389:HMJW3BBXY:1:2223:9039:32244 2:N:0:ATCCACTG+ACGCACCT
ATAAAGTTGAAGGCTACAAGAAGACCAAGGAAGCTGTTTTGCTCCTTAAGAAACTTAAAGCCTGGAATGATATCAAAAAGGTCTATGCCTCTCAGCGAATG
+
<A

I had to disable the SortMeRNA step to get the pipeline to run.

Would it be possible to add a failsafe that checks FASTQ integrity after each step that generates a FASTQ file and, if necessary, fixes the file on the fly? I suggest this as a general solution.
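To make the suggestion concrete, here is a minimal sketch of what such an integrity check could look like. It is not part of nf-core/rnaseq or SortMeRNA; the names `FastqProblem`, `open_fastq` and `check_fastq` are hypothetical, and it assumes strict four-line FASTQ records (no wrapped sequences). It only reports malformed records, since a truncated quality string like the one above cannot be reconstructed.

```python
import gzip
from typing import Iterator, NamedTuple, TextIO


class FastqProblem(NamedTuple):
    record_number: int  # 1-based position of the record in the file
    header: str         # header line of the offending record
    reason: str         # human-readable description of the problem


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def check_fastq(handle: TextIO) -> Iterator[FastqProblem]:
    """Yield one FastqProblem per malformed record.

    Assumes four lines per record: header, sequence, '+' separator, quality.
    """
    record_number = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        record_number += 1
        header = header.rstrip("\r\n")
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")

        if not header.startswith("@"):
            yield FastqProblem(record_number, header,
                               "header does not start with '@'")
        elif not plus.startswith("+"):
            yield FastqProblem(record_number, header,
                               "separator line does not start with '+'")
        elif len(seq) != len(qual):
            # This is the condition STAR rejects in the report above.
            yield FastqProblem(
                record_number, header,
                f"sequence length {len(seq)} != quality length {len(qual)}")
```

A step-level failsafe could run `check_fastq(open_fastq(path))` on each output file and either stop with a clear message or drop the reported records. For paired-end data the mate of a dropped read would have to be removed as well, so that both files stay in sync.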

Command used and terminal output

Command used:

nextflow run nf-core/rnaseq \
  -r 3.17.0 \
  -profile dkfz \
  --input samples.csv \
  --outdir ${PWD} \
  --genome null \
  --fasta ${REFERENCE_GENOME} \
  --gtf ${REFERENCE_GTF} \
  --additional_fasta ${REFERENCE_PHIX} \
  --gencode \
  --seq_center DKFZ \
  --remove_ribo_rna \
  --save_merged_fastq \
  --save_reference \
  --save_trimmed \
  --save_align_intermeds \
  --save_unaligned \
  --save_non_ribo_reads \
  --igenomes_ignore

Terminal output:

STAR version: 2.7.11b   compiled: 2024-07-03T14:39:20+0000 :/opt/conda/conda-bld/star_1720017372352/work/source
  Nov 20 22:22:10 ..... started STAR run
  Nov 20 22:22:10 ..... loading genome
  Nov 20 22:24:17 ..... processing annotations GTF
  Nov 20 22:24:45 ..... inserting junctions into the genome indices
  Nov 20 22:25:59 ..... started 1st pass mapping

Command error:
  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred

  EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length
  @ST-K00265:389:HMJW3BBXY:1:2223:9039:32244
  ATAAAGTTGAAGGCTACAAGAAGACCAAGGAAGCTGTTTTGCTCCTTAAGAAACTTAAAGCCTGGAATGATATCAAAAAGGTCTATGCCTCTCAGCGAATG
  <A
  SOLUTION: fix your fastq file

  Nov 20 22:31:36 ...... FATAL ERROR, exiting

Relevant files

No response

System information

Nextflow version: 24.10.1
Hardware: HPC
Executor: lsf
Container engine: Singularity
OS: CentOS 7
Version of nf-core/rnaseq: 3.17.0