ndaniel / fusioncatcher

Finder of Somatic Fusion Genes in RNA-seq data
GNU General Public License v3.0

Too many reads have been removed during the pre-filtering steps #175

Closed t-neumann closed 3 years ago

t-neumann commented 3 years ago

Hi Daniel,

I get the following error for some TCGA RNA-seq datasets (info.txt):

Count of all short reads after removing reads due to missing their mate read:
-----------------------------------------------------------------------------
0

ERROR: Too many reads have been removed during the pre-filtering steps!
Please, check that the input files are from a RNA-seq dataset with pair-reads
or that the input files are given correctly!Please, check that also the input reads have the same length!

The read set itself looks good to me. Could the missing quality scores be the cause?

@UNC10-SN254:498:C2KTUACXX:7:2101:8998:55606
TTTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCGTGTTGGCCAGGA
+
""""""""""""""""""""""""""""""""""""""""""""""""
@UNC10-SN254:498:C2KTUACXX:7:2202:10087:13624
CCCGCCTCGGCCTCCCAAATTGCTGGGATTACAGATGTGAGCCACCGC
+
""""""""""""""""""""""""""""""""""""""""""""""""

The read set I'm running (pre-filtered to reads from oncogenes) is linked below:

https://tinyurl.com/y4xj8vzo

ndaniel commented 3 years ago

Very likely, all reads are being removed for very poor quality: the Sanger quality character " encodes a Phred score of only 1. Replacing every " with I (Phred 40) should most likely fix this.
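A minimal sketch of that replacement, assuming a plain 4-line-per-record FASTQ with Unix line endings and Sanger (Phred+33) encoding. The helper names `phred33` and `replace_qualities` are made up for illustration and are not part of FusionCatcher:

```python
import io


def phred33(ch):
    """Decode one Sanger (Phred+33) quality character to its score."""
    return ord(ch) - 33


def replace_qualities(src, dst, new_char="I"):
    """Copy a FASTQ stream, overwriting every quality line with new_char.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    for i, line in enumerate(src):
        if i % 4 == 3:  # the fourth line of each record holds the qualities
            body = line.rstrip("\n")
            newline = "\n" if line.endswith("\n") else ""
            line = new_char * len(body) + newline
        dst.write(line)


# Hypothetical one-record input, shortened for readability.
src = io.StringIO('@read1\nACGT\n+\n""""\n')
dst = io.StringIO()
replace_qualities(src, dst)
fixed = dst.getvalue()
```

Here `phred33('"')` evaluates to 1 and `phred33("I")` to 40, which is why the original reads look like very poor quality to a quality filter. For real files, the same function can be fed open file handles instead of `io.StringIO` objects.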

t-neumann commented 3 years ago

Yup, that fixed it. I checked, and there are no base quality scores provided in the TCGA BAMs, so a samtools fastq conversion automatically assigns the lowest score to them.

This samtools fastq parameter fixes it:

  -v INT               default quality score if not given in file [1]

so I used -v 40.
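For anyone hitting the same error, a quick way to tell whether a converted FASTQ carries only placeholder qualities might look like the sketch below. It assumes a 4-line-per-record FASTQ, and the function name `all_placeholder_quality` is hypothetical:

```python
import io


def all_placeholder_quality(fastq, placeholder='"'):
    """Return True if every quality line consists solely of `placeholder`.

    Returns False for an empty stream or as soon as any other
    quality character is seen.
    """
    seen_record = False
    for i, line in enumerate(fastq):
        if i % 4 == 3:  # quality line of the current record
            qual = line.rstrip("\n")
            if set(qual) != {placeholder}:
                return False
            seen_record = True
    return seen_record


# Hypothetical records: one with placeholder qualities, one with real ones.
bad = io.StringIO('@read1\nACGT\n+\n""""\n')
good = io.StringIO('@read1\nACGT\n+\nIIII\n')
bad_result = all_placeholder_quality(bad)
good_result = all_placeholder_quality(good)
```

If this returns True for a file, the qualities were most likely filled in by the conversion default rather than taken from the BAM, and rerunning the conversion with a higher -v value is worth trying.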