ndaniel / fusioncatcher

Finder of Somatic Fusion Genes in RNA-seq data
GNU General Public License v3.0

Error due to ASCII-64/33 conversion at step 101 #197

Open pkerbs opened 2 years ago

pkerbs commented 2 years ago

Hello, I am trying to run FusionCatcher 1.33 on the samples:

SRR5484560 -> https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5484560
SRR5484561 -> https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5484561

However, both fail at step 101. It seems BBDuk gets confused by N bases that have a high Phred score and switches the ASCII encoding it assumes for the quality values. Is this the issue? Would you have a solution for this? Thanks in advance.

The Error messages:

for SRR5484560:

ERROR: Workflow execution failed at step 101 while executing:
----------------
   bbduk.sh \
   forcetrimmod=5 \
   in=/outputfolder/fusioncatcher/SRR5484560/orig__.fq \
   out=/outputfolder/fusioncatcher/SRR5484560/orig__x.fq
----------------
  * Size '/outputfolder/fusioncatcher/SRR5484560/orig__.fq' = 4654908938 bytes
  * Size '/outputfolder/fusioncatcher/SRR5484560/orig__x.fq' = 0 bytes

Executing second time the same step/command in order to capture error messages (i.e. STDERR)...

-------------------------------------------
java -ea -Xmx40020m -Xms40020m -cp /tools/bbmap/current/ jgi.BBDuk forcetrimmod=5 in=/outputfolder/fusioncatcher/SRR5484560/orig__.fq out=/outputfolder/fusioncatcher/SRR5484560/orig__x.fq
Executing jgi.BBDuk [forcetrimmod=5, in=/outputfolder/fusioncatcher/SRR5484560/orig__.fq, out=/outputfolder/fusioncatcher/SRR5484560/orig__x.fq]
Version 38.44

Changed from ASCII-33 to ASCII-64 on input quality @ (Q31) for base N at lines 1 and 3, position 96 while prescanning.
Changed from ASCII-64 to ASCII-33 on input quality 4 (Q-12) for base C at lines 5 and 7, position 36 while prescanning.
Exception in thread "main" java.lang.AssertionError: ASCII encoding for quality (currently ASCII-33) appears to be wrong for input quality 21 for base C at lines 5 and 7, position 36.  Please manually set qin=33 or qin=64.
ATTCATGCCACCGCTTACTATAAAGTGGACGACCCAGTGTGGAACATTCAAATTGCAAGGATGCTTGAGCTGCCCACTATCTACAGGAAAGTTTATNNNN
@SRR5484560.3/2
[64, 83, 82, 82, 53, 52, 56, 52, 53, 54, 48, 46, 51, 47, 50]
    at stream.FASTQ.testQuality(FASTQ.java:219)
    at fileIO.FileFormat.testInterleavedAndQuality(FileFormat.java:491)
    at fileIO.FileFormat.testInterleavedAndQuality(FileFormat.java:408)
    at fileIO.FileFormat.testFormat(FileFormat.java:343)
    at fileIO.FileFormat.<init>(FileFormat.java:197)
    at fileIO.FileFormat.testInput(FileFormat.java:150)
    at fileIO.FileFormat.testInput(FileFormat.java:143)
    at fileIO.FileFormat.testInput(FileFormat.java:128)
    at jgi.BBDuk.<init>(BBDuk.java:935)
    at jgi.BBDuk.main(BBDuk.java:76)

and for SRR5484561:

java -ea -Xmx77055m -Xms77055m -cp /tools/bbmap/current/ jgi.BBDuk forcetrimmod=5 in=/outputfolder/fusioncatcher/SRR5484561/orig__.fq out=/outputfolder/fusioncatcher/SRR5484561/orig__x.fq
Executing jgi.BBDuk [forcetrimmod=5, in=/outputfolder/fusioncatcher/SRR5484561/orig__.fq, out=/outputfolder/fusioncatcher/SRR5484561/orig__x.fq]
Version 38.44

Changed from ASCII-33 to ASCII-64 on input quality B (Q33) for base N at lines 1 and 3, position 99 while prescanning.
Changed from ASCII-64 to ASCII-33 on input quality : (Q-6) for base C at lines 5 and 7, position 92 while prescanning.
Exception in thread "main" java.lang.AssertionError: ASCII encoding for quality (currently ASCII-33) appears to be wrong for input quality 27 for base C at lines 5 and 7, position 92.  Please manually set qin=33 or qin=64.
GGGTATTACTATGAAGAAGATTATTACAAATGCATGGGCTGTGACGATAACGTTGTAGATGTGGTCGTTACCTAGAAGGTTGCCTGGCTGGCCCANNNNN
@SRR5484561.2/2
[64, 83, 82, 82, 53, 52, 56, 52, 53, 54, 49, 46, 50, 47, 50]
    at stream.FASTQ.testQuality(FASTQ.java:220)
    at fileIO.FileFormat.testInterleavedAndQuality(FileFormat.java:521)
    at fileIO.FileFormat.testInterleavedAndQuality(FileFormat.java:436)
    at fileIO.FileFormat.testFormat(FileFormat.java:371)
    at fileIO.FileFormat.<init>(FileFormat.java:220)
    at fileIO.FileFormat.testInput(FileFormat.java:162)
    at fileIO.FileFormat.testInput(FileFormat.java:144)
    at fileIO.FileFormat.testInput(FileFormat.java:129)
    at jgi.BBDuk.<init>(BBDuk.java:928)
    at jgi.BBDuk.main(BBDuk.java:78)
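
For readers following the two logs above: the Q values BBDuk reports are the ASCII code of the quality character minus the assumed offset (33 or 64). The short sketch below redoes that arithmetic for the four characters quoted in the "Changed from" lines, and also decodes the integer list printed under the first assertion as ASCII codes. The helper name `phred` is made up for this illustration, and the sketch says nothing about why BBDuk ends up in this state.

```python
# Phred quality = ASCII code of the quality character minus the offset (33 or 64).
def phred(char: str, offset: int) -> int:
    return ord(char) - offset

# Quality characters quoted in the "Changed from ..." lines of the two logs.
for char in "@4B:":
    print(repr(char), "ASCII-33 ->", phred(char, 33), "| ASCII-64 ->", phred(char, 64))

# Integer list printed under the assertion for SRR5484560, read as ASCII codes.
codes = [64, 83, 82, 82, 53, 52, 56, 52, 53, 54, 48, 46, 51, 47, 50]
decoded = bytes(codes).decode("ascii")
print(decoded)
```

Under ASCII-33, '@' is Q31 and 'B' is Q33, while under ASCII-64, '4' is Q-12 and ':' is Q-6, which matches the numbers in the log. The integer list decodes to the same read name that is printed on the line just above it.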

ndaniel commented 2 years ago

Hi @pkerbs

I have not been able to replicate the issue, possibly because different versions of BBDuk are being used. I used BBMap version 38.44 here.

In your case, FusionCatcher uses BBDuk to force-trim the input reads from 101 bp to 100 bp. If the reads are trimmed to 100 bp beforehand, FusionCatcher will not do any forced trimming with BBDuk. Another way around this is the command-line option --skip-trim-multiple-5, which disables the forced trimming.

pkerbs commented 2 years ago

Hey Daniel, thanks for your response. All the input reads are 100 bp long, not 101, so the forced trimming you describe should not happen. Just for testing, I have now trimmed them to 99 bp and run it again: still the same issue. I also used BBMap 38.44.

I would like to avoid the --skip-trim-multiple-5 parameter so that the results stay consistent with other samples. Do you maybe have another idea what to do? It seems that BBDuk doesn't like 'N' bases with a high quality score and falsely changes the ASCII encoding. Do you maybe know of a script that changes the quality of all N bases in a FASTQ file?

Best wishes, Paul
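
On the question of a script that changes the quality of all N bases: a minimal sketch of one way to do it is shown below. It assumes strict 4-line FASTQ records and ASCII-33 qualities, and it replaces the quality at each N position with '#' (Q2), which is an arbitrary choice. The names `mask_n_qualities` and `mask_fastq` are made up for this example, and whether masking actually avoids the BBDuk error in this thread is untested.

```python
LOW_QUAL = "#"  # Q2 under ASCII-33; an arbitrary low value chosen for this sketch

def mask_n_qualities(seq: str, qual: str, low: str = LOW_QUAL) -> str:
    """Return qual with the character at every N position replaced by low."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return "".join(low if base in ("N", "n") else q for base, q in zip(seq, qual))

def mask_fastq(lines, low: str = LOW_QUAL):
    """Yield FASTQ lines with N qualities masked; expects strict 4-line records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield from (header, seq, plus, mask_n_qualities(seq, qual, low))
            record = []
    if record:
        raise ValueError("incomplete FASTQ record at end of input")

# Toy record (not taken from the samples in this thread).
toy = ["@r1\n", "ACGTNNAC\n", "+\n", "IIII@@II\n"]
masked = list(mask_fastq(toy))
print("\n".join(masked))
```

To apply it to a real file, the same generator can be fed an open file handle and its output written line by line to a new file. Keeping the per-read logic in `mask_n_qualities` makes it easy to check on a single read first.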

ndaniel commented 2 years ago

Hi @pkerbs

Indeed, you are right: the reads are 100 bp long.

As I wrote before, I was not able to reproduce the bug that you reported. Here is how I tried to reproduce it:

mkdir fq
cd fq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR548/000/SRR5484560/SRR5484560_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR548/000/SRR5484560/SRR5484560_2.fastq.gz
cd ..
fusioncatcher.py -i fq -o results

Maybe something went wrong while downloading the FASTQ files and they are somehow broken?

For example, if I search for the read above like this:

zgrep -2 ATTCATGCCACCGCTTACTATAAAGTGGACGACCCAGTGTGGAACATTCAAATTGCAAGGATGCTTGAGCTGCCCACTATCTACAGGAAAGTTTATNNNN *.gz

I get this:

@SRR5484560.3 HWI-ST942:51:C3CGDACXX:2:1101:1459:2183/1
ATTCATGCCACCGCTTACTATAAAGTGGACGACCCAGTGTGGAACATTCAAATTGCAAGGATGCTTGAGCTGCCCACTATCTACAGGAAAGTTTATNNNN
+
?@@FFFDEHHDFHIIGGDHGDHHII<AB9CAGGHHHG3?9?B<BBHEBFFIIICHGHIGIEHHFEHD>DBEEEEDD@AACCCDDCDD9AA<ACDD@@ACA

pkerbs commented 2 years ago

Hi Daniel, grepping for the read yielded this result:

@SRR5484560.3 HWI-ST942:51:C3CGDACXX:2:1101:1459:2183 length=100
ATTCATGCCACCGCTTACTATAAAGTGGACGACCCAGTGTGGAACATTCAAATTGCAAGGATGCTTGAGCTGCCCACTATCTACAGGAAAGTTTATNNNN
+SRR5484560.3 HWI-ST942:51:C3CGDACXX:2:1101:1459:2183 length=100
?@@FFFDEHHDFHIIGGDHGDHHII<AB9CAGGHHHG3?9?B<BBHEBFFIIICHGHIGIEHHFEHD>DBEEEEDD@AACCCDDCDD9AA<ACDD@@ACA

The FASTQ seems OK; it's just formatted a little differently, since I downloaded it years ago through the SRA Toolkit.

I have now downloaded the same files as you did from the FTP server, and I still get the same error when I run them. I don't know what else could be the issue.

EDIT: I just noticed that I didn't include the previous text of the error message:

ERROR: Workflow execution failed at step 101 while executing:
----------------
   bbduk.sh \
   forcetrimmod=5 \
   in=/outputfolder/fusioncatcher/SRR5484560/orig__.fq \
   out=/outputfolder/fusioncatcher/SRR5484560/orig__x.fq
----------------
  * Size '/outputfolder/fusioncatcher/SRR5484560/orig__.fq' = 4654908938 bytes
  * Size '/outputfolder/fusioncatcher/SRR5484560/orig__x.fq' = 0 bytes

Executing second time the same step/command in order to capture error messages (i.e. STDERR)...

-------------------------------------------

orig__x.fq seems to have size 0 at the BBDuk step. Maybe there is already an issue before BBDuk? Thanks for your time.
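
One way to test whether something goes wrong before BBDuk would be to check the structure of the input file it receives (orig__.fq in the log). The sketch below, under the assumption of strict 4-line FASTQ records, reports the first record whose header, separator line, or sequence and quality lengths look wrong. The function name `first_bad_record` is made up for this example; it is not part of FusionCatcher or BBTools.

```python
def first_bad_record(lines):
    """Return (record_index, reason) for the first malformed 4-line record, or None."""
    record = []
    index = 0  # 0-based index of the record currently being read
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        header, seq, plus, qual = record
        if not header.startswith("@"):
            return index, "header does not start with '@'"
        if not plus.startswith("+"):
            return index, "third line does not start with '+'"
        if len(seq) != len(qual):
            return index, "sequence and quality lengths differ"
        record = []
        index += 1
    if record:
        return index, "incomplete record at end of input"
    return None

# Toy input (not taken from the samples in this thread).
good = ["@r1", "ACGT", "+", "IIII"]
bad = good + ["@r2", "ACGT", "+", "III"]
print(first_bad_record(good))
print(first_bad_record(bad))
```

Passing an open file handle for orig__.fq instead of the toy lists would scan the real file. A result of None only means the 4-line layout is intact; it does not rule out other problems.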