timbitz / Whippet.jl

Lightweight and Fast; RNA-seq quantification at the event-level
MIT License
105 stars 21 forks source link

Error quantifying psi with fq.gz #94

Closed jordi-vaquero closed 4 years ago

jordi-vaquero commented 4 years ago

I am trying to quantify some files with the following line of code,

julia whippet-quant.jl /dir/SRR814563_1.fq.gz /dir/SRR814563_2.fq.gz -x /dir/index.hg38.fa.gz.jls --biascorrect

But it failed with the following error

Whippet v0.11.1 loading and compiling...
 15.322103 seconds
Loading splice graph index... /dir/index.hg38.fa.gz.jls
 10.126881 seconds (8.43 M allocations: 1.061 GiB, 11.62% gc time)
Processing reads from file...
FASTQ_1: /dir/SRR814563_1.fq.gz
FASTQ_2: /dir/SRR814563_2.fq.gz
ERROR: LoadError: BioSequences.FASTQ.Reader file format error on line 1 ~>"SRR81456"
Stacktrace:
 [1] _read!(::BioSequences.FASTQ.Reader, ::BioCore.Ragel.State{BufferedStreams.BufferedInputStream{BufferedStreams.BufferedInputStream{Libz.Source{:inflate,BufferedStreams.BufferedInputStream{IOStream}}}}}, ::BioSequences.FASTQ.Record) at /home/jordi/.julia/v0.6/BioCore/src/ReaderHelper.jl:164
 [2] read! at /home/jordi/.julia/v0.6/BioCore/src/ReaderHelper.jl:134 [inlined]
 [3] read! at /home/jordi/.julia/v0.6/Whippet/src/record.jl:11 [inlined]
 [4] read_chunk! at /home/jordi/.julia/v0.6/Whippet/src/reads.jl:17 [inlined]
 [5] #process_paired_reads!#62(::Int64, ::Bool, ::Int64, ::Requests.ResponseStream{TCPSocket}, ::Requests.ResponseStream{TCPSocket}, ::Bool, ::Function, ::BioSequences.FASTQ.Reader, ::BioSequences.FASTQ.Reader, ::Whippet.AlignParam, ::Whippet.GraphLib, ::Whippet.GraphLibQuant{Whippet.SGAlignPaired,Whippet.JointBiasCounter}, ::Whippet.MultiMapping{Whippet.SGAlignPaired,Whippet.JointBiasCounter}, ::Whippet.JointBiasMod) at /home/jordi/.julia/v0.6/Whippet/src/reads.jl:151
 [6] (::Whippet.#kw##process_paired_reads!)(::Array{Any,1}, ::Whippet.#process_paired_reads!, ::BioSequences.FASTQ.Reader, ::BioSequences.FASTQ.Reader, ::Whippet.AlignParam, ::Whippet.GraphLib, ::Whippet.GraphLibQuant{Whippet.SGAlignPaired,Whippet.JointBiasCounter}, ::Whippet.MultiMapping{Whippet.SGAlignPaired,Whippet.JointBiasCounter}, ::Whippet.JointBiasMod) at ./<missing>:0
 [7] macro expansion at /home/jordi/.julia/v0.6/Whippet/src/timer.jl:5 [inlined]
 [8] main() at /home/jordi/.julia/v0.6/Whippet/bin/whippet-quant.jl:169
 [9] include_from_node1(::String) at ./loading.jl:576
 [10] include(::String) at ./sysimg.jl:14
 [11] process_options(::Base.JLOptions) at ./client.jl:305
 [12] _start() at ./client.jl:371
while loading /home/jordi/.julia/v0.6/Whippet/bin/whippet-quant.jl, in expression starting on line 5

I checked the format of the fq.gz files and it is proper, and that has been just downloaded with sra-tools


SRR814563_2.fastq000066401110530151376�*�=13603552164014057 0ustar  jordi@SRR814563.1 1 length=76
GTTGAACTCCCTCATTTTCCATTCCCCAGCATTGGCGGGTTCTGGGGCGGGTGGCGGGGGGGGGTCGCTTGGGTTT
+SRR814563.1 1 length=76
B@@FDBDBFFHHHIGIIIGIJIEEFGGEGGEHG3???FH#####################################
@SRR814563.2 2 length=76
AGCGCGTCCGGGGGGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGGGGGGCCGGCGCGGGGGGAGGGACACATGCTA
+SRR814563.2 2 length=76
=8?+B:@?F1C?F(606<0(6:6B-:B#################################################
@SRR814563.3 3 length=76
CCCCAACCCCCGCCCGTAGGCTTGCGGCTCTGCGCCTGCGCCACGCCTCCACCCCTGGACGCGCTAGCATGTGTCT```

Please can you help me with that?
timbitz commented 4 years ago

I'm not sure if this is a typo in what you pasted into github, or what is actually in your file, but the pasted bit is not in correct format, as it starts with SRR814563_2.fastq00006... and not an @ symbol.

Can you double check that?

jordi-vaquero commented 4 years ago

Hi, yes that is a header in the file included by the sratool when you generate the files using it.

timbitz commented 4 years ago

That is really strange. I have three comments:

(1) To my knowledge there is no such thing as a FASTQ header... as far as I understand it the format is defined as consecutive FASTQ entries which are 4 lines in total. Are you sure you aren't somehow capturing some text into those files that shouldn't be there with whatever you're doing on the command line? For example, the "header" appears to have your userid jordi in it??

(2) If you are completely certain this is a legitimate header that is supposed to be there... then feel free to open an issue at http://github.com/BioJulia/BioSequences.jl -- the error you are getting

ERROR: LoadError: BioSequences.FASTQ.Reader file format error on line 1 ~>"SRR81456"
Stacktrace:
 [1] _read!(::BioSequences.FASTQ.Reader, ::BioCore.Ragel.State{BufferedStreams.BufferedInputStream{BufferedStreams.BufferedInputStream{Libz.Source{:inflate,BufferedStreams.BufferedInputStream{IOStream}}}}}, ::BioSequences.FASTQ.Record) at /home/jordi/.julia/v0.6/BioCore/src/ReaderHelper.jl:164
 [2] read! at /home/jordi/.julia/v0.6/BioCore/src/ReaderHelper.jl:134 [inlined]

is an error thrown by the BioSequences.FASTQ.Reader, and not any code in Whippet... As of now (and in the future) Whippet uses the standard BioJulia parsers for standard file formats. If a file produced by a specific program doesn't fit the expected format of the standard parsers, then that's beyond the scope of this repository. But again, I'm skeptical that your FASTQ file is in proper format.

(3) You can download FASTQ files for every dataset in GEO from ebi.ac.uk without having to deal with sra format at all... to my knowledge, these all work with Whippet. I personally use this to avoid sratools entirely.

Hope that helps! I'm going to close this issue anyways, as I don't think it's related to any code in the Whippet.jl repository.