timbitz / Whippet.jl

Lightweight and Fast; RNA-seq quantification at the event-level
MIT License
105 stars 21 forks source link

Bug with whippet_quant and --stranded option? #74

Open mblue9 opened 5 years ago

mblue9 commented 5 years ago

Hello,

I've been trying out whippet v0.11 with some stranded RNA-seq data and obtained a weird result. With whippet_quant.jl and --stranded I get only a small % of reads mapping (~7%) e.g.

Whippet v0.11 loading and compiling... 
 15.869018 seconds
Loading splice graph index... .julia/v0.6/Whippet/index/graph.jls
  4.219895 seconds (3.44 M allocations: 435.519 MiB, 12.34% gc time)
Processing reads from file...
FASTQ_1: Splicing/Cutadapt/1-D_R1_cutadapt.gz
FASTQ_2: Splicing/Cutadapt/1-D_R2_cutadapt.gz
1383.427635 seconds (1.73 G allocations: 143.844 GiB, 2.40% gc time)
Finished mapping 338181 paired-end reads of length 123 each out of a total of 47367828 mate-pairs...
Calculating expression values and MLE of equivalence classes with EM:
- 847 multi-gene mapping read equivalence classes...
- 6214 multi-isoform equivalence classes...
  0.785484 seconds (38.35 k allocations: 2.397 MiB)
Finished calculating transcripts per million (TpM) after 1067 iterations of EM...
Assigning multi-mapping reads based on maximum likelihood estimate..
  0.092932 seconds (51.44 k allocations: 2.464 MiB)
Calculating maximum likelihood estimate of events...
  6.807363 seconds (19.33 M allocations: 1.178 GiB, 4.01% gc time)
Whippet v0.11 done.
1405.947481 seconds (1.76 G allocations: 146.010 GiB, 2.44% gc time)

But if I remove the --stranded parameter and then run, I then get a lot more reads mapping e.g. below (although only ~50% not sure if that's lower than expected?). Do you know why that might be?

(this is definitely data from a stranded assay and I had checked it previously with Infer Experiment which identified it as stranded reverse)

Whippet v0.11 loading and compiling... 
 16.773038 seconds
Loading splice graph index... .julia/v0.6/Whippet/index/graph.jls
  2.934588 seconds (3.44 M allocations: 435.523 MiB, 15.44% gc time)
Processing reads from file...
FASTQ_1: Splicing/Cutadapt/1-D_R1_cutadapt.gz
FASTQ_2: Splicing/Cutadapt/1-D_R2_cutadapt.gz
2313.628999 seconds (3.03 G allocations: 201.356 GiB, 4.83% gc time)
Finished mapping 24738380 paired-end reads of length 135 each out of a total of 47367828 mate-pairs...
Calculating expression values and MLE of equivalence classes with EM:
- 7683 multi-gene mapping read equivalence classes...
- 12012 multi-isoform equivalence classes...
  1.060150 seconds (38.00 k allocations: 2.360 MiB, 2.16% gc time)
Finished calculating transcripts per million (TpM) after 720 iterations of EM...
Assigning multi-mapping reads based on maximum likelihood estimate..
  0.148666 seconds (187.16 k allocations: 6.251 MiB)
Calculating maximum likelihood estimate of events...
 29.853437 seconds (62.21 M allocations: 3.689 GiB, 5.37% gc time)
Whippet v0.11 done.
2359.321552 seconds (3.11 G allocations: 206.292 GiB, 4.84% gc time)
timbitz commented 5 years ago

Interesting. I honestly have never heard of stranded-reverse data before, so --stranded assumes forward orientation, as the -h flag indicates:

  --stranded            Is the data strand specific in fwd
                        orientation? If so, increase speed with this
                        flag

Maybe the best solution here is to split it into two parameters, --stranded-fwd and --stranded-rev? Alternatively, not using the stranded flag should be fine, depending on the quality and type of data/preparation you have, your mapping rates there aren't the worst-- Whippet is fairly stringent in general.

mblue9 commented 5 years ago

Ah sorry I hadn't see the fwd in the help. Thanks for the info. I've run it as unstranded for the moment.

Splitting into 2 parameters --stranded-fwd and --stranded-rev could be good as reverse stranded (also known as first strand) is pretty common I think. It's what you get from all dUTP methods afaik including Illumina TruSeq e.g. see here: https://chipster.csc.fi/manual/library-type-summary.html