rajewsky-lab / mirdeep2

Discovering known and novel miRNAs from small RNA sequencing data
GNU General Public License v3.0
135 stars 49 forks source link

sRNA spike-ins #116

Open mschilli87 opened 10 months ago

mschilli87 commented 10 months ago

Dear @Drmirdeep,

More and more sRNA-seq projects utilize RNA spike-ins as controls.

What is the recommended workflow to quantify these along miRNAs using miRDeep2's quantifier module? I added the corresponding sequences to the mature miRNA input FASTA but they don't get picked up. I am guessing I need to at least make up some 'hairpins' for them as well (How? Can I just use the mature sequence or do I need to actually craft legit hairpin-structures to bypass some internal filtering? And what do I do with spike-ins that are longer than your typical mature miRNA but well within the read length - say 35nt RNA spike-in @ 51 cycles single-end Illumina sequencing?). But will I need to make up a fake chromsome as well that hosts those fake hairpins and re-build an new genome bowtie index just to quantify the spike-ins along endogenous miRNAs?

Any advise would be highly appreciated.

Also, would you be up to adding a new optional input FASTA with 'mature' spike-in sequences that are quantied as-is by the quantifier module?

Drmirdeep commented 10 months ago

You will just need your spikein sequences in your mature file and precursor file definition (length doesn't matter)

cat mature.fa
>spikein1
CTACGATCGATGCTAGCTAGCAGGCGATCAGCATCGATGATCGATTA

cat pres.fa
>spikein1
CTACGATCGATGCTAGCTAGCAGGCGATCAGCATCGATGATCGATTA

just run it in its most simple form and give the mature.fa as the precursor input as well

perl quantifier.pl -r reads_collapsed.fa -m mature.fa -p mature.fa

so just add your spikins to your mature file and to the precursor file

mschilli87 commented 10 months ago

Thank you. I can confirm that copying the 'mature' sequences to the 'precursor' FASTA results in miRDeep2 picking up the RNA spike-ins for quantification.

mschilli87 commented 7 months ago

Sorry for re-opening this: I noticed that only reads matching the spike-in exactly are quantified using this approach. I noticed because the results did not add up and by a simple grep 'analysis' I found that most reads in my data that contain an exact match of a specific spike-in contain a final T after pre-processing. I do not know if this could be an actual problem with the spike-ins or is just an artifact from insufficient adapter-removal but I found it weird that the exact 'mature' sequence plus a single (non-templated) nucleotide would not be picked up mir miRDeep2. Now I am wondering if that is due to the processed read being longer than the 'precursor' sequence of that spike-in. Any advice on how to adjust my input or tweak the parameters to pick those up?


edit: Just a quick update: It looks like indeed I missed a leading T in the 3' adapter sequence during pre-processing. I'll try to see if this is enough to fix the issue. Still, I am curious if this is expected as untemplated bases added to miRNAs post-transcriptionally have been shown to be biologically relevant so it would be good to know if miRDeep2 misses those by default or if this is related to the precursor=mature hack I used for the spike-ins.

Drmirdeep commented 7 months ago

Then just put a few A‘s in the end of the precursor. Mismatches can be allowed up to 2 if I remember correctly. If the mature is longer than the precursor then bowtie will fail to map I guess

mschilli87 commented 7 months ago

That's what I am currently guessing as well. Do you know if N's would be allowed as well? I would prefer not to bias the mapper towards any specific untemplated base. Thank you for the fast response.

mschilli87 commented 7 months ago

I can confirm the spike-in counts changed by orders of magnitude while miRNA counts remains almost unchanged. So the extreme sensitivity to non-templated bases appears to be an issue for the spike-ins only, likely due to what was discussed here. I will leave this issue open until as a reminder to update it with my final conclusions in case anyone else searches this tracker looking for advice how to include spike-ins.

mschilli87 commented 7 months ago

When padding the 'mature' spike-ins with ten N's on either side for the 'hairpins' I do not get any spike-in quantification. IIRC, miRDeep2 discards sequences with non ACTG(U?) characters (at least by default). ~@Drmirdeep: Would you expect A's to perform better and if so, do you share my concerns regarding the bias introduced by those?~


edit: As expected, replacing the N's with A's did yield spike-in results again, with slightly more reads mapping to them as before. I feel like some padding should be done to remove the bias agains spike-ins (compared to endogenous miRNAs) in case of imperfect pre-processing but I am concerned that picking a specific nucleotides introduces another bias. @Drmirdeep: Could you provide your point of view?

Drmirdeep commented 7 months ago

If the preprocessing is the problem then this should be solved on the preprocessing side not on a tool that expects the input data to follow certain conventions

mschilli87 commented 7 months ago

I guess I was not clear enough: The pre-processing issue was the reason the quantification was off orders of magnitude. That is fixed and not my issue. However, that made me realise that a read that, for whatever reason, has an extra base maps just fine for actual miRNAs but not for spike-ins. The whole point of spike-ins is to serve as controls. So they should be treated as close to identitcal as possible. So in a way the problem is that my pre-processing mistake affected spike-ins but not miRNAs (or at least much less). Padding the spike ins with A's helps, because now they are treated more similar as miRNAs. But as along as I fix that padding, there is still a bias that is spike-in specific. And ideally, nothing should be specific for the thing you normalize to. 😉