Too many miRNAs? - Githubissues

jonbra commented 7 years ago

First of all, great program! Very easy to use and flexible.

I've been running the pipeline on a brown alga and I get 2037 predicted miRNAs! But most of them have a lot of reads mapping to the loop region, and I don't think those qualify as miRNAs. E.g. this one:

>miRNA-precursor_200 NODE_129_length_55666_cov_574.076:41848-41919 +
>> Read mappings for sample: thallus
5'->3'
UCGAUGUUCGAUGUUCGGUUUCUUGAUGUUUGGUUUACUCCGAAUAUCGAGAAACGAACAUCGAACAUCGU total_mapped_reads=3440
.((((((((((((((((.(((((((((((((((......))))))))))))))))))))))))))))))).
....TGTTCGATGTTCGGTTTCTTGATGTT......................................... depth=1, length=26
.....GTTCGATGTTCGGTTTCT................................................ depth=1, length=18
.....GTTCGATGTTCGGTTTCTTG.............................................. depth=4, length=20
.....GTTCGATGTTCGGTTTCTTGA............................................. depth=301, length=21
.....GTTCGATGTTCGGTTTCTTGAT............................................ depth=10, length=22
......TTCGATGTTCGGTTTCTT............................................... depth=9, length=18
......TTCGATGTTCGGTTTCTTG.............................................. depth=52, length=19
mmmmmmTTCGATGTTCGGTTTCTTGAmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm depth=1775, length=20 [mature]
......TTCGATGTTCGGTTTCTTGAT............................................ depth=16, length=21
......TTCGATGTTCGGTTTCTTGATG........................................... depth=1, length=22
.......TCGATGTTCGGTTTCTTG.............................................. depth=8, length=18
.......TCGATGTTCGGTTTCTTGA............................................. depth=177, length=19
.......TCGATGTTCGGTTTCTTGAT............................................ depth=2, length=20
........CGATGTTCGGTTTCTTGA............................................. depth=952, length=18
........CGATGTTCGGTTTCTTGAT............................................ depth=11, length=19
........CGATGTTCGGTTTCTTGATG........................................... depth=1, length=20
........CGATGTTCGGTTTCTTGATGTTTGG...................................... depth=1, length=25
.........GATGTTCGGTTTCTTGAT............................................ depth=86, length=18
.........GATGTTCGGTTTCTTGATG........................................... depth=2, length=19
.........GATGTTCGGTTTCTTGATGTTTGGT..................................... depth=2, length=25
.........GATGTTCGGTTTCTTGATGTTTGGTT.................................... depth=3, length=26
....................TCTTGATGTTTGGTTTACTCCGA............................ depth=2, length=23
...........................GTTTGGTTTACTCCGAATA......................... depth=1, length=19
..............................TGGTTTACTCCGAATATCGAG.................... depth=1, length=21
..............................TGGTTTACTCCGAATATCGAGA................... depth=3, length=22
..............................TGGTTTACTCCGAATATCGAGAA.................. depth=1, length=23
......................................TCCGAATATCGAGAAACG............... depth=1, length=18
.......................................CCGAATATCGAGAAACGAACATC......... depth=1, length=23
........................................CGAATATCGAGAAACGAACATC......... depth=1, length=22
..............................................TCGAGAAACGAACATCGAAC..... depth=1, length=20
...............................................CGAGAAACGAACATCGAAC..... depth=1, length=19
...............................................CGAGAAACGAACATCGAACA.... depth=1, length=20
ssssssssssssssssssssssssssssssssssssssssssssssssGAGAAACGAACATCGAACAssss depth=1, length=19 [star]
...................................................AAACGAACATCGAACATCG. depth=1, length=19
....................................................AACGAACATCGAACATCG. depth=9, length=18

But would you say that this one is a true miRNA?

>miRNA-precursor_1430 NODE_4145_length_8143_cov_145.377:505-686 -
>> Read mappings for sample: thallus
5'->3'
ACCGCGAGACUUUGACUUGAAACGGAGGAUUUCUCGAGAUACAAUGACUUCAGUCGUAAAUCGAGGUAUUUUAACGGUUUUCGGGUGUGAUUUUUCACCGAUAUCGUAGGGAAACACCUCGAUUUACGACUGAAGUCAUUGUAUCUCGAGAAAUCCUCCAUUUCAAGUCAAACUCUCGCGG   total_mapped_reads=34
.((((((((.((((((((((((.((((((((((((((((((((((((((((((((((((((((((((.((((.(((((..((((.((........)))))).)))))...)))).)))))))))))))))))))))))))))))))))))))))))))).)))))))))))).))))))))
sssssssssssssssssssssssssAGGATTTCTCGAGATACAATGsssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss   depth=2, length=21 [star]
..............................................................................TTTCGGGTGTGATTTTTCAC...................................................................................   depth=1, length=20
..........................................................................................................................................TTGTATCTCGAGAAATCCT........................   depth=1, length=19
..........................................................................................................................................TTGTATCTCGAGAAATCCTC.......................   depth=1, length=20
mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmTTGTATCTCGAGAAATCCTCCmmmmmmmmmmmmmmmmmmmmmm   depth=29, length=21 [mature]

underasail commented 5 years ago

I know this is an older issue, but the people behind ShortStack recently released a paper that updates the criteria for computationally annotating miRNAs (http://www.plantcell.org/content/30/2/272.long). I've written a couple of scripts that weed out false positives from miR-PREFeR based on this paper's mapping criteria and then produce quick structure images that I check manually against its structural criteria. Doing something similar should greatly reduce the number of false positives in your miR-PREFeR output.

jonbra commented 5 years ago

Thanks for your comment! Currently we are simply remapping the small RNA reads to the precursors and manually screening the output using a slightly different set of criteria. But we have also started writing scripts to do this. We recently discussed that paper in connection with this, but we haven't figured out how to account for the one-nucleotide variation in miRNA/miRNA*.

Would be happy to share some code or collaborate on this if you are interested.

underasail commented 5 years ago

Since I've been relying on the output from miR-PREFeR, my work has been fairly basic and may not offer you much. My current workflow is to run miR-PREFeR (usually with a depth of 0.25 reads per million reads mapped), then run this script in the "readmapping" directory to find the precursors with appropriate 1-nt variant mapping percentages. I then filter the precursor secondary structure file down to only those precursors with >= 75% of 1-nt variants mapping appropriately, and pass that file to VARNA with another script to produce figures like this one (precursor_3), which I use to manually annotate appropriate structures.
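For readers who want to try something similar, here is a minimal Python sketch of the ">= 75%" filtering step. It is not the script mentioned in the comment above. It assumes the read-mapping text format shown earlier in this thread (one read per line with `depth=`, `length=` and an optional `[mature]`/`[star]` tag), and it assumes that "mapping appropriately" means a read starts and ends within 1 nt of the tagged mature or star read. The function names and the `tolerance` parameter are made up for illustration.

```python
import re

# One read line, e.g. "......TTCGAT...... depth=9, length=18 [mature]".
# Header, sequence and dot-bracket lines have no "depth=" and do not match.
READ_RE = re.compile(
    r"^(?P<aln>\S+)\s+depth=(?P<depth>\d+), length=\d+"
    r"(?:\s+\[(?P<label>mature|star)\])?"
)


def parse_read_line(line):
    """Return (start, end, depth, label) for a read line, else None.

    start/end are 0-based, end-exclusive positions on the precursor.
    Read bases are upper case; the '.', 'm' and 's' padding is not.
    """
    m = READ_RE.match(line.strip())
    if m is None:
        return None
    seq = re.search(r"[ACGTUN]+", m.group("aln"))
    if seq is None:
        return None
    return seq.start(), seq.end(), int(m.group("depth")), m.group("label")


def duplex_precision(lines, tolerance=1):
    """Fraction of mapped reads lying on the mature/star positions.

    Assumption: a read counts if both of its ends are within
    `tolerance` nt of the ends of a [mature] or [star] read.
    The denominator is the sum of all read depths in the block.
    """
    reads = [r for r in map(parse_read_line, lines) if r is not None]
    anchors = [(s, e) for s, e, _, label in reads if label]
    total = sum(depth for _, _, depth, _ in reads)
    if not anchors or total == 0:
        return 0.0
    on_duplex = sum(
        depth
        for s, e, depth, _ in reads
        if any(abs(s - a) <= tolerance and abs(e - b) <= tolerance
               for a, b in anchors)
    )
    return on_duplex / total


# Tiny made-up block (not real data) to show the call.
demo = [
    "mmACGTACGTmmmmmmmmmm depth=6, length=8 [mature]",
    "...CGTACGTA......... depth=2, length=8",  # 1-nt shifted variant
    "......ACGTACGT...... depth=2, length=8",  # elsewhere on the hairpin
]
print(f"{duplex_precision(demo):.2f}")  # prints 0.80
```

A precursor would then be kept only if `duplex_precision(...) >= 0.75`. Whether the paper's precision criterion is meant to be computed in exactly this way should be checked against the paper itself.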

jonbra commented 5 years ago

Thanks for sharing! VARNA was new to me. Looks great!

hangelwen / miR-PREFeR

Too many miRNAs? #3