ShortStack may be incorrectly distinguishing between mature and star miRNA sequences

shedurkin commented 3 months ago

I was taking a look at database matches among the three species and noticed something odd. All three species have miRNAs with database matches to publications of mir-2030, but that was not one of the miRNAs identified as conserved across the three species. I compared the mature miRNA sequences for these loci and, while the Apul mir-2030 and Peve mir-2030 mature sequences were identical, the Pmea mir-2030 mature sequence was very different. How, then, did all three of these sequences have database matches to the same known miRNA (mir-2030)?

I finally realized that the Pmea mir-2030 star sequence was identical to the Apul and Peve mir-2030 mature sequences. This raises the question -- how confident can we be that ShortStack is correctly distinguishing the mature and star sequences? I'm not actually sure how it makes this distinction, and I haven't had the chance yet to search for other instances of this happening.

Screenshot (248) (link to above spreadsheet)

If anyone wants to look into this, the links to shortstack outputs are below Apul: https://github.com/urol-e5/deep-dive/blob/main/D-Apul/output/13.2.1-Apul-sRNAseq-ShortStack-31bp-fastp-merged-cnidarian_miRBase/ShortStack_out Peve: https://github.com/urol-e5/deep-dive/blob/main/E-Peve/output/08.2-Peve-sRNAseq-ShortStack-31bp-fastp-merged/ShortStack_out Pmea: https://github.com/urol-e5/deep-dive/blob/main/F-Pmea/output/13.2.1-Pmea-sRNAseq-ShortStack-31bp-fastp-merged-cnidarian_miRBase/ShortStack_out

JillAshey commented 3 months ago

Hmm interesting...I'm looking through the Axtell 2013 ShortStack paper and here's their workflow:

And here are my notes from that paper when we first looked at it: Workflow for short stack

Phase 1: Islands of significant alignment coverage are IDed (islands are defined as continuous genomic coordinates where the depth of alignment coverage exceeds a user-set minimum threshold option –mindepth). Initial islands are padded upstream and downstream by a user-set number of nucleotides (option –pad). Regions that overlap after padding are then merged to form small RNA clusters
Phase 2: Filters small RNAs of interest based on user-set size range (default 20-24 nt). Loci where the number of RNAs within the user-set size range meets a minimum threshold (default = 0.8) are annotated as ‘Dicer-derived’ and others are annotated as not Dicer-derived (Dicer is an enzyme that cleaves pre-miRNA into miRNA. Unsure of what it means to say a small RNA is ‘Dicer-derived’). The predominant small RNA size of each cluster is reported as the ‘DicerCall’
Phase 3: Retrieval of genomic sequences for RNA folding
Phase 4: Identify qualifying hairpin structures. Loci unlikely to be hairpin or miRNA associated are excluded from analysis, as well as clusters annotated as non-Dicer derived and highly repetitive clusters. Potential smRNA-associated hairpins are filtered to retain only those that meet a minimal structural criteria. ShortStack has two preset “miRTypes” for the animal and plant kingdoms, which increase specificity by excluding structures with unusual characteristics (what qualifies as an unusual characteristic?). Animal miRType excludes 15% of all annotated miRBase structures for metazoans and 6% of all known viral miRBase structures. Plant miRType would exclude 12% of all known miRBase structures from Viridiplantae
Phase 5: Annotation of hairpin-associated and miRNA. Annotation requires structural and expression-based criteria. Annotation of a locus as a miRNA indicates there is sufficient evidence to justify annotation without assistance from prior knowledge (ie prior annotation as a miRNA for conservation in another species). Hairpins with strong evidence of causality with respect to observed small RNA patterns are annotated as either hairpin-derived (HP) or miRNA loci
Phase 6: Reports produced including summary file, alignment details for HP or miRNA clusters and a .bed formatted file

My guess is that something weird is going on in Phase 4 or 5? I'll continue looking into it.

I was also looking at the Shortstack code using the merged reads w/ the cnidarian miRNA database. When I look at the venn diagrams in the Apul code and the Pmea code, the database miRNA numbers don't seem to match up. In Apul, the database miRNAs are 68 and in Pmea, the database miRNAs are 29. The venn diagram for the Peve code has 36 database miRNAs. Is there some difference between the databases that are being used? Or am I just bad at counting?

kubu4 commented 3 months ago

If anyone wants to look into this,

It would be helpful if you provided links to the code you used to find these matches.

Also, are you able to edit your post and provide an updated image? Currently, when I click on it to enlarge it (currently, it's too small for me to read it all), it throws an error saying the image can't be found.

I finally realized that the Pmea mir-2030 star sequence was identical to the Apul and Peve mir-2030 mature sequences.

Why would this be a concern?

shedurkin commented 3 months ago

It would be helpful if you provided links to the code you used to find these matches.

I was just manually playing around with the shortstack outputs in Excel

Also, are you able to edit your post and provide an updated image? Currently, when I click on it to enlarge it (currently, it's too small for me to read it all), it throws an error saying the image can't be found.

Just added a link to the spreadsheet I screen-shotted

I finally realized that the Pmea mir-2030 star sequence was identical to the Apul and Peve mir-2030 mature sequences.

Why would this be a concern?

I think incorrect identification of star/mature sequences would a) mean I'd need to redo identification of conserved miRNAs across species, and b) be something we may need to mention in the paper

sr320 commented 3 months ago

Let's put a hold on this and get Methods and Results written.. then we can loop back if deemed necessary.

shedurkin commented 3 months ago

In Apul, the database miRNAs are 68 and in Pmea, the database miRNAs are 29. The venn diagram for the Peve code has 36 database miRNAs. Is there some difference between the databases that are being used? Or am I just bad at counting?

Well those venn diagrams are showing the overlap between sequences identfied by ShortStack as miRNAs and those that have matches in the reference cnidarian+mirbase database. I've looked manually at the ShortStack output results.txt files from each species and confirmed the numbers shown in the venn diagrams are correct. Our interpretation of the species having different numbers of database matches was that the species are not equally represented in the cnidarian+mirbase database. That is, Acropora is more well-studied, so it probably has more known miRNAs that have been published/annotated.

With that being said, I did go back and double check what reference database was used in each of the three files you linked. Apul and Peve both use the file cnidarian-mirbase-mature-v22.1.fasta (See line 125 in Apul, line 113 in Peve), but the Pmea code uses a file called cnidarian_miRNAs.fasta (line 121 of Pmea code). I'm not familiar with the cnidarian_miRNAs.fasta file, but it looks like it's just the cnidarian miRNAs that Jill found, without the addition of miRBase sequences. I'm pretty sure the Pmea ShortStack is one of the analyses you ran @kubu4, based on the commit history. Is the use of cnidarian_miRNAs.fasta as a reference just a mistake, or is there a reason we should be using that file instead of cnidarian-mirbase-mature-v22.1.fasta?

shedurkin commented 3 months ago

oops, sorry Steven, I didn't refresh my page before posting

kubu4 commented 3 months ago

reference just a mistake

Yep. Will need to be re-run.

kubu4 commented 3 months ago

I think incorrect identification of star/mature sequences

What makes you think it's incorrect?

What is a star sequence and what is it's role in miRNA?

shedurkin commented 3 months ago

miRNA-biogenisis

Basically, the precursor miRNA sequence is broken down into an miRNA duplex, comprised of two strands. The strand that is preferentially loaded into the AGO protein complex to form RISC (RNA-induced silencing complex) is called the mature miRNA, while the other strand (which generally just gets broken down) is called the star miRNA (or miRNA*). Since the mature miRNA is the one which primarily ends up guiding gene expression, it's the one that miRNA identification/analysis focuses on. (I have a slightly more detailed summary of miRNA biogenisis in my lab notebook)

However, it isn't clear to me how ShortStack is determining which strand of the miRNA duplex would be prefferentially loaded to the RISC complex (aka which would be the mature miRNA).

shedurkin commented 3 months ago

Just found this in the ShortStack documentation:

"This is a FASTA formatted file containing hairpin, mature miRNA, and miRNA* sequences derived from ShortStack's identification of MIRNA loci. These are genomic sequences, and the genomic coordinates are noted in the FASTA header. ShortStack's determination of mature miRNA vs. miRNA* strands is based on abundance of alignments at that particular locus. These designations may not always be accurate for an entire MIRNA family .. sometimes one paralog can attract most of the true mature miRNA alignments, leaving the other paralogs with mostly true miRNA* alignments. Take care when performing annotations."

shedurkin commented 3 months ago

reference just a mistake

Yep. Will need to be re-run.

do you want to rerun this Sam, or should I?

kubu4 commented 3 months ago

I'll go ahead and take care of it.

kubu4 commented 3 months ago

Re-running now. @shedurkin - Thanks for catching this, btw!!

kubu4 commented 3 months ago

@shedurkin - I've re-run. Please glance through results and make sure new results/figures are added/adjusted in manuscript.

shedurkin commented 3 months ago

Out of curiosity I tried rerunning the conserved-miRNA analysis with both the mature and star sequences, and it looks like the first example I identified (miR-2030) is probably the only instance of an miRNA star being identical to other mature miRNAs in our read data.

ShortStack bases its mature/star classification on read counts because, generally, the unutilized miRNA-star strands get quickly broken down by cell machinery. However, I found a paper identifying several instances of miRNA-star reads outnumbering associated mature miRNA reads, so that assumption definitely doesn't hold 100%. For miRNAs with database matches we could use the known mature miRNA sequences to double check ShortStack's mature/star classification, but for novel miRNAs I think we just have to accept the ShortStack results.

As for what to do for the deep-dive paper, I'll leave that up to you @JillAshey . We could revise the count of conserved miRNAs to include miR2030 or keep the results as they are right now, and we could either include a discussion of possible mature/star misidentification or just accept the ShortStack assumptions. Let me know what you think!

urol-e5 / deep-dive

ShortStack may be incorrectly distinguishing between mature and star miRNA sequences #49