Run new miRNA pipeline on evermanni data

sr320 commented 1 year ago

Based on https://royalsocietypublishing.org/doi/pdf/10.1098/rstb.2020.0165

in short: All sncRNA data were converted into FASTQ format and cutadapt v. 1.18 [54] was used to trim the raw reads, setting the minimal quality threshold to PHRED25, removing adaptor sequences and applying a size range of 18–40 nt.
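As a rough illustration of the size-range step only, here is a minimal Python sketch that keeps reads of 18–40 nt from FASTQ-formatted lines. It does not reproduce cutadapt's adaptor removal or quality trimming; the function names `parse_fastq` and `size_filter` are hypothetical and not part of any tool mentioned in this thread.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    # Strip newlines and skip blank lines so trailing whitespace does not shift records.
    cleaned = (line.rstrip("\n") for line in lines if line.strip())
    # Consuming the same iterator four times per step groups lines into records.
    for header, seq, _plus, qual in zip(cleaned, cleaned, cleaned, cleaned):
        yield header, seq, qual


def size_filter(records: Iterable[FastqRecord],
                min_len: int = 18,
                max_len: int = 40) -> Iterator[FastqRecord]:
    """Keep only reads whose length falls within [min_len, max_len] nt."""
    for header, seq, qual in records:
        if min_len <= len(seq) <= max_len:
            yield header, seq, qual


if __name__ == "__main__":
    # Toy reads (made up for illustration): 17, 18, 40 and 41 nt long.
    toy = []
    for name, n in [("r1", 17), ("r2", 18), ("r3", 40), ("r4", 41)]:
        toy += [f"@{name}", "A" * n, "+", "I" * n]
    for header, seq, _ in size_filter(parse_fastq(toy)):
        print(header, len(seq))
```

In practice the trimming and size selection would be done in a single cutadapt run; the sketch is only meant to make the 18–40 nt window concrete.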

[ ] miRNAs obtained from published studies were BLASTed (blastn) against miRBase and MirGeneDB databases.
[ ] Limited to taxa-specific sncRNA-seq data, miRTrace v.1.0.0 [55] was used to group similar read sequences into clusters, to verify the quality of each dataset, miRNA size distribution and the presence of possible contaminants, namely miRNAs of different lineages.
[ ] MirMiner [22] was applied to identify bona fide miRNAs and to provide a phylogenetic classification of known miRNAs following up-to-date annotation criteria. In detail: (i) the presence of coverage for both arms of the miRNA sequences, (ii) the distance between the mature and star sequences being lower than 40 nt, (iii) the absence of reads mapped in the surroundings of the annotated miRNAs, (iv) 5′ homogeneity of the mature miRNA, (v) 2 nt overhang and (vi) a reduced free energy.
[ ] The genomic position of each bona fide mussel and clam miRNA was localized using blastn.

kubu4 commented 1 year ago

Just a heads up, I'm having difficulty finding the MirMiner software. The original paper describing MirMiner doesn't provide any info on how to obtain the software. Additionally, the paper with the pipeline described above doesn't provide any info on how they obtained it, either.

I'm exploring some other options to try to delve into miRNA prediction (e.g. MirMachine (github repo)). And then perhaps BLAST sRNA-seq data and see if/how many reads map to regions?

sr320 commented 1 year ago

Cristian's lab does a lot of this, primarily with CLC, but here is a blurb from a paper that might offer options / directions.

Comprehensive Transcriptome Analyses in Sea Louse Reveal Novel Delousing Drug Responses Through MicroRNA regulation

A BLAST analysis was performed to discard other non-coding RNAs (short mRNAs, rRNAs, tRNA) using the specific databases of ncRNAs in NCBI, RFam, and Repbase. The tool “Extract and count” in CLC Genomics was used to identify and extract unique miRNA families annotated by BLAST against the available known miRNAs in all arthropod species in miRBase (Griffiths-Jones et al. 2006). The annotation of miRNAs was conducted with the following parameters: additional downstream bases = 2; additional upstream bases = 2; maximum mismatches = 2; missing bases downstream = 2; and missing bases upstream = 2. Novel miRNA prediction was performed using the miRanalyzer software (Hackenberg et al. 2009).

kubu4 commented 1 year ago

Thanks. Have some BLASTs running ATM against the two miRNA databases.

Have miRNA loci predictions from MirMachine already.

I'll start posting more results as I get them.

kubu4 commented 1 year ago

miRTrace results:

1 read (yep, just one read) in sample sRNA-POR-82-S1-TP2 matched to the Insects clade family of miRNAs.

No matches to any clade in any other samples.

kubu4 commented 1 year ago

MirMachine results (predict presence of miRNA families in P.evermanni genome):

83 predicted miRNA loci identified in Porites_evermanni_v1.fa
Those loci represent 15 unique miRNA families.

kubu4 commented 1 year ago

NCBI BLASTn results (against miRBase and MirGeneDB databases):

No matches from any samples in either database.

kubu4 commented 1 year ago

UPDATE:

Currently attempting to run mirdeep2 (GitHub repo - "Discovering known and novel miRNAs from small RNA sequencing data") to see what we can get from that.

kubu4 commented 1 year ago

mirdeep2 stuff is completed. I'll be sifting through the data in a bit. In the meantime, if you want to glance through some of the HTML reports...

https://gannet.fish.washington.edu/Atumefaciens/20230809-peve-miRNA-mirdeep2/result_09_08_2023_t_15_17_52.html

https://gannet.fish.washington.edu/Atumefaciens/20230809-peve-miRNA-mirdeep2/result_09_08_2023_t_18_36_57.html

https://gannet.fish.washington.edu/Atumefaciens/20230809-peve-miRNA-mirdeep2/result_10_08_2023_t_06_23_41.html

Also, there are PDFs which have actual structural representations:

https://gannet.fish.washington.edu/Atumefaciens/20230809-peve-miRNA-mirdeep2/pdfs_09_08_2023_t_15_17_52/

https://gannet.fish.washington.edu/Atumefaciens/20230809-peve-miRNA-mirdeep2/pdfs_09_08_2023_t_18_36_57/

https://gannet.fish.washington.edu/Atumefaciens/20230809-peve-miRNA-mirdeep2/pdfs_10_08_2023_t_06_23_41/

CSVs if you want to look at those, too:

https://gannet.fish.washington.edu/Atumefaciens/20230809-peve-miRNA-mirdeep2/result_09_08_2023_t_15_17_52.csv

https://gannet.fish.washington.edu/Atumefaciens/20230809-peve-miRNA-mirdeep2/result_09_08_2023_t_18_36_57.csv

https://gannet.fish.washington.edu/Atumefaciens/20230809-peve-miRNA-mirdeep2/result_10_08_2023_t_06_23_41.csv

EDITED: Added correct CSV link for last file.

kubu4 commented 1 year ago

mirdeep2 summary:

sample	novel miRNA loci (count)
sRNA-POR-73-S1-TP2	342
sRNA-POR-79-S1-TP2	282
sRNA-POR-82-S1-TP2	262

Mean novel miRNA loci count: 295.3

These are counts of novel miRNA loci with significant randfold p-values.

NOTE: Even within individuals, there are loci which have overlapping coordinates, thus the numbers above are probably higher than they actually should be.
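One way to get a non-redundant count would be to collapse overlapping loci within each sample before counting. Below is a minimal Python sketch of that idea, assuming half-open BED-style coordinates and assuming that loci should only be merged when they share both scaffold and strand (that choice is an assumption, not something mirdeep2 prescribes). The function name `merge_overlapping` and the toy coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Locus = Tuple[str, int, int, str]  # (scaffold, start, end, strand), half-open


def merge_overlapping(loci: Iterable[Locus]) -> List[Locus]:
    """Collapse loci that overlap on the same scaffold and strand."""
    by_key = defaultdict(list)
    for chrom, start, end, strand in loci:
        by_key[(chrom, strand)].append((start, end))

    merged: List[Locus] = []
    for (chrom, strand), intervals in sorted(by_key.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start < cur_end:
                # Overlaps the current block: extend it.
                cur_end = max(cur_end, end)
            else:
                # No overlap: close the current block and start a new one.
                merged.append((chrom, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand))
    return merged


if __name__ == "__main__":
    # Toy loci, made up for illustration only.
    toy = [
        ("scaf1", 100, 160, "+"),
        ("scaf1", 150, 210, "+"),  # overlaps the locus above
        ("scaf1", 150, 210, "-"),  # same span, opposite strand: kept separate
        ("scaf1", 300, 360, "+"),
        ("scaf2", 10, 70, "+"),
    ]
    print(len(toy), "loci before merging,", len(merge_overlapping(toy)), "after")
```

Merging by coordinates alone discards which precursor had the better score, so a real deduplication step might instead keep the highest-scoring locus in each overlapping group.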

Used bedtools to try to get a summary of "canonical" miRNAs identified in the sRNA-seq data/genome:

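# Report each locus from the first run (-a) once (-u) if it overlaps
# a locus in either of the other two runs (-b).
/home/shared/bedtools2/bin/intersectBed \
-a result_09_08_2023_t_15_17_52.bed \
-b result_09_08_2023_t_18_36_57.bed result_10_08_2023_t_06_23_41.bed \
-u \
> results-intersect.bed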

This yielded 183 "canonical" miRNAs. A couple of caveats:

These loci have not been screened for pass/fail randfold p-value.
Overlapping loci still exist and are counted, so this count is higher than the true number.
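One more thing that may be worth checking: as far as I understand the bedtools documentation, `-u` writes each `-a` entry once if it overlaps anything in `-b`, and with several `-b` files an overlap in any one of them is enough. If "canonical" is meant as "present in all three samples", the count could differ. The Python sketch below contrasts the two readings on made-up coordinates; `in_any` and `in_all` are hypothetical helper names, strand is ignored, and coordinates are assumed to be half-open BED-style.

```python
from typing import List, Sequence, Tuple

Locus = Tuple[str, int, int]  # (scaffold, start, end), half-open


def overlaps(a: Locus, b: Locus) -> bool:
    """True if two loci share at least one base on the same scaffold."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def in_any(a_loci: Sequence[Locus], *b_sets: Sequence[Locus]) -> List[Locus]:
    """Keep each A locus once if it overlaps a locus in at least one B set."""
    return [a for a in a_loci
            if any(overlaps(a, b) for b_set in b_sets for b in b_set)]


def in_all(a_loci: Sequence[Locus], *b_sets: Sequence[Locus]) -> List[Locus]:
    """Keep each A locus once only if it overlaps a locus in every B set."""
    return [a for a in a_loci
            if all(any(overlaps(a, b) for b in b_set) for b_set in b_sets)]


if __name__ == "__main__":
    # Toy loci, made up for illustration only.
    sample_a = [("scaf1", 100, 160), ("scaf1", 300, 360), ("scaf2", 10, 70)]
    sample_b = [("scaf1", 110, 170), ("scaf2", 20, 80)]
    sample_c = [("scaf1", 105, 165)]
    print("overlap in any other sample:", len(in_any(sample_a, sample_b, sample_c)))
    print("overlap in all other samples:", len(in_all(sample_a, sample_b, sample_c)))
```

With bedtools itself, the stricter reading could presumably be obtained by chaining two intersections, one per `-b` file, rather than passing both files at once.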

kubu4 commented 1 year ago

Also, with the mirdeep2 analysis, in an effort to get some analysis done quickly, I did not run this with any miRNA database sets. I only ran it for novel miRNA discovery. As noted in the mirdeep2 documentation, utilizing a miRNA database, even with distantly related species, will generally improve discovery.

I'll try to get that aspect of things run today, but each analysis (without database comparisons) takes about 1.5 hrs. So processing the three samples together takes about 4.5 hrs. I'm guessing the database comparisons will increase that run time.

JillAshey commented 1 year ago

@kubu4 could you link the code that you used for the analysis above?

kubu4 commented 1 year ago

I haven't had a chance to really write anything up. However, some of it is semi-documented in this notebook post:

https://robertslab.github.io/sams-notebook/2023/08/01/Daily-Bits-August-2023.html

Go to 20230808 to get to the start of some (most?) of the miRNA analysis.

urol-e5 / deep-dive

Run new miRNA pipeline on evermanni data #27