Start and end on miRNA paralogs

xbdr86 commented 6 years ago

Hi! I have a question for the miRTop community.

How would you define the "precursor start/end" for reads that can be assigned to paralogs (about 15% of described miRNAs have multiple copies with the exact same mature sequence)?

column4/5: start/end: precursor start/end as indicated by alignment tool

lpantano commented 6 years ago

Hi,

Thanks for the question.

I would say that if the sequence maps exactly to different precursors that share the same mature miRNA, you can either pick one paralog and use its positions, or write multiple lines, one for each paralog. Use the attributes Parent and Name to give more information.

Does that make sense?

If not, can you give a specific example with sequences and numbers? That way it is easier to be on the same page.

Cheers


xbdr86 commented 6 years ago

Hi @lpantano!

Thanks for your fast response!

For instance, I was thinking of mir-9, which I have been working on recently. This mature miRNA can arise from 3 different paralogs: mir-9-1 (http://www.mirbase.org/cgi-bin/mirna_entry.pl?acc=MI0000466), mir-9-2 (http://www.mirbase.org/cgi-bin/mirna_entry.pl?acc=MI0000467), and mir-9-3 (http://www.mirbase.org/cgi-bin/mirna_entry.pl?acc=MI0000468).

This is an extremely abundant miRNA in brain, and so it generates hundreds of 3' isomiRs. Interestingly, when the paralogs are studied separately, only one of them (paper coming soon, hopefully!) generates a 5' isomiR of functional importance (Tan et al. NAR 2014). I think an annotation system that assigns this 5' isomiR to every paralog could be misleading for future interpretations of the data. So far, in our custom program QuagmiR (https://github.com/Gu-Lab-RBL-NCI/QuagmiR/), we have been annotating all miR-9 reads under the following naming structure:

hsa-miR-9-5p-1-2-3
hsa-miR-9-3p-1-2-3

On the practical side, annotating each read under multiple gene locations would generate a significant amount of duplicated data in the GFF file, although I don't see an easy way to deal with columns 1, 3, 4.

Have a nice day!

lpantano commented 6 years ago

Thank you for the example!

In the case of the 5' isomiR, for sure, you only give one parent: the one it is coming from. What I do in my tool is report only one of them when the match is perfect to more than one precursor, because I am interested in the miRNA itself and not the parent.

I totally get your point about increasing the redundancy of the GFF, although the mirtop code could handle this redundancy.

What we talked about some time ago was having another attribute to add multiple parents. So ideally, Parent is used for the representative precursor and the other attribute is used to add the rest. But we never came up with a name.

According to the original GFF3 format, Parent can hold multiple parents, separated with a ','. For instance: Parent hsa-miR-9-5p-1,hsa-miR-9-5p-2,hsa-miR-9-5p-3

That should be valid, what do you think?

However, in the case you describe, the 5' isomiR should have only one Parent, not all 3 of them, even if the 3' isomiRs have all 3. Does that make sense?
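A minimal sketch of how a tool might read a comma-separated Parent value. It assumes the attribute column is written as `Key value;` pairs and uses the mir-9 precursor names from this thread purely as illustration; `parse_attributes` and `parents` are hypothetical helper names, not part of mirtop.

```python
def parse_attributes(column9):
    """Split a 'Key value; Key value;' attribute string into a dict.

    The 'Key value;' layout is an assumption made for this sketch.
    """
    attrs = {}
    for field in column9.split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after the final ';'
        key, _, value = field.partition(" ")
        attrs[key] = value.strip()
    return attrs


def parents(attrs):
    """Return the Parent attribute as a list, splitting on ','."""
    return [p.strip() for p in attrs.get("Parent", "").split(",") if p.strip()]


# A 3' isomiR compatible with all three paralogs, and a 5' isomiR with one.
shared = parse_attributes("Name hsa-miR-9-5p; Parent hsa-mir-9-1,hsa-mir-9-2,hsa-mir-9-3;")
unique = parse_attributes("Name hsa-miR-9-5p; Parent hsa-mir-9-2;")
print(parents(shared))
print(parents(unique))
```

With this reading, a single-parent line and a multi-parent line go through the same code path, so downstream tools do not need a special case for paralogs.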

Thanks!


xbdr86 commented 6 years ago

Hi!

Yes, you are right, the issue of which parent pri-miRNA to assign is quite important for us. Do you think it might work to arbitrarily assign reads that can belong to multiple parents to paralog-1, and indicate in the attributes that that particular sequence has, let's say, 3 paralogs? And to assign any read that can be uniquely mapped to one of the paralogs to the corresponding parent?

For example, given the following parents for miR-7-5p:

>hsa-mir-7-1 MI0000263
UUGGAUGUUGGCCUAGUUCUGUG_UGGAAGACUAGUGAUUUUGUUGUU_**UUU**AGAUAACUAAAUCGACAACAAAUCACAGUCUGCCAUAUGGCACAGGCCAUGCCUCUACAG 

>hsa-mir-7-2 MI0000264
CUGGAUACAGAGUGGACCGGCUGGCCCCAUC_UGGAAGACUAGUGAUUUUGUUGUU_**GUC**UUACUGCGCUCAACAACAAAUCCCAGUCUACCUAAUGGUGCCAGCCAUCGCA

>hsa-mir-7-3 MI0000265
AGAUUAGAGUGGCUGUGGUCUAGUGCUGUG_UGGAAGACUAGUGAUUUUGUUGUU_**CUG**AUGUACUACGACAACAAGUCACAGCCGGCCUCAUAGCGCAGACUCCCUUCGAC

I would present the following reads in the GFF like this:

_UGGAAGACUAGUGAUUUUGUUGUU_ hsa-miR-7-1 READ_COUNT=1000 NUMBER_OF_PARALOGS=3
_UGGAAGACUAGUGAUUUUGUUGUU_**UUU** hsa-miR-7-1 READ_COUNT=1000 NUMBER_OF_PARALOGS=1
_UGGAAGACUAGUGAUUUUGUUGUU_**GUC** hsa-miR-7-2 READ_COUNT=1000 NUMBER_OF_PARALOGS=1
_UGGAAGACUAGUGAUUUUGUUGUU_**CUG** hsa-miR-7-3 READ_COUNT=1000 NUMBER_OF_PARALOGS=1

_canonical-sequence_
**templated-tail**
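The assignment rule in this example can be sketched as an exact substring search of each read against each precursor. This is only an illustration under stated assumptions: `locate_read` is a hypothetical helper, matching is exact (no mismatches or non-templated additions), and start/end are reported as 1-based inclusive positions on the precursor.

```python
MATURE = "UGGAAGACUAGUGAUUUUGUUGUU"

# Precursors from the example above, written as
# 5' flank + canonical sequence + templated tail + remaining 3' sequence.
PRECURSORS = {
    "hsa-mir-7-1": "UUGGAUGUUGGCCUAGUUCUGUG" + MATURE + "UUU"
    + "AGAUAACUAAAUCGACAACAAAUCACAGUCUGCCAUAUGGCACAGGCCAUGCCUCUACAG",
    "hsa-mir-7-2": "CUGGAUACAGAGUGGACCGGCUGGCCCCAUC" + MATURE + "GUC"
    + "UUACUGCGCUCAACAACAAAUCCCAGUCUACCUAAUGGUGCCAGCCAUCGCA",
    "hsa-mir-7-3": "AGAUUAGAGUGGCUGUGGUCUAGUGCUGUG" + MATURE + "CUG"
    + "AUGUACUACGACAACAAGUCACAGCCGGCCUCAUAGCGCAGACUCCCUUCGAC",
}


def locate_read(read, precursors):
    """Return (precursor, start, end) for each precursor containing the read.

    Only the first exact occurrence per precursor is reported; start and end
    are 1-based and inclusive (an assumption about the coordinate convention).
    """
    hits = []
    for name in sorted(precursors):
        pos = precursors[name].find(read)
        if pos != -1:
            hits.append((name, pos + 1, pos + len(read)))
    return hits


for read in (MATURE, MATURE + "UUU", MATURE + "GUC", MATURE + "CUG"):
    hits = locate_read(read, PRECURSORS)
    # The number of hits plays the role of NUMBER_OF_PARALOGS in the example.
    print(read, len(hits), hits)
```

Note that the start and end differ between paralogs even for the shared canonical read, which is why a single line per read cannot carry precursor coordinates for all compatible parents.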

PS: Sorry for the long delay in my response, I missed the notification e-mail from GitHub.

lpantano commented 6 years ago

Hi,

No worries. I think it is better to name the other paralogs, in case some tool wants to do something with that information. I am happy to have another attribute, other_parents, with the names separated by ','. I am happy to have the number of paralogs as well. Let me know and I'll add that to the definition file on GitHub. You also have the Hits attribute, which you can use for this information: https://github.com/miRTop/incubator/blob/master/format/definition.md
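A rough sketch of what such a line could carry, assuming the proposal lands as discussed here: Parent holds one representative precursor, a second attribute lists the remaining ones, and Hits holds the number of compatible precursors. The attribute name `other_parents` is only the working name from this comment and is not in the definition file; `paralog_attributes` is a hypothetical helper.

```python
def paralog_attributes(name, compatible_precursors):
    """Build a 'Key value;' attribute string for one read.

    The first precursor is used as the representative Parent; the rest go
    into 'other_parents' (a proposed, not yet defined, attribute name).
    """
    if not compatible_precursors:
        raise ValueError("a read needs at least one compatible precursor")
    representative, *others = compatible_precursors
    fields = [("Name", name), ("Parent", representative)]
    if others:
        fields.append(("other_parents", ",".join(others)))
    fields.append(("Hits", str(len(compatible_precursors))))
    return " ".join(f"{key} {value};" for key, value in fields)


print(paralog_attributes("hsa-miR-7-5p", ["hsa-mir-7-1", "hsa-mir-7-2", "hsa-mir-7-3"]))
print(paralog_attributes("hsa-miR-7-5p", ["hsa-mir-7-2"]))
```

Keeping the full list of names, rather than only a count, lets a downstream tool redistribute or filter reads by locus later.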

Let me know if that helps. Thanks for working on this!

ThomasDesvignes commented 6 years ago

Hi, I agree with Lorena. In my case I am really interested in knowing from which gene/locus a mature sequence can originate, because the regulatory elements for each locus may be different and therefore each locus may be involved differently in various situations (I know that, for example, in fish one locus is more expressed than the other in some tissues, while in other tissues the other locus is the most expressed, and that matters to me). Therefore, I don't like the idea of arbitrarily attributing a sequence to a paralog and I prefer conserving the complete information. Especially since in some cases we have isomiRs that are slightly longer, and then we can know with confidence that they come from only one of the paralogs.

phillipeloher commented 6 years ago

Internally we favor reporting everything so that nothing is missed. Below are some illustrative examples; I picked some random sequences to illustrate the point.

Example 1 shows that in 3 hairpins the 3p end of the isomiR differs by 1nt from the annotated mature. But on one of the hairpins, for the same sequence, the 3p end differs by 2nt (in the opposite direction) from the annotated mature.

Example 2 shows a sequence that could come from 5 different precursors.

Example 1: isomiR sequence TGGGGCGGAGCTTCCGGAGGC with possible locations:
MIMAT0015058_2&hsa-miR-3180-3p&offsets|0|-1
MIMAT0018178_1&hsa-miR-3180&offsets|0|+2
MIMAT0015058_1&hsa-miR-3180-3p&offsets|0|-1
MIMAT0015058&hsa-miR-3180-3p&offsets|0|-1

Example 2: isomiR sequence CTCTAGAGGGAAGCACTTTCT with possible locations:
MIMAT0002845_1&hsa-miR-526a-5p&offsets|0|-1
MIMAT0002845&hsa-miR-526a-5p&offsets|0|-1
MIMAT0002841&hsa-miR-518f-5p&offsets|0|-1
MIMAT0005456&hsa-miR-518d-5p&offsets|0|-1
MIMAT0005455&hsa-miR-520c-5p&offsets|0|-1
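The location strings in Examples 1 and 2 can be split into fields with a few lines of code. This is a sketch under assumptions: the three `&`-separated parts are read as accession (with an optional copy suffix), mature name, and an `offsets|5p|3p` block, which matches the description of Example 1 but is not a documented format; `Location` and `parse_location` are hypothetical names.

```python
from collections import Counter
from typing import NamedTuple


class Location(NamedTuple):
    accession: str   # e.g. MIMAT0015058_2 (suffix presumably marks the copy)
    name: str        # annotated mature name
    offset_5p: int   # assumed: offset at the 5p end
    offset_3p: int   # assumed: offset at the 3p end


def parse_location(text):
    """Parse 'ACCESSION&NAME&offsets|5p|3p' into a Location."""
    accession, name, offsets = text.split("&")
    label, off5, off3 = offsets.split("|")
    if label != "offsets":
        raise ValueError(f"unexpected offsets block: {offsets!r}")
    # int() accepts an explicit sign, so '+2' and '-1' both parse.
    return Location(accession, name, int(off5), int(off3))


example_1 = [
    "MIMAT0015058_2&hsa-miR-3180-3p&offsets|0|-1",
    "MIMAT0018178_1&hsa-miR-3180&offsets|0|+2",
    "MIMAT0015058_1&hsa-miR-3180-3p&offsets|0|-1",
    "MIMAT0015058&hsa-miR-3180-3p&offsets|0|-1",
]
locations = [parse_location(item) for item in example_1]

# Count how many locations share the same name and offsets.
summary = Counter((loc.name, loc.offset_5p, loc.offset_3p) for loc in locations)
for key, count in summary.items():
    print(key, count)
```

Grouping by name and offsets makes the point of Example 1 explicit: the same read gets two different isomiR descriptions depending on which hairpin it is assigned to.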

Some other things to consider:

- We like to report the sequence (and/or license plate) of an isomiR to help avoid any confusion when annotation entries (e.g. miRBase) or assemblies change.
- Not all isomiRs (generated from a precursor) overlap with an annotated mature. In this case, reporting it based on coordinates and/or precursor offsets is often helpful.
- Reporting offsets like in the above example is OK (e.g. people are used to it) but it indirectly implies that the annotated mature is the most abundant or correct one. This is often not the case. Also, annotations in things like miRBase vs miRCarta don't always match in this regard.

xbdr86 commented 6 years ago

Thanks @lpantano @ThomasDesvignes @phillipeloher ! I will take into account your suggestions! ;-)
