miRTop / incubator

Where all ideas and discussions happen to lead to new repositories

Proposals #8

Open lpantano opened 7 years ago

lpantano commented 7 years ago

Please, describe below the problem you think we face in the miRNA/isomiR naming.

Try to summarize it in 200 words. The current discussions are here:

Comments on this will be used to propose the solutions for the next step.

Thanks

lpantano commented 7 years ago

I think we need to agree on some rules to define isomiRs. Regardless of whether all of them are functional, I think it would be good to create a TAG that can easily tell you whether there is a modification at the 5' end, at the 3' end, or a nucleotide substitution compared to the canonical sequence. Proposing a format that tools can share will help to improve detection and increase true positives.

Besides that, of course, there is the question of what the canonical miRNA is.
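A minimal sketch of what such a TAG could look like, assuming the read has already been placed on the hairpin without gaps. The function name, the `5p:`/`3p:`/`sub:` syntax, and the sign convention (positive for extension, negative for trimming) are illustrative assumptions, not an agreed format, and the hairpin below is a toy sequence.

```python
def isomir_tag(hairpin, canon_start, canon_end, read_start, read):
    """Describe a read relative to the canonical miRNA on the same hairpin.

    Coordinates are 0-based, end-exclusive positions on the hairpin.
    Hypothetical tag syntax: '5p:+1' means one extra templated base at the
    5' end, '3p:-2' means two bases trimmed at the 3' end, and 'sub:17A>G'
    means position 17 of the read differs from the hairpin (A became G).
    """
    if read_start < 0:
        raise ValueError("read starts before the hairpin")
    read_end = read_start + len(read)
    template = hairpin[read_start:read_end]
    if len(template) != len(read):
        raise ValueError("read extends beyond the hairpin")

    parts = []
    ext5 = canon_start - read_start  # > 0: extended at 5', < 0: trimmed
    ext3 = read_end - canon_end      # > 0: extended at 3', < 0: trimmed
    if ext5:
        parts.append(f"5p:{ext5:+d}")
    if ext3:
        parts.append(f"3p:{ext3:+d}")
    # Ungapped comparison: report each mismatch by its 1-based read position.
    for i, (ref, alt) in enumerate(zip(template, read), start=1):
        if ref != alt:
            parts.append(f"sub:{i}{ref}>{alt}")
    return ";".join(parts) or "canonical"


# Toy hairpin; the canonical product occupies positions 3..25.
HAIRPIN = "GGCTGAGGTAGTAGGTTGTATAGTTTTAGG"
print(isomir_tag(HAIRPIN, 3, 25, 4, "GAGGTAGTAGGTTGTATAGTT"))  # 5p:-1
```

Non-templated additions and indels are deliberately left out of this sketch; a real format would need a way to express them too.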

ivlachos commented 7 years ago

Lpantano is right on target. Regardless of function, being artifacts of the experiment or the analysis pipeline, the observed sequences require characterization.

To me, the field might have irreversibly lost the train of mature miRNA naming (vs genes and lncRNAs that can be named also based on function, apart from their IDs) but we can still provide a meaningful convention for the isomiRs.

The challenges are many: do we start from a blank page in order to identify a de novo solution or do we try to design something close to what is available today for other entities (e.g. gene mutations), in order to minimize the transition overhead?

These and many more should be addressed asap, so that we can at least communicate efficiently with each other... and who knows? it might also be the key to start systematically addressing this neglected part of the field, and to uncover biological meaning under numbers and sequences.

ThomasDesvignes commented 7 years ago

I agree with everything that has been said above.

We can debate the putative functional differences between two isomiRs (seed shift, editing, size differences, etc.), but no matter the end point of that discussion, their existence and recurrent presence in sRNA sequencing analyses call for some sort of action. We tried to raise that specific question in our article about miRNA nomenclature published in 2015 in Trends in Genetics by proposing the use of a "RefSeq isomiR" that is fixed once and for all, and the use of isomiRs defined based on their variations from the RefSeq isomiR.

Concerning the starting point, I would vote in favor of starting from something that is already in use for other systems. Similarly to the way we proposed a miRNA nomenclature system that followed the general gene and gene product guidelines established by nomenclature consortia, I think we should also try to use already existing conventions for the description of isomiRs. Having something that is not in agreement with general gene nomenclature consortium guidelines will undoubtedly lead to those consortia not using this nomenclature. Also, not only will it minimize the transition overhead, it will also treat miRNA genes and their products like any other gene type and keep an overall uniformity in naming systems between classes of genes. I've already proposed several options (based on protein-coding gene mutations) in the isomiR naming discussion page.

One of the big issues I can see right now is the problem of a database website that would hold all the isomiRs (and RefSeq isomiRs, etc., depending on which direction we take, of course), because, for example, miRBase hasn't been updated for a while now (since June 2014) and the up-to-date information is now simply all over the place. That could also be something to think about: what can we do to centralize information and keep it up to date?

BastianFromm commented 7 years ago

As I mentioned to Lorena before, I am very happy to be part of this but I feel that there is a major challenge before we can actually get to the problem of naming isoforms of microRNAs:

Define or agree on an existing definition of what is and what is NOT a microRNA. This is of huge importance for me because it is well defined by biogenesis and evolutionary context what is a bona fide microRNA. We don't want to end up naming microRNAs if they are tRNA fragments, do we? Next - on my list - would be to arrive at what is the canonical product(s) that are then used as a counterpart, or better, the reference of the isoform (truncation, elongations are then measured RELATIVE to the canonical product).

If we agree on this I see that we might want to arrive at a system for isoMirs (of bona fide microRNAs) and a system for other misc_sRNAs-isoforms that would need naming, too.

As pointed out by Thomas, miRBase is effectively discontinued and not suited as a database of isoMirs. In addition, it contains many false sequences and incorrect annotations. We curated the complements of 4 vertebrates for the Annual Reviews paper and are in the process of finishing the inclusion of 20+ representatives of major Metazoan lineages and expression information for all available tissues (note that we show both old and new names in the gene list):

image

Further we have developed detailed reads representation for individual genes (and datasets), too. We keep all information in a modified sam-format (we map collapsed reads) and this is an excellent starting point for also assigning isoMir-names; in fact we had a CIGAR-like system in mind.

image
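To make the two ideas above concrete (collapsed reads and a CIGAR-like description), here is a minimal sketch. It is not the MirGeneDB implementation; the function names are hypothetical, and the comparison is ungapped, borrowing only the `=` (match) and `X` (mismatch) operators from SAM.

```python
from collections import Counter
from itertools import groupby


def collapse_reads(reads):
    """Collapse identical reads into (sequence, count) pairs, most abundant first."""
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def cigar_like(template, read):
    """Run-length encode an ungapped comparison of read against template.

    '=' marks a matching base and 'X' a mismatch; insertions and deletions
    are out of scope for this sketch.
    """
    if len(template) != len(read):
        raise ValueError("sketch only handles ungapped, equal-length comparisons")
    ops = ("=" if t == r else "X" for t, r in zip(template, read))
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


reads = ["TGAGGTAGTAGGTTGTATAGTT"] * 3 + ["TGAGGTAGTAGGTTGTGTAGTT"]
for seq, n in collapse_reads(reads):
    print(seq, n, cigar_like("TGAGGTAGTAGGTTGTATAGTT", seq))
```

Mapping each unique sequence once and carrying its count along keeps the alignment step small even for deep libraries.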

In other words, MirGeneDB.org could certainly be this isoMir repository and could thus be used to promote the common naming system.

As I see it there are two partly overlapping systems for naming microRNAs in place:

  1. miRBase, which names miRNAs consecutively but unfortunately inconsistently, and without differentiating between mature and star/passenger strand.
  2. MirGeneDB, which keeps the old miRBase names (for real miRNAs!) but adds another level of naming that is consistent and uniform for all parts of miRNAs, also outside human (novelty: "p" designation for paralogues). The system applied here can of course be expanded to moRs, loops and isoMirs, and to primary transcripts and their variants if known (!)

image

Using the system proposed by Desvignes et al. in 2015 would mean setting up a third, independent database with yet another system. In my opinion, although it follows Gene Nomenclature Consortia rules, it would not be more widely accepted than ours, because we essentially upgrade and simplify miRBase names while that system changes the rules (e.g. removing the "-" makes animal miRNAs easy to confuse with plant miRNAs). I can also see serious problems if we wanted to name more than 4 species without using species delimiters.

image

ivlachos commented 7 years ago

I would like once more to thank Lorena. The topics raised in just a few hours are crucial and legitimate. From my experience in the field, I would suggest to attempt one step at a time. My 5 cents:

  1. Which microRNA is true or not is far more complex than settling on a useful naming convention. This task will need more resources than this git has at the moment. I would suggest to tackle this after we have more people from the community involved. I'm not yet convinced on the rules to use for this and I believe I'm not the only one.

  2. The nomenclature is crucial, since we definitely need to be able to talk to each other. As Lorena, Thomas and Bastian mentioned above there are currently many options for a naming convention that need to be discussed and compared. I wouldn't jump on the "which is the true miRNA" question, since even for the most widely accepted ones (e.g. let family, mir-1, etc), we see everyday variations that are waiting to be named.

  3. Database, etc: Creating a reference DB, providing it to the community, updating and maintaining it properly are distinct things. I've felt the pain. I would suggest to focus, as this Git signifies, on a community effort. We haven't started yet! If this project is fruitful and we manage to produce something, I believe we will find a way. I think it's better to start discussing using a blank sheet than trying to make things fit in databases that already exist. When (and if) we have a product, a new database (institutionalized? community-based? etc.) is one option and an existing one (e.g. mirGeneDB, miRBase) is another; we could also contact RNA Central or other ncRNA hubs for support, and so forth.

I'm here because of a call for miRNA people to address a specific task. Let's stretch our wings with the isomiR naming convention and if we manage to do this, we can then attempt to fly.

lpantano commented 7 years ago

Hi all,

thanks for this awesome brainstorming. I think it is great that all these issues are coming up, because then we can prioritize. I am waiting for some other people to chime in, since they may have a different view. I think this is going great, and this is what I was hoping would happen, so we can work toward the same goal in the future.

I will give my two cents on the isomiR format: I think VCF files are a good fit. We could put the CIGAR string (or something similar) in the alternative allele field, the read counts in the genotype fields (where you have one column per sample), and all the names we want in the INFO field. This should be easy to parse and easy to create, and there are many tools out there.
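A rough sketch of what one such record could look like, using the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample). Everything specific here is an assumption for illustration: the `RC` format key, the `NAME` INFO key, and the use of a CIGAR-like string in ALT, which bends the VCF specification since ALT normally holds bases or symbolic alleles.

```python
def vcf_like_line(hairpin, pos, canonical, variant, names, counts):
    """Build one tab-separated, VCF-style record for an isomiR.

    hairpin   -> CHROM column (the precursor the read maps to)
    pos       -> POS column (1-based start on the hairpin)
    canonical -> REF column (canonical sequence)
    variant   -> ALT column (CIGAR-like description, hypothetical usage)
    names     -> INFO column as key=value pairs
    counts    -> one read count per sample, under a hypothetical 'RC' key
    """
    info = ";".join(f"{key}={value}" for key, value in names.items())
    fields = [hairpin, str(pos), ".", canonical, variant, ".", ".", info, "RC"]
    fields += [str(count) for count in counts]
    return "\t".join(fields)


print(vcf_like_line(
    "hsa-let-7a-1", 4, "TGAGGTAGTAGGTTGTATAGTT", "16=1X5=",
    {"NAME": "example-isomiR"}, [120, 0, 37],
))
```

Keeping the counts in per-sample columns means existing VCF readers could at least split the record, even if they do not understand the ALT content.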

Let's be open to all ideas, and spend some days really thinking about what everyone here has said!

Thanks a lot for your participation!

BastianFromm commented 7 years ago

Ivlachos. What is your real identity?

lpantano commented 7 years ago

Hi Bastian,

He is Ioannis Vlachos, Ph.D., a DIANA tools developer currently working in Boston in a neurology department (I think). He is super interested in isomiRs.

Can you both chime in here with your full name and affiliation so I can add you to the main page? That way we all know each other.

https://github.com/miRTop/miRTOP.github.io/issues/1

Thanks!

ivlachos commented 7 years ago

Thanks Lorena for the e-introduction!

My affiliation is: Ioannis Vlachos, PhD, Brigham & Women's Hospital, Broad Institute of MIT and Harvard, Harvard Medical School

I thought the only credential I needed was to be interested in the project :)

You can check out my previous works in https://scholar.google.gr/citations?user=mhRFBnEAAAAJ&hl=en

BastianFromm commented 7 years ago

Wow. Cool! Will add my details tomorrow but see my signature below...

ThomasDesvignes commented 7 years ago

Hi! This is great! Thanks for providing input! Lorena and I were feeling a bit lonely for a while :)

I totally agree with Ioannis that what makes a "real miRNA" is another topic. Everyone has a different vision because of different interests. Some see miRNAs as evolutionary entities and care a lot about their pure genetic nature and biogenesis, some see them as functional products and don't care that much about how they were made but more about what they do, and others see them in a totally different way. I've created a new Issue about "What is a miRNA?" Some ideas can be tossed into that new bag! https://github.com/miRTop/incubator/issues/9

That being said, to establish an isomiR nomenclature system, we need a miRNA nomenclature system to use as a foundation. The definition of what a miRNA is should not prevent us from finding a system that works for everyone. Given that I, and other people from the mouse and zebrafish gene nomenclature consortia, have personally proposed a nomenclature system, I’ll let other people comment and propose before chiming in (let’s say I’m a bit biased!). Comments can be made here: https://github.com/miRTop/incubator/issues/4

Nonetheless, to build an isomiR system on top of the miRNA system we’ll eventually choose, we also need to agree on which modifications we want the name to convey. What are the existing modifications? Some preliminary ideas have been laid out here and are awaiting comments :) https://github.com/miRTop/incubator/issues/1

Cheers!

mhalushka commented 7 years ago

I think it's a great idea to formalize a naming convention for isomiRs. I would add, though, that if you have deep RNA-seq data (20 million + reads), for abundant miRNAs (let-7b, miR-21-5p, etc) you could have 1000+ different species of isomiR detected depending on your analysis tool. This includes many singleton sequences with internal edits, suggestive of sequencing error. But it also includes the full range of non-template extensions on different length templated "starting" miRNAs. At what level do we stop trying to name them all and need to have a grab bag of 'other?' How can we cluster "like" isomiRs by biological function?

Marc Halushka MD, PhD Associate Professor Department of Pathology Johns Hopkins University SOM
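One possible way to picture the grab bag of 'other' raised above is a simple abundance cutoff: isomiRs below a read threshold are pooled instead of named individually. This is only a sketch; the threshold of 10 reads, the function name, and the tag strings are arbitrary assumptions, and it does not address clustering by biological function.

```python
def bin_isomirs(tag_counts, min_reads=10):
    """Pool isomiRs with fewer than min_reads reads into a single 'other' bin.

    tag_counts maps an isomiR label to its read count in one sample.
    """
    binned = {}
    for tag, reads in tag_counts.items():
        key = tag if reads >= min_reads else "other"
        binned[key] = binned.get(key, 0) + reads
    return binned


sample = {"canonical": 500, "3p:+1": 40, "sub:17A>G": 1, "sub:3A>C": 1}
print(bin_isomirs(sample))  # {'canonical': 500, '3p:+1': 40, 'other': 2}
```

A per-sample cutoff like this would still need a rule for isomiRs that pass in one sample and fail in another.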

mlhack commented 7 years ago

Hello everybody,

Sorry for stepping in late. Before stating my points below, I would like to comment on some issues already discussed.

i) I also think that we should start with the nomenclature of microRNAs, leaving the question ‘what is a true microRNA?’ to a later time point. The isomiR nomenclature can only be meaningfully set up downstream of the miRNA naming. The connection point between miRNA and isomiR annotation/naming is the canonical sequence. One of the first questions is therefore: should we always annotate only one canonical sequence per miRNA gene, or should the annotation system take into account that sometimes the canonical sequence (i.e. the most abundant one) can vary between tissues? This obviously has a direct consequence on the isomiR naming.

ii) For me it is important to maintain the short species names. Probably I need to rethink, but right now I don’t think that we have to fit to the nomenclature of gene symbols (upper/lower letters). This is also due to practical considerations when profiling the expression values of microRNAs. The short names make it easy to extract a subset of sequences out of a database (for example those known in a certain species, or all sequences that should be used to detect putative homologs, i.e. a set of species clearly defined by its short species names).

In order to set up a functional and coherent annotation system, I think that we should first ask ourselves: what should a microRNA annotation nomenclature accomplish?

Here are some points that I consider important (probably already mentioned in some of the discussion threads):

1) Correspondence between mature and pre-microRNA names: Perfect correspondence between mature and hairpin names, i.e. if I have the name of the hairpin --> I can know the name of the mature sequences and vice-versa (this is not possible right now with miRBase because if multiple copies exist of a microRNA, no coherent nomenclature rules exist in miRBase)
2) Definition of the canonical sequence: should define and name the canonical sequence & point out if it is a constitutive canonical (same sequence in all known tissues) or regulated canonical (depends on the tissue)
3) Guide and passenger strand: If a clear distinction between guide and passenger strand can be made at a functional level, this must be reflected in the naming (with the good old ‘*’ for example)
4) Evolutionary information and family naming (I): The naming should include information about paralogues and homologues (like in miRGeneDB): to achieve this, a (evolutionary family) seed definition is needed.
5) Evolutionary information and family naming (II): If the seed region changes --> the function of the microRNA changes: should the microRNAs that are homologous but having different functions (regulate different genes because they have different seeds) receive the same name?

The problem: A microRNA gene can have

i) Several identical copies in the genome (the hairpin sequences are identical)
ii) Both mature sequences, guide and passenger, are identical, but not the hairpin and/or the pre-miRNA
iii) The guide strand has several copies but with different passenger strands

I think that the copies complicate the naming

The name of the microRNA should:

a) Make it possible to distinguish between (ii) and (iii) in order to accomplish 1) (above list)
b) In case of (i), a genome level annotation should give different names to the identical copies – maybe by adding a copy number after the name like done by UCSC.
c) In case of (i), the fasta annotation should give only one name to the sequence (which occurs at different loci in the genome)
d) In case of (iii): the names of the passenger strand should indicate from which of the ‘copies’ (paralogues?) they are obtained, BUT without being redundant (i.e. without including the same sequence several times in the DB) (important to accomplish point 1)
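Requirements b) and c) above can be sketched together: identical hairpin copies get one non-redundant FASTA entry, while each genomic locus gets its own label. The `_copyN` suffix, the function name, and the toy names and sequences are assumptions for illustration, not a proposed convention.

```python
from collections import defaultdict


def name_copies(loci, base_names):
    """Assign FASTA-level and genome-level names to hairpin loci.

    loci       : iterable of (chrom, start, sequence) tuples
    base_names : mapping from sequence to its base name (hypothetical input)

    Returns (fasta, genome): fasta maps one name to each unique sequence,
    genome maps each (chrom, start) locus to a name, with a copy number
    appended only when the same sequence occurs at several loci.
    """
    by_sequence = defaultdict(list)
    for chrom, start, seq in loci:
        by_sequence[seq].append((chrom, start))

    fasta, genome = {}, {}
    for seq, places in by_sequence.items():
        name = base_names[seq]
        fasta[name] = seq  # one entry per unique sequence, never duplicated
        for i, place in enumerate(sorted(places), start=1):
            genome[place] = name if len(places) == 1 else f"{name}_copy{i}"
    return fasta, genome


loci = [("chr1", 100, "ACGT"), ("chr2", 50, "ACGT"), ("chr3", 10, "TTTT")]
fasta, genome = name_copies(loci, {"ACGT": "mir-x", "TTTT": "mir-y"})
print(fasta)
print(genome)
```

Case (iii), where guide strands are shared but passenger strands differ, would need the passenger name to carry the copy label as well; that is not covered here.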

Other things that need to be fixed / taken into consideration:

A) Length of the hairpin sequences: Right now, in miRBase each hairpin sequence is pre-microRNA + X nt flanking sequences. X can be anything and is not defined by miRBase. This number needs to be fixed (5 nt, 10 nt, 15 nt – whatever).
B) Multiple copy / arm inconsistency: At least in plants, sometimes, one guide sequence can be obtained from several hairpins – but from different arms --> which names should be given to those?? (In plants I observe them often and internally called them ‘zwitter’ microRNAs.) Although I don’t know if they exist in animals, maybe the annotation system should be able to accommodate these things.
C) Several guide sequences from one hairpin: At least in plants, different microRNAs can be obtained from one pri-microRNA, and these can furthermore overlap!!! Like above, I don’t know if they exist in animals, but maybe the annotation system should be able to accommodate these things.
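Point A) amounts to choosing one flank length and applying it everywhere. A minimal sketch, assuming 0-based, end-exclusive pre-miRNA coordinates on a chromosome string; the default of 10 nt is just one of the values mentioned above, not a recommendation, and the function name is hypothetical.

```python
def hairpin_with_flank(chrom_seq, pre_start, pre_end, flank=10):
    """Return the pre-miRNA plus a fixed number of flanking nt on each side.

    The window is clipped at the chromosome ends, so hairpins close to an
    end get a shorter flank on that side.
    """
    start = max(0, pre_start - flank)
    end = min(len(chrom_seq), pre_end + flank)
    return chrom_seq[start:end]


toy_chrom = "A" * 20 + "CCCCCCCC" + "G" * 20
print(hairpin_with_flank(toy_chrom, 20, 28, flank=5))  # AAAAACCCCCCCCGGGGG
```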

Those are my thoughts for now. Thanks to Lorena and Thomas for starting this. I think that it is a very important issue, but one which is much more complicated than it might look at first glance. Best, Michael

ThomasDesvignes commented 7 years ago

Thanks Michael! Those are really important comments, considerations and suggestions that will definitely be helpful to move forward!

TJU-CMC commented 7 years ago

First off, we would like to thank Lorena for taking the lead on this very important and increasingly complicated problem. We have looked at the discussion of the last several days and have attempted to compile a list of outstanding questions. Note that the list is not meant to be complete. For some of the questions, we appended some first thoughts.

We would like to propose the following two steps (perhaps Lorena can help create a "sticky" post?):

The Jefferson Team (Eric Londin, Phillipe Loher, Aris Telonis, Isidore Rigoutsos)

Jefferson Team's Position: strictly speaking, there should be some evidence of Drosha/DGCR8 dependence; practically, however, we will need to consider other options

Jefferson Team's Position: if it is Argonaute-loaded and short (18-24 nts) then it is a microRNA / on the other hand if it is transcribed from a microRNA-locus but Argonaute-loading has not been reported then it is a potential-microRNA

Jefferson Team's Position: strictly speaking, if an isomiR is Argonaute-loaded it is an isoform / if it is transcribed from a microRNA-locus but there is no evidence of Argonaute loading then it should be treated as a "potential-isomiR"
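The length and Argonaute-loading rule stated above for microRNAs can be written down as a small decision function. This is a sketch of that stated position only; the function and argument names are hypothetical, and the "unclassified" fallback is an assumption added here for inputs the position does not cover.

```python
def classify_small_rna(length, ago_loaded, from_mirna_locus):
    """Apply the stated rule: Argonaute-loaded and 18-24 nt means microRNA.

    length           : sequence length in nt
    ago_loaded       : True if Argonaute loading has been reported
    from_mirna_locus : True if the sequence is transcribed from a microRNA locus
    """
    if ago_loaded and 18 <= length <= 24:
        return "microRNA"
    if from_mirna_locus and not ago_loaded:
        return "potential-microRNA"
    return "unclassified"  # assumption: cases the stated position leaves open


print(classify_small_rna(22, ago_loaded=True, from_mirna_locus=True))   # microRNA
print(classify_small_rna(22, ago_loaded=False, from_mirna_locus=True))  # potential-microRNA
```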

Jefferson Team's Position: we believe that legacy names should be grand-fathered considering the thousands of publications in the last ~15 years

Jefferson Team's Position: there is a lot of evidence in the literature that evolutionary conservation is an unnecessarily limiting constraint

Jefferson Team's Position: we have published evidence that isomiRs are tightly linked with a time and a location; being 'canonical' is a historical artifact

Jefferson Team's Position: we do not believe so, because the concept of guide and passenger is tightly linked with a time and a location

Jefferson Team's Position: yes; however, how we generate such labels will be a matter of deliberation.

Jefferson Team's Position: decoupling microRNA labels from a list of monotonically increasing integers (à la miRBase) might be a good idea as it allows flexibility and bypasses the need for brokering; however, what a solution could look like will be a matter of deliberation.

Jefferson Team's Position: decoupling microRNA labels from a list of monotonically increasing integers (à la miRBase) might be a good idea as it allows flexibility and bypasses the need for brokering; however, what a solution could look like will be a matter of deliberation.

Jefferson Team's Position: we think that a brokering approach will likely be untenable in the long run, slow things down, and possibly create animosity

Jefferson Team's Position: we think that this is linked to the concept of a broker and may be untenable in the long run

lpantano commented 7 years ago

Thanks all for chiming in on time! I think this is a wonderful thread and an open discussion. I have already learnt from it, and it made me think about some topics I wasn't paying attention to.

The Jefferson Team made a good summary (thanks). I will create new issues for each of the questions/problems that have arisen here (but I will do it one by one so we make sure we resolve something, and because some of them depend on one another).

Starting next week, I will open the first issue, and we'll try to come to a solution by majority vote. It could happen that in some cases we decide that the question is outside the scope of this group for the moment, so we can move on. There will also be a deadline (probably 1-2 weeks); otherwise, it would be difficult to move on.

Thanks again all, and I am thrilled we can move to the next phase!

Thanks