monarch-initiative / GENO-ontology

Repository for representing genotypes and their association with phenotypes

promoters driving expression #20

Closed nlwashington closed 7 years ago

nlwashington commented 9 years ago

we need to have some examples of how to model promoters of one gene driving the expression of another.

for example, there is a zfin construct Tg(zp3:fsta,myl7:EGFP) where the promoter of zp3 is driving the expression of fsta.

i am not sure how to capture this with GENO modeling. we have the identifiers for all of them.

my guess is that i might need to make a node that is a genomic feature for "promoter of zp3", which doesn't exist. maybe that would look something like:

:_promoter_of_zp3 a SO:0000167 ;
    RO:regulates ZFIN:ZDB-GENE-991129-7 .

but, then how would we say that this promoter also regulates the expression of fsta? it clearly only does so because of the construct, and that isn't a "wildtype" property. also, the gene that is being expressed is a wildtype gene, but it is just that it might be in a different time/place/abundance...so its sequence isn't variant, rather its expression is variant.

does the construct itself become an "expression variant", with some expression-altered locus?

:_zp3:fsta a GENO:0000485 ;    # expression-altered locus
    GENO:has_expression-variant_part  ZFIN:ZDB-GENE-990714-11 .    # fsta

but i am not sure with this model how to reference anything about the promoter of zp3.

but we can also say:

:_zp3:fsta a GENO:0000485 ;
    GENO:is_expression_variant_of ZFIN:ZDB-GENE-990714-11 .

so, i'm not sure which is right, or if it's complete. help @mbrush !

pnrobinson commented 9 years ago

Hi everybody, be aware that the concept of promoter is per se difficult and there are lots of regions that seem to be bidirectional. Here is an example of a study on a related topic by a colleague of mine here:

http://www.ncbi.nlm.nih.gov/pubmed/25639469

Therefore, I would tend to avoid using ontological concepts like "promoter_of_zp3". I do not really have an idea of the best way to model things like this though...

-Peter

mbrush commented 9 years ago

Regarding:

"the gene that is being expressed is a wildtype gene, but it is just that it might be in a different 
 time/place/abundance...so it's sequence isn't variant, rather it's expression is variant"

My response here is that such a case of a WT human gene inserted as a transgene into a zebrafish genome represents a sequence-variant locus, not an expression variant. This is consistent with how these terms are defined/used in GENO (and also with the SO notion of a sequence feature). My rationale is below.

As variant loci in GENO are 'sequence features', they are defined by their sequence and position. Therefore, even though some human transgene inserted in a zebrafish genome may represent a WT sequence of the human gene, it is still a sequence variant locus in the context of the zebrafish genome. It is a novel insertion that is not normally there - this is what matters from the perspective of the fish.

So as noted, there is no need to use the expression-variant locus class to describe these variants as you suggested. They are sequence-variant loci like any inserted transgene. And to clarify, the notion of an 'expression variant locus' is used to describe an endogenous gene (in its host genome) that is altered in its level of expression rather than its sequence or position (as the result of some experimental manipulation like morpholino treatment).

The other thing to consider carefully is how to propagate phenotypes to the human gene represented in the WT Tg insertion. In this case I think we just want to propagate phenotypes to the 'gene' class IRI - esp because I don’t think we represent instances of WT gene alleles in our MOD datasets. In most cases we use the punned gene class IRI as a stand in for this notion of a canonical gene. But the utility of this convention is an entirely separate issue.

In the example here, it makes biological sense to say that if you over- or erroneously express a human gene in a zebrafish and the fish shows some phenotype, there may be a link between the gene and the phenotype. This is the inference we would want to get made in the data.

mbrush commented 9 years ago

Regarding:

"but, then how would we say that this promoter also regulates the expression of fsta?"

By the same reasoning applied in my comments above, I can say that 'this promoter' _does not_ actually also regulate the fsta gene. 'This promoter' here refers to a specific instance of a feature that is found in a zebrafish genome. It is a variant feature in this context - and it is a different entity than the canonical zp3 promoter (these would have different IRIs). This is because, as noted, sequence feature instances are identified/defined by their sequence AND their position.

What we want then is a link between the zp3 promoter in the fish Tg ('this promoter'), and an IRI representing the canonical zp3 promoter. This is where the GENO property 'derives_sequence_from' is used:

<zp3_promoter_in_fish_Tg>    GENO:derives_sequence_from  <canonical zp3_promoter> (assuming we represent this concept with an IRI)

I think/hope with this approach we can answer any queries/use cases around linking promoters in transgenes to the genes they canonically regulate.
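A minimal sketch of the kind of lookup this link would support, written as plain Python over triples stored as tuples. Only `GENO:derives_sequence_from` and the zp3 gene identifier come from the discussion above; the node names and the `ex:canonically_regulates` predicate are placeholders invented for illustration, since the thread does not settle how the canonical promoter is tied to its gene.

```python
# Triples as (subject, predicate, object) tuples, with IRIs abbreviated as CURIE strings.
TRIPLES = {
    # The promoter instance inside the transgene points at the canonical promoter.
    (":zp3_promoter_in_fish_Tg", "GENO:derives_sequence_from", ":canonical_zp3_promoter"),
    # Hypothetical link from the canonical promoter to the gene it regulates (zp3).
    (":canonical_zp3_promoter", "ex:canonically_regulates", "ZFIN:ZDB-GENE-991129-7"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def canonical_targets(tg_promoter, triples=TRIPLES):
    """Follow derives_sequence_from, then the (assumed) regulation link."""
    genes = set()
    for canonical in objects(tg_promoter, "GENO:derives_sequence_from", triples):
        genes |= objects(canonical, "ex:canonically_regulates", triples)
    return genes


print(canonical_targets(":zp3_promoter_in_fish_Tg"))
```

The two-hop walk keeps the Tg promoter and the canonical promoter as distinct nodes, which matches the point that they would have different IRIs.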

mbrush commented 9 years ago

Finally, regarding the need for a link from Tg promoters to the genes they drive expression of in the transgene, the current modeling approach in GENO supports this by creating a node for the promoter (a regulatory feature) and a node for the expressed gene (a coding feature), and linking these both as parts of the same Tg feature (the transgenic insertion) node. We implemented roughly this model in the zfin genotype data.

The structure of this graph would look something like this: [figure: tg_model_0001]

Note that there is no direct link from the promoter feature node to the expressed feature node, or to the IRI of the gene it regulates. But these could be added if it is useful (this will depend on our query/analysis use cases in this area).

And note again that these are all sequence-variant features, not expression-variant features, as noted in my previous comment.
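As a rough illustration of the shape described in this comment, here is a sketch in plain Python with triples as tuples. All node names, the type labels, and the bare `has_part` predicate are assumptions made for the example; they are not the identifiers used in the zfin genotype data.

```python
# (subject, predicate, object) triples; every ":"-prefixed name is hypothetical.
TRIPLES = {
    (":zp3_fsta_insertion", "rdf:type", ":transgenic_insertion"),
    (":zp3_fsta_insertion", "has_part", ":zp3_promoter_in_Tg"),
    (":zp3_fsta_insertion", "has_part", ":fsta_coding_in_Tg"),
    (":zp3_promoter_in_Tg", "rdf:type", ":regulatory_feature"),
    (":fsta_coding_in_Tg", "rdf:type", ":coding_feature"),
}


def parts_of_type(whole, feature_type, triples):
    """Parts of `whole` that are typed as `feature_type`."""
    parts = {o for s, p, o in triples if s == whole and p == "has_part"}
    return {x for x in parts if (x, "rdf:type", feature_type) in triples}


def promoter_gene_pairs(triples=TRIPLES):
    """Pair each regulatory part with each coding part of the same parent node."""
    wholes = {s for s, p, o in triples if p == "has_part"}
    pairs = set()
    for whole in wholes:
        for promoter in parts_of_type(whole, ":regulatory_feature", triples):
            for coding in parts_of_type(whole, ":coding_feature", triples):
                pairs.add((promoter, coding))
    return pairs


print(promoter_gene_pairs())
```

Pairing through the shared parent is only unambiguous when each promoter/coding cassette has its own parent node. If both cassettes of a construct like Tg(zp3:fsta,myl7:EGFP) hung off one node, this query would also pair the zp3 promoter with EGFP, which is one argument for a direct promoter-to-gene link.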

nlwashington commented 9 years ago

yes, i think it would be a very good thing to have a link between the promoter and coding element. we need to think about this generally, even in the context of the normal genome as well. would RO:regulates be sufficient?

mbrush commented 9 years ago

RO:regulates holds between two processes, so it would not apply here. The closest relation I see in the RO that might work is RO_0002448 'molecularly controls'.

The issue here is that RO_0002448 is a relation between two material entities, and the domain and range in our use case are abstract sequence features. This issue comes up again and again in our work, and can be overcome if we play loose and allow relations intended to hold between material entities to also hold between sequences that are borne by such material entities. This would keep us from having to create separate properties with the same intended meaning for sequences and their material bearers. Curious as to @cmungall's and @mellybelly's thoughts here, and if we can discuss this issue more broadly.

pnrobinson commented 9 years ago

I disagree that a promoter is an abstract sequence entity. It is a material entity that we choose to represent with a series of letters, but that does not make it abstract. Perhaps we are uncertain exactly where the entity begins and ends, but that does not mean that it does not actually have a beginning and end. Gene expression, on the other hand, is a little abstract, since we actually mean the number of physical entities produced per unit of time. Not sure that RO is our best bet in modeling all of this... -Peter

mbrush commented 9 years ago

Re: a promoter being an abstract entity - My apologies for not being clear here. I meant this in the ontological sense, in that all sequence features are modeled in our data as generically dependent continuants (GDCs) in the BFO sense - non-physical entities whose existence is dependent on material bearers. That is, we choose to model the abstract/information content inherent in a physical sequence molecule, rather than the molecule itself. Promoters also exist at the physical level of course, as there are material stretches of DNA in our cells that promote gene expression. But the ontological commitment in our model treats them as abstract GDCs (as does the Sequence Ontology).

Accordingly, it is technically incorrect to use a property such as RO:'regulates' or RO:'molecularly controls', whose domains and ranges are explicitly specified as processes or material entities, respectively, to connect two generically dependent continuants. But we don't want to have to create separate properties with the same meaning for materials and GDCs, esp given that a sequence feature GDC is inextricably tied to a material entity that bears it. This prompted my question for @cmungall as to the most practical solution for this problem.
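To make the mismatch concrete, here is a small validation-style sketch in plain Python. The domain/range categories are taken from the description in this thread, the node names are hypothetical, and the "relaxed" rule (letting a GDC stand in for its material bearer) is an assumption of the sketch, not an agreed GENO or RO policy. It is also not how an OWL reasoner treats domain and range: a reasoner would infer types from these axioms and complain only where the classes are declared disjoint.

```python
# Top-level category of each (hypothetical) node; sequence features are GDCs.
CATEGORY = {
    ":zp3_promoter_in_Tg": "GDC",
    ":fsta_coding_in_Tg": "GDC",
}

# (domain, range) categories as characterized in the comments above.
SIGNATURES = {
    "RO:regulates": ("process", "process"),
    "RO:0002448": ("material entity", "material entity"),  # 'molecularly controls'
}

# Assumed relaxation: a GDC may be used where its material bearer is expected.
RELAXED = {("GDC", "material entity")}


def allowed(subject, predicate, obj, relaxed=False):
    """Check a triple against the declared domain and range categories."""
    domain, rng = SIGNATURES[predicate]

    def fits(actual, expected):
        return actual == expected or (relaxed and (actual, expected) in RELAXED)

    return fits(CATEGORY[subject], domain) and fits(CATEGORY[obj], rng)


triple = (":zp3_promoter_in_Tg", "RO:0002448", ":fsta_coding_in_Tg")
print(allowed(*triple), allowed(*triple, relaxed=True))
```

Under the strict reading both properties are rejected for promoter-to-gene links; under the relaxed reading only 'molecularly controls' passes, since a GDC still cannot stand in for a process.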

pnrobinson commented 9 years ago

There are multiple philosophical problems with the way we describe biology. For instance, a sentence like "SMAD2 goes into the nucleus" treats the cell like a PowerPoint diagram with arrows, and acts as if there is one entity called SMAD2 that volitionally goes from one place in the cell to another, whereas actually there are equilibria and millions of random movements whose distribution is changed by other equilibria. I think that our way of conceptualizing reality is useful but pretty far off from what is actually happening, and it really doesn't make too much sense to worry about whether we conform to the semantics in RO or not -- we are wrong anyway. The main issue is that we create useful models that "do the right thing". -Peter

mbrush commented 9 years ago

Yes, I would agree, and think Chris would as well, that we should be pragmatic when it comes to how constrained we are by the semantics of RO. As long as we avoid any problematic reasoning consequences, we can be flexible with how we apply RO properties, and/or refine their meaning to be more general where it doesn't break anything else (i.e. the model does the right thing for everyone).