intermine / pombemine


Modelling of annotation extensions #52

Open ValWood opened 2 years ago

ValWood commented 2 years ago

"Annotation extensions" are an increasingly important component of GO annotation because this is how we capture connections between terms (i.e. the actual biology), and they form the basis of our causal modelling.

For example:
cdc2 (protein kinase) [[phosphorylates cdc15] [during mitotic S-phase]]
clp1 (protein phosphatase) [[dephosphorylates cdc2] [during mitotic M phase]]

This is the type of data we would really need to be able to interrogate but, because the terms are curated (and thus represented here) as a single extension, it is not possible to separate out the datatypes and query them independently.

This is what we see

Screenshot 2022-05-23 at 16 05 02

So we can neither (i) extract the biological targets of a specific gene/protein (i.e. all of the targets of a kinase or a transcription factor) nor (ii) query on the phase of the cell cycle at which these occur.

An extension is actually often a compound annotation comprising numerous relations and datatypes stored in a single field of the GO-GAF.
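To make the "compound annotation" point concrete, here is a minimal Python sketch that splits one extension field into its relation/range parts. It assumes the GAF convention that `|` separates independent extensions and `,` joins parts that apply together. `ExtensionPart` and `parse_extension_field` are hypothetical names, and the example string is assembled from identifiers that appear in this issue rather than copied from the file.

```python
import re
from typing import List, NamedTuple


class ExtensionPart(NamedTuple):
    relation: str  # e.g. "has_input"
    range_id: str  # e.g. "PomBase:SPAC17A5.07c" or "GO:0000084"


# One part looks like relation(range); names here are illustrative only.
_PART_RE = re.compile(r"^\s*([A-Za-z_]+)\((.+)\)\s*$")


def parse_extension_field(field: str) -> List[List[ExtensionPart]]:
    """Split an extension field into groups of relation(range) parts.

    '|' separates independent extensions; ',' joins parts of one extension.
    """
    groups = []
    for group_text in filter(None, field.split("|")):
        parts = []
        for part_text in group_text.split(","):
            match = _PART_RE.match(part_text)
            if match is None:
                raise ValueError(f"unparseable extension part: {part_text!r}")
            parts.append(ExtensionPart(match.group(1), match.group(2)))
        groups.append(parts)
    return groups


example = "has_input(PomBase:SPAC17A5.07c),happens_during(GO:0000084)"
parsed = parse_extension_field(example)
```

Each inner list is one extension whose parts belong together, which is the grouping a query on "targets during a phase" would need to preserve.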

I have a proposal to make these accessible in PomBeMine. First, I don't think it is practical within the time constraints of the project to include all the annotation extension types we have used, i.e.:
168 coincident_with
180 existence_overlaps
669 happens_during
1834 has_input
218 occurs_in
702 part_of

So I suggest that we try to model the 3 most useful, namely:
has_input (to give access to targets)
happens_during (to give access to phases)
part_of (to link molecular functions to processes)

This would require loading each gene list into a field describing its extension type. In PomBeMine, instead of

Screenshot 2022-05-23 at 16 30 10

it would need to look like:

| Gene DB identifier | Ontology term Identifier | Extension (during) | Extension (target) | Extension (Part of process) |
| --- | --- | --- | --- | --- |
| SPBC11B10.09 | GO:0004693 | GO:0000084 | SPAC17A5.07c | NO DATA |
| SPBC11B10.09 | GO:0004693 | GO:0000084 | SPAC18G6.10 | NO DATA |
| SPBC11B10.09 | GO:0004693 | GO:0000084 | SPAC20G8.05c | NO DATA |
| SPBC11B10.09 | GO:0004693 | GO:0000080 | SPAC22E12.19 | NO DATA |
| SPBC11B10.09 | GO:0004693 | GO:0000084 | SPAC22H10.11c | NO DATA |
| SPBC11B10.09 | GO:0004693 | GO:0000089 | SPCC736.14 | GO:0090307 |

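A rough Python sketch of how one annotation's extension parts could be spread across the three proposed columns. The column headings are taken from the table above; `COLUMNS` and `extension_row` are hypothetical names, and the sketch assumes at most one part per relation in a given extension (a later part with the same relation would overwrite an earlier one).

```python
from typing import Dict, Iterable, Tuple

# Relation name -> column heading. Only the three proposed relations are kept.
COLUMNS: Dict[str, str] = {
    "happens_during": "Extension (during)",
    "has_input": "Extension (target)",
    "part_of": "Extension (Part of process)",
}


def extension_row(gene: str, term: str,
                  parts: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Build one output row; relations outside COLUMNS are ignored."""
    row = {"Gene DB identifier": gene, "Ontology term Identifier": term}
    # Every extension column starts as "NO DATA", as in the proposed layout.
    row.update({heading: "NO DATA" for heading in COLUMNS.values()})
    for relation, range_id in parts:
        heading = COLUMNS.get(relation)
        if heading is not None:
            row[heading] = range_id
    return row


row = extension_row(
    "SPBC11B10.09",
    "GO:0004693",
    [("happens_during", "GO:0000089"),
     ("has_input", "SPCC736.14"),
     ("part_of", "GO:0090307")],
)
```

Because all three values land in the same row, a constraint on one column can be combined with a constraint on another.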
ValWood commented 2 years ago

So, we would need the ability to select different 'types' of extensions here:

Screenshot 2022-05-23 at 16 49 51
ValWood commented 2 years ago

Based on https://github.com/pombase/curation/issues/3269

I think we are probably loading https://curation.pombase.org/dumps/latest_build/misc/pombase_style_gaf.tsv for PomBeMine. If so, could we switch to https://curation.pombase.org/dumps/latest_build/misc/go_style_gaf.tsv? It is the same data but has the GO-compliant (restricted) set of extensions. (We keep the other extensions in PomBase for now, as we need to convert them to something else that is not yet available in our tool.) Sorry for the confusion.

kimrutherford commented 2 years ago

Going on how InterMine models other things, I think it makes sense for annotations to have a collection of AnnotationExtensionPart objects because extensions can have multiple relations and targets.

An AnnotationExtensionPart could have a relation (term) and a range (a term, a gene or whatever).

That's more or less how we model things in the pombase.org code.

Extension (target)

In your example these are strings. If they were objects (Gene, Term etc.) the queries could be more powerful. You'd be able to include the gene name or term name from an extension in the output of a query, or constrain the term using its parents (like querying for annotations that have a during "meiotic cell cycle phase" extension and also getting annotations with during "meiotic prophase I" in the output).
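A small Python sketch of the parent-based constraint, under loud assumptions: the hierarchy is a toy map keyed by term name with "meiotic prophase I" placed directly under "meiotic cell cycle phase" (real data would use ontology IDs and the full graph), and `PARENTS`, `ancestors_and_self` and `has_during` are hypothetical names, not InterMine API.

```python
from typing import Dict, List, Set, Tuple

# Toy hierarchy for illustration only: child term -> direct parents.
PARENTS: Dict[str, List[str]] = {
    "meiotic prophase I": ["meiotic cell cycle phase"],
    "meiotic cell cycle phase": [],
}


def ancestors_and_self(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term plus every term reachable by following parents."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def has_during(extension: List[Tuple[str, str]], query_term: str,
               parents: Dict[str, List[str]]) -> bool:
    """True if any happens_during range is the query term or a descendant."""
    return any(
        relation == "happens_during"
        and query_term in ancestors_and_self(range_term, parents)
        for relation, range_term in extension
    )


extension = [("happens_during", "meiotic prophase I")]
matches_parent = has_during(extension, "meiotic cell cycle phase", PARENTS)
```

This only works when the range is a term object with a place in the ontology; a plain string column could be matched exactly but not through its parents.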

ValWood commented 2 years ago

This would be very powerful. It would definitely be useful for Alliance @klarra @nuin and FlyMine too.

danielabutano commented 2 years ago

@kimrutherford I need some clarification, please. The idea would be to change the model from `

`

to `

` My question is how to parse the term and gene in annotationExtension. Given these examples, how can I map them?

existence_overlaps(GO:0051329) -> AnnotationExtensionPart with a termRange (pointing to GO:0051329) and relation?
occurs_in(SO:0001850) -> AnnotationExtensionPart with a termRange (pointing to SO:0001850) and relation?
has_input(PomBase:SPAPJ760.03c) -> AnnotationExtensionPart with a geneRange (pointing to SPAPJ760.03c) and relation?

Thanks!!

ValWood commented 2 years ago

@kimrutherford can you answer this, or is it better if we all chat about it together to make sure that we are all on the same page?

ValWood commented 2 years ago

Also, the way we have split out the FYPO "annotation extensions" into 2 separate columns with different headings and value ranges is a model for how to handle annotation extensions.

However, I don't think we need to model every GO extension. Some extensions only act like a qualifier, providing a bit of extra specificity about an annotation. I think we can omit these. The ones of real use are those that provide connection data, so I think we only need to load these 3 for now:

has_input (to give access to targets)
happens_during (to give access to phases)
part_of (to link molecular functions to processes)

We can skip the rest, they won't make any of the queries more powerful.

kimrutherford commented 2 years ago

Hi @danielabutano

I think your suggested model change makes sense, except that the relation in AnnotationExtensionPart probably needs to be an OntologyTerm:

<class name="AnnotationExtensionPart" is-interface="true">
  <reference name="relation" referenced-type="OntologyTerm"/>
  <reference name="geneRange" referenced-type="Gene"/>
  <reference name="termRange" referenced-type="OntologyTerm"/>
</class>

The extension relations are mostly from the relation ontology (RO), BFO and from GOREL (https://github.com/geneontology/go-ontology/blob/master/src/ontology/extensions/gorel.obo).
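As an illustration of how a parser might fill those three references, here is a hedged Python sketch. It assumes a range with a `PomBase:` prefix is a gene and anything else (for example `GO:` or `SO:`) is an ontology term; that rule, and the name `map_part`, are assumptions for the example, not the loader's actual logic. The relation is kept as a plain name here; a real loader would resolve it to an RO, BFO or GOREL term.

```python
import re
from typing import Dict, Optional

# One part looks like relation(range), e.g. has_input(PomBase:SPAPJ760.03c).
_PART_RE = re.compile(r"^([A-Za-z_]+)\(([^()]+)\)$")


def map_part(text: str,
             gene_prefix: str = "PomBase:") -> Dict[str, Optional[str]]:
    """Map one relation(range) string to relation / geneRange / termRange."""
    match = _PART_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unparseable extension part: {text!r}")
    relation, range_id = match.groups()
    if range_id.startswith(gene_prefix):
        # Assumed rule: PomBase-prefixed ranges point at a Gene.
        gene_id = range_id[len(gene_prefix):]
        return {"relation": relation, "geneRange": gene_id, "termRange": None}
    # Everything else is treated as an OntologyTerm identifier.
    return {"relation": relation, "geneRange": None, "termRange": range_id}


gene_part = map_part("has_input(PomBase:SPAPJ760.03c)")
term_part = map_part("occurs_in(SO:0001850)")
```

Exactly one of geneRange and termRange is set for each part, matching the two optional references in the class above.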

How does that sound?

ValWood commented 2 years ago

I'm not sure which GO GAF you are loading: https://curation.pombase.org/dumps/latest_build/misc/go_style_gaf.tsv or https://curation.pombase.org/dumps/latest_build/misc/pombase_style_gaf.tsv

For simplicity you should use https://curation.pombase.org/dumps/latest_build/misc/go_style_gaf.tsv

(The PomBase version uses some extra extensions, but they are mainly 'placeholders' and we don't need them in InterMine.)
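For reference, a minimal Python sketch of pulling the extension column out of a GAF-style file. It assumes the file follows the standard GAF layout (tab-separated, `!` comment lines, object ID in column 2, GO ID in column 5, annotation extension in column 16); whether go_style_gaf.tsv matches that layout exactly has not been checked here. `iter_extensions` is a hypothetical name, and the sample row is built in code with all other columns left empty.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple

EXTENSION_COLUMN = 15  # zero-based index of GAF column 16


def iter_extensions(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (gene ID, GO ID, extension field) for rows with an extension."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue  # skip blank lines and header/comment lines
        if len(row) > EXTENSION_COLUMN and row[EXTENSION_COLUMN]:
            yield row[1], row[4], row[EXTENSION_COLUMN]


# Build a 17-column sample row; only the columns used above are filled in.
fields = [""] * 17
fields[0] = "PomBase"
fields[1] = "SPBC11B10.09"
fields[4] = "GO:0004693"
fields[15] = "has_input(PomBase:SPAC17A5.07c),happens_during(GO:0000084)"
sample = "!gaf-version: 2.2\n" + "\t".join(fields) + "\n"

rows = list(iter_extensions(io.StringIO(sample)))
```

With a real download, the `io.StringIO(sample)` handle would be replaced by an open file.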

ValWood commented 2 years ago

Daniela, I'm really sorry. This isn't quite what I envisaged above, and I'm sorry that you have spent your valuable time on it. It is not quite right, although it is a large part of the way there.

It works insofar as the relations and their entities/ranges are separated and queryable. However, an important feature is missing: the extensions on an annotation belong together. So an annotation that should read:
Cdc2 protein kinase has_target-dis1 during mitotic-metaphase
is instead represented as:
Cdc2 protein kinase during mitotic-metaphase
Cdc2 protein kinase has_target-dis1

So we cannot select for targets that occur during particular cell cycle phases. This will only work if the different extension types are still connected to the same annotation row (exactly how penetrance and severity are modelled for phenotypes). In effect, we can think of extensions as connected but different datatypes that are shoe-horned into a single column of the GAF. We are trying to separate them out to make them useful. When this is available it should be very powerful for querying.
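The difference can be shown with a short Python sketch. The two annotations use the cdc2 examples from this thread (dis1 during mitotic metaphase, cdc15 during mitotic S phase); `has_input` is assumed as the target relation, and `targets_during`, `flatten` and `targets_during_flat` are hypothetical names for illustration.

```python
from typing import Dict, List, Tuple

# Each annotation keeps its extension parts together as (relation, range).
ANNOTATIONS: List[Dict] = [
    {"gene": "cdc2", "term": "protein kinase",
     "extension": [("has_input", "dis1"),
                   ("happens_during", "mitotic metaphase")]},
    {"gene": "cdc2", "term": "protein kinase",
     "extension": [("has_input", "cdc15"),
                   ("happens_during", "mitotic S phase")]},
]


def targets_during(annotations: List[Dict], gene: str, phase: str) -> List[str]:
    """Targets whose own annotation also carries the given phase."""
    targets: List[str] = []
    for annotation in annotations:
        parts = annotation["extension"]
        if annotation["gene"] == gene and ("happens_during", phase) in parts:
            targets.extend(r for rel, r in parts if rel == "has_input")
    return targets


def flatten(annotations: List[Dict]) -> List[Tuple[str, str, str]]:
    """Drop the grouping: one (gene, relation, range) row per part."""
    return [(a["gene"], rel, r) for a in annotations for rel, r in a["extension"]]


def targets_during_flat(rows: List[Tuple[str, str, str]],
                        gene: str, phase: str) -> List[str]:
    """Best effort on flattened rows: cannot tell which target goes with which phase."""
    if (gene, "happens_during", phase) not in rows:
        return []
    return [r for g, rel, r in rows if g == gene and rel == "has_input"]


grouped = targets_during(ANNOTATIONS, "cdc2", "mitotic metaphase")
flat = targets_during_flat(flatten(ANNOTATIONS), "cdc2", "mitotic metaphase")
```

With the grouping kept, only dis1 comes back for mitotic metaphase; on the flattened rows cdc15 is returned as well, because nothing ties a target to its phase.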

I'd like to revisit this one once further funding is available. It would be really useful for GO data generally (this is also the query that the Nurse lab wanted, and I wanted to include in my talk but I will do something else instead as time is so short).