intermine / pombemine


GO term -> annotation ignores ontology inferences #34

Closed ValWood closed 2 years ago

ValWood commented 2 years ago

I did a query on "chromosome segregation" and retrieved only one gene:

Screenshot 2022-02-23 at 16 40 01

Despite having 216 genes annotated to this process https://www.pombase.org/term/GO:0007059

This is because sds21 (SPCC31H12.05c) is the only gene annotated directly to this term. All of the other annotations are to descendants. Doesn't InterMine use the ontology structure in queries? I haven't checked any other mines, but this appears to be a major oversight (this is probably on the wrong tracker if it is a general bug).

gmicklem commented 2 years ago

On Feb 23 2022, Val Wood wrote:

I did a query on "chromosome segregation" and retrieved only one gene:

Screenshot 2022-02-23 at 16 40 01

Despite having 216 genes annotated to this process https://www.pombase.org/term/GO:0007059

This is because sds21 (SPCC31H12.05c) is the only gene annotated directly to this term. All of the other annotations are to children. Doesn't InterMine use the ontology structure in queries? I haven't checked any other mines, but this appears to be a major oversight (this is probably on the wrong tracker if it is a general bug).

This is not supposed to happen.


rachellyne commented 2 years ago

@ValWood Have you tried using the parent term?

ValWood commented 2 years ago

Good!

I poked around a little more and found the 'direct' toggle under 'relations'. I expected this would toggle direct/inferred annotation on and off, but it is 'off' by default. Toggling it on gives me 202 rows (closer), but they all contain SPCC31H12.05c as the associated gene.

rachellyne commented 2 years ago

Hmm - can't seem to upload the screenshot. If you add the constraint to the parent it returns 384 rows.

rachellyne commented 2 years ago

This is the query you want to run:

(Load it from the Import from XML option).

rachellyne commented 2 years ago

Grr, GitHub won't display XML either. I don't get this thing.

rachellyne commented 2 years ago

<query model="genomic" view="Gene.primaryIdentifier Gene.symbol Gene.goAnnotation.ontologyTerm.parents.name Gene.goAnnotation.ontologyTerm.parents.identifier Gene.goAnnotation.ontologyTerm.identifier Gene.goAnnotation.ontologyTerm.name" constraintLogic="(A)" sortOrder="" name="Custom_Gene_Query_1516719462">
   <constraint path="Gene.goAnnotation.ontologyTerm.parents.identifier" value="GO:0007059" op="=" code="A"/>
</query>

rachellyne commented 2 years ago

OK load the xml above!
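
For anyone who would rather generate this query than paste it, here is a minimal Python sketch that rebuilds the same XML with the standard library. The helper name `build_descendant_query` and its `go_id` parameter are illustrative and not part of InterMine; only the paths and attribute names are taken from the XML above. It assembles a string and does not contact any mine.

```python
import xml.etree.ElementTree as ET

# Output columns copied from the query XML posted above.
VIEW = [
    "Gene.primaryIdentifier",
    "Gene.symbol",
    "Gene.goAnnotation.ontologyTerm.parents.name",
    "Gene.goAnnotation.ontologyTerm.parents.identifier",
    "Gene.goAnnotation.ontologyTerm.identifier",
    "Gene.goAnnotation.ontologyTerm.name",
]


def build_descendant_query(go_id: str) -> str:
    """Return query XML that constrains the annotated term's parents to go_id.

    Hypothetical helper for illustration: it only builds the XML text.
    """
    query = ET.Element(
        "query",
        {"model": "genomic", "view": " ".join(VIEW), "constraintLogic": "(A)"},
    )
    # The constraint sits on the *parents* of the annotated term, not on the
    # annotated term itself.
    ET.SubElement(
        query,
        "constraint",
        {
            "path": "Gene.goAnnotation.ontologyTerm.parents.identifier",
            "op": "=",
            "value": go_id,
            "code": "A",
        },
    )
    return ET.tostring(query, encoding="unicode")


xml_text = build_descendant_query("GO:0007059")
print(xml_text)
```

The printed text can then be loaded through the same Import from XML option.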

ValWood commented 2 years ago

I tried at yeastmine and the behaviour is quite odd.

I get the direct annotations when I do the same query (72, which is about right, because SGD have annotated a number of genes less specifically than PomBase). The number of genes annotated to chromosome segregation is similar for both species: PomBase has 216 and SGD has 221 annotations to chromosome segregation.

SGD has 73 direct, and when I do the InterMine query I get 72 (which is ballpark correct).

If I toggle on the 'direct' toggle I suddenly get 14,616 rows.

ValWood commented 2 years ago

Right that is the correct result! How do I do that in the Query builder?

rachellyne commented 2 years ago

Did you try the query above? You need to constrain the parent term but still show the others.

ValWood commented 2 years ago

I don't really understand why constraining the parent gives me all of the descendants.

I would expect to see all annotations to a term by default, and then constrain to get either only direct annotations or different relationship types if necessary.

It isn't very intuitive!

Also, this query is from gene, so I can't do annotation retrieval queries directly from the ontology term, if I understand correctly?

rachellyne commented 2 years ago

I am pretty sure parent-child relationships are a standard way to refer to ontologies?

It doesn't matter if you start from ontology or gene - I'll create one from ontology.

rachellyne commented 2 years ago

<query model="genomic" view="GOAnnotation.ontologyTerm.name GOAnnotation.ontologyTerm.identifier GOAnnotation.ontologyTerm.parents.name GOAnnotation.ontologyTerm.parents.identifier GOAnnotation.subject.primaryIdentifier GOAnnotation.subject.symbol" constraintLogic="(B)" sortOrder="" name="Custom_GOAnnotation_Query_-1670177853">
   <constraint path="GOAnnotation.ontologyTerm.parents.identifier" value="GO:0007059" op="=" code="B"/>
   <constraint path="GOAnnotation.subject" type="Gene"/>
</query>

This is the same query starting from GO annotation

ValWood commented 2 years ago

I am pretty sure parent-child relationships are a standard way to refer to ontologies?

I agree, but the standard behaviour using ontologies would usually be to see all annotations to a term (which is biologically correct) and toggle to see only annotations to the parent (this is a query that you don't really want to do as a user because it is biologically incomplete, although we do use it administratively when assessing annotations)

I guess in this context the terminology parent/child confuses me. Usually in the ontology world we would use "direct" or "indirect" to describe whether annotations are directly to a term or to one of its subclasses. Parent/ancestor/superclass are usually used in the context of the ontology, rather than the context of the annotations.

But I think the issue is that I don't know exactly what it means to "constrain" on a parent. Constrain on parent seems to mean "return all ontology subclasses of a term" which doesn't seem intuitive. I think this is largely to do with my lack of knowledge about how database queries are constructed, but the default behaviour of presenting only direct annotations disturbs me.
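
A toy model may help show why constraining on a parent returns the descendants. The sketch below is not InterMine code. It assumes that each term's `parents` collection holds every ancestor of the term plus the term itself (an assumption about the data model, not something confirmed in this thread). Apart from sds21 and GO:0007059, the gene and term names are invented.

```python
# Toy ontology: child term -> its direct is_a parents. Identifiers other than
# GO:0007059 are invented placeholders.
IS_A = {
    "GO:0007059": [],
    "TERM:mitotic_segregation": ["GO:0007059"],
    "TERM:sister_chromatid_separation": ["TERM:mitotic_segregation"],
}

# Annotations as (gene, term) pairs; only sds21 is annotated directly.
ANNOTATIONS = [
    ("sds21", "GO:0007059"),
    ("geneA", "TERM:mitotic_segregation"),
    ("geneB", "TERM:sister_chromatid_separation"),
]


def parents_collection(term):
    """All ancestors of term plus term itself (assumed model behaviour)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in IS_A[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_direct(term):
    """Genes whose annotated term is exactly `term`."""
    return sorted(g for g, t in ANNOTATIONS if t == term)


def genes_parent_constrained(term):
    """Genes whose annotated term has `term` somewhere in its parents."""
    return sorted(g for g, t in ANNOTATIONS if term in parents_collection(t))


direct = genes_direct("GO:0007059")
inclusive = genes_parent_constrained("GO:0007059")
print(direct)     # ['sds21']
print(inclusive)  # ['geneA', 'geneB', 'sds21']
```

The constraint is applied to the parents of the annotated term, so it selects annotations whose term sits at or below GO:0007059. Seen from the term being searched for, those are its descendants.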

kimrutherford commented 2 years ago

Hi All.

For some background, in the PomBase query builder it's not even possible to retrieve only directly annotated genes. We always include the descendant annotations in the query results because that makes sense biologically.

If a user asked to be able to query only direct annotations (I don't think that's happened yet) as a first response we would probably suggest that they use the GAF file directly.

rachellyne commented 2 years ago

If you ask for the GO annotations for a gene you get all the annotations. If you start from a term, though, you need to define whether you want that term or that term and all child terms. You could, for instance, want to only return genes from any specific term. This is really a consequence of modelling the ontology structure and exposing it in the query builder - it means you can create any query!

rachellyne commented 2 years ago

I'll make some templates! I had already made a template:

GO term --> Genes and Annotation data

ValWood commented 2 years ago

Sorry to keep going on about this but ...

Firstly, it would be much more accurate biologically if the query default was inclusive and returned all annotations. Is there no way to do this? In reality, all annotations could be made directly to the term (i.e. we could generate direct annotations to the ancestor terms with the same evidence and provenance as the annotations to the children). However, we don't bother to do this because, if the ontology is used correctly, users will always retrieve the full complement of annotations: the ontology takes care of this for us.

At GO we always encourage tools to reason over the ontology at all times. The direct annotation query is an 'edge case', and not biologically relevant. The only time we use it in GO is for administrative purposes (i.e. to check that an unnecessary grouping term has no direct annotations before obsoletion), or to see if an annotation to a grouping term can be migrated down to a more specific term.

The problem is illustrated by a query on chromosome segregation: I get one gene product (which in this case is clearly incorrect; pombe is probably the best studied species for chromosome segregation!). This is a consequence of making very specific annotations.

Increasingly GO will block high level terms for direct annotation (to encourage more specific annotation). We routinely make terms 'not for direct annotation' if we think it should always be possible to be more specific. For instance translation (mitochondrial or cytosolic), cell cycle (mitotic or meiotic) etc. So increasingly queries will give zero annotations if the descendants are not included in a default search.

So it is really worth thinking about the default behaviour (filtering out annotations to descendants rather than filtering them in). Otherwise the user needs to know GO really well to realise that their results are very incomplete.
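
To make the consequence concrete, here is a small sketch with invented gene names and placeholder term labels (not real GO data). A grouping term that is blocked for direct annotation has zero direct annotations, yet a non-empty result once annotations to its subclasses are included. For brevity it assumes each term has at most one parent, which real GO does not guarantee.

```python
from collections import defaultdict

# Placeholder hierarchy: "translation" is treated as not for direct annotation.
# Simplifying assumption: every term has at most one parent.
SUBCLASS_OF = {
    "mitochondrial translation": "translation",
    "cytosolic translation": "translation",
}

# Invented genes, annotated only to the specific terms.
ANNOTATIONS = [
    ("gene1", "mitochondrial translation"),
    ("gene2", "cytosolic translation"),
    ("gene3", "cytosolic translation"),
]


def count_genes(include_descendants):
    """Count distinct genes per term, optionally propagating up the hierarchy."""
    genes = defaultdict(set)
    for gene, term in ANNOTATIONS:
        genes[term].add(gene)
        # Walk up to each ancestor and credit the gene there as well.
        while include_descendants and term in SUBCLASS_OF:
            term = SUBCLASS_OF[term]
            genes[term].add(gene)
    return {term: len(found) for term, found in genes.items()}


direct_counts = count_genes(include_descendants=False)
inclusive_counts = count_genes(include_descendants=True)
print(direct_counts.get("translation", 0))  # 0
print(inclusive_counts["translation"])      # 3
```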

Secondly, I think the thing I can't get my head around for the current behaviour is the labelling and behavior of the filter:

Screenshot 2022-02-24 at 12 45 04

I don't understand "only show ontology terms if they have an associated ontology term". Surely all ontology terms have associated ontology terms?

Should the description say something like "show descendant/subclass ontology terms?"

(I think I'm most confused by the use of "parent" in this context, because with this filter we are really retrieving 'descendant' terms, not 'parent' terms (the opposite). Anyway, I will ask Kim to explain it to me when we chat tomorrow morning to see if I can understand the current behaviour.)

rachellyne commented 2 years ago

@ValWood There isn't really any default behaviour. We model the ontology structure, which gives a lot of power, but hide that complexity from users through templates. The query builder exposes the model. We would have to change how we model ontologies to change this, but would lose a lot in the process I think - happy to brainstorm if you want, there may be a better solution. This is the core model, so it would affect all InterMines (and all ontologies). The query you want is: show all terms/genes where the parent is X.

"only show ontology terms if they have an associated ontology term" - that is a bit confusing. These are not configured individually for each case but generated based on the classes. Usually it makes sense, but here you have an ontology term referencing an ontology term, so it sounds confusing. I can ask Kevin if there is a better way!

ValWood commented 2 years ago

Right, I now understand why the query is the way it is. For me, for ontologies, the query would more intuitively be "show all subclasses of ontology term", and the referencing class label would be "subclasses" (or "children/descendants", but 'subclasses' is generally preferred) and not "Parents".

"Parents" tend to be used when you are traversing up the graph, not down.

ValWood commented 2 years ago

There isn't really any default behaviour.

Right, but is it possible to use the "Add summary" button to repopulate the query with sensible fields that include the "parent" option? It's tricky to locate the correct part of the query among all of the options.

rachellyne commented 2 years ago

I think we might be able to add "parent". I am trying to think if it would cause problems elsewhere as the summary fields are used to define tables and displays in the report pages.

ValWood commented 2 years ago

I now have queries that work correctly! Thanks @kimrutherford. There is a very minor bug in that some identical rows are duplicated, but the results are correct. We didn't use the "direct" option.