pombase / canto

The PomBase community curation tool
https://curation.pombase.org
Other
19 stars 7 forks source link

Brenda Tissue Ontology terms aren't suggesting correctly #2183

Closed jseager7 closed 4 years ago

jseager7 commented 4 years ago

Since a recent ontology reload, Canto suggests different terms when searching for the BTO term 'leaf':

image

The closest (and intended) match for this query is 'leaf' (BTO:0000713), and that was previously included in term suggestions.

I'm guessing this is because 'leaf' has a part_of relation to other terms, rather than an is_a relation:

id: BTO:0000713 name: leaf namespace: BrendaTissueOBO def: "A lateral outgrowth from a plant stem that is typically a flattened expanded variably shaped greenish organ, constitutes a unit of the foliage, and functions primarily in food manufacture by photosynthesis." [From_Merriam-Webster's_Online_Dictionary_at_www.Merriam-Webster.com:http\://www.m-w.com/cgi-bin/dictionary?book=Dictionary&va=leaf] relationship: part_of BTO:0001243 ! shoot relationship: part_of BTO:0001495 ! plumule

Our annotation extension configuration file only specifies part_of as a relation, but according to the wiki, this subset relation only applies to the domain ontology (PHIPO), not the range ontology (BTO), so this might not be anything to do with the problem.

domain ID subset relation extension relation range ID Canto display text Help text cardinality role
PHIPO:0000001 is_a infects_tissue BTO:0000000 host tissue infected (help text) * user

@kimrutherford Could this simply be a bug with the term lookup system?

ValWood commented 4 years ago

I would guess that if we use is_a and part of we should be able to access everything. Although I thought that everything should have an is_a relationship?

jseager7 commented 4 years ago

I would guess that if we use is_a and part of we should be able to access everything. Although I thought that everything should have an is_a relationship?

Everything in PHIPO is related with is_a (as far as I know), but the problem seems to be that the term lookup for the range ontology (BTO) isn't able to cope with part_of relations. I've tested this by looking up terms from BTO that have a normal subclass relation, compared to 'subclass of part of', and every example of the latter doesn't return results in Canto, whereas examples of the former do.

Basically, the term suggestion only seems to be taking BTO's is_a hierarchy into account.

jseager7 commented 4 years ago

I could test changing the subset relation to part_of, or a conjunction of is_a|part_of (if that's even possible), but the documentation specifically states that the subset relation only matches relations in the domain ontology (PHIPO):

subset relation - The relation used to find domain term descendants (usually "is_a")

kimrutherford commented 4 years ago

could test changing the subset relation to part_of, or a conjunction of is_a|part_of (if that's even possible),

I believe "is_a|part_of" is possible. Could you try it? If it works for you I'll update the documentation to mention it. :-)

jseager7 commented 4 years ago

I believe "is_a|part_of" is possible. Could you try it?

I'm not sure exactly how I can prove that this works, short of creating a test ontology just for this purpose, because PHIPO has no part_of relations. I could try and run an experiment locally with a minimal test ontology if you want me to test this regardless.

Just to clarify, the problem I'm trying to solve is that terms from the range ontology (BTO) that are related by part_of are not showing up in the term suggestion. There's no option in the config to specify a subset relation for the range ontology; you can only specify subset relations for the domain ontology. I need the term suggestion to pull in terms related by part_of.

ValWood commented 4 years ago

What is "range ontology" and "domain ontology" in this context?

jseager7 commented 4 years ago

What is "range ontology" and "domain ontology" in this context?

I'm using the terminology from the annotation extension configuration, specifically the tab-separated value (TSV) file. In the example below, you can see the columns 'Domain ID' and 'Range ID'. They're analogous to the domain and range of a mathematical function:

domain ID subset relation extension relation range ID Canto display text Help text cardinality role
PHIPO:0000001 is_a infects_tissue BTO:0000000 host tissue infected (help text) * user

The domain ontology specifies the terms to which you can apply an annotation extension. In this case, it means that only the children of PHIPO:0000001 can be used as the 'starting point' for the infects_tissue relation. The range ontology is the set of terms (or values) that can be selected as the 'end point' for the relation: those are the options you can pick from when creating the annotation extension.

So, infects_tissue is a relation that has PHIPO terms as its domain and BTO terms as its range. The domain ontology is PHIPO and the range ontology is BTO.

ValWood commented 4 years ago

OK, I understand but usually domain and range refer to the same ontology.

But this just means that (presumably) andy BTO tissue ID can be used with any PHIPO term (I.e both are specifying root nodes). So I think the fix is just to include part_of in the subset relation?

jseager7 commented 4 years ago

So I think the fix is just to include part_of in the subset relation?

Unfortunately the annotation extension documentation states that the subset relation only applies to the domain ontology (link). There doesn't seem to be anything I can do to specify a subset relation for the range ontology, so I can't get Canto to include part_of terms from BTO (and most of the terms are related with part_of). We'd need an extra column for 'range subset relation' or something.

ValWood commented 4 years ago

OK got it....I guess this would not be a problem if BTO was is_a complete... that's possibly why we never had the issue before? @mah11 is that right?

mah11 commented 4 years ago

but usually domain and range refer to the same ontology.

this isn't true

I guess this would not be a problem if BTO was is_a complete

probably, yes

kimrutherford commented 4 years ago

Just to clarify, the problem I'm trying to solve is that terms from the range ontology (BTO) that are related by part_of are not showing up in the term suggestion. There's no option in the config to specify a subset relation for the range ontology; you can only specify subset relations for the domain ontology.

Ah, right. I've got it now. :-)

Yep, there's no option at the moment. I'd prefer no to add one if we can avoid it at it makes the config (and code) even more complicated.

I'm guessing this is because 'leaf' has a part_of relation to other terms, rather than an is_a relation:

Sorry I should have thought more before answering. I don't think that the relations between terms should make a difference to the autocomplete as it only looks at names and synonyms. Could you send me a link to the BTO OBO file you're loading so I can try this locally?

jseager7 commented 4 years ago

Could you send me a link to the BTO OBO file you're loading so I can try this locally?

We're loading this version of BTO:

https://raw.githubusercontent.com/BRENDA-Enzymes/BTO/master/bto.obo

There's also other release formats on the repository, but most of them seem to be equivalent to bto.obo. The bto-simple.obo file is practically empty.

kimrutherford commented 4 years ago

I'm guessing this is because 'leaf' has a part_of relation to other terms, rather than an is_a relation:

Thanks for the link to BTO. I think you're right - the problem is that leaf has no is_a parents. Because of that the loading code puts "leaf" in the internal canto_root_subset. Terms in that subset are ignored during extension autocomplete.

The default configuration is:

 ontology_namespace_config:
   subsets_to_ignore:
     primary_autocomplete:
       - "is_a(Grouping_terms)"
     primary_select:
       - "is_a(Grouping_terms)"
       - "is_a(qc_do_not_annotate)"
     extension:
       - "is_a(canto_root_subset)"
       - "is_a(qc_do_not_annotate)"
   do_not_annotate_subsets:
     - "is_a(canto_root_subset)"
     - "is_a(qc_do_not_annotate)"

We might be able to change the loading code so that terms with a part_of parent aren't added to canto_root_subset. That may change the root subset for other ontologies but should be fine for FYPO and GO at least.

jseager7 commented 4 years ago

That may change the root subset for other ontologies but should be fine for FYPO and GO at least.

It won't affect PHIPO either, since we don't use part_of.

Could the default config add part_of to the list of subsets to ignore? Then there would be no harm in including part_of relations in the loading code by default. For example:

primary_autocomplete:
  - "is_a(Grouping_terms)"
  - "part_of(Grouping_terms)"
kimrutherford commented 4 years ago

For example:

I don't think that would help in your case. It's the

subsets_to_ignore -> extension -> "is_a(canto_root_subset)"

setting that's causing this.

Let's have a chat about this on Skype.

kimrutherford commented 4 years ago

James, we should talk about this on the next call. I'm not sure there is an easy fix. It's frustrating that BTO doesn't have a top level term like "tissue". Having "leaf" at the top level seems a bit strange.

kimrutherford commented 4 years ago

It's frustrating that BTO doesn't have a top level term like "tissue". Having "leaf" at the top level seems a bit strange.

I take that back. "leaf" isn't top level because although it has no is_a parents, it's part_of(shoot) and part_of(plumule). It's only top-level if consider just is_a relations, which is what Canto does.

jseager7 commented 4 years ago

Is it feasible for Canto to use a different relation as its 'dominant' relation, on a per-ontology basis? I don't know how common it is for ontologies to be structured with part_of, but I suspect BTO won't be the only one.

mah11 commented 4 years ago

Is it feasible for Canto to use a different relation as its 'dominant' relation, on a per-ontology basis?

I really don't recommend spending any time or effort trying to do this. The "dominance" of is_a in Canto reflects how the relation is used in OBO ontologies (corresponding to the SubClassOf relation in OWL). Diverging from that feels dangerous.

I don't know how common it is for ontologies to be structured with part_of, but I suspect BTO won't be the only one.

As far as I know it's not rare for ontologies to use the part_of relation, but beyond that I don't think I fully understand what you mean by "structured with part_of". BTO is a bit unusual among the OBO Foundry ontologies in lacking is_a paths to the root for some terms. In an ideal world the BTO maintainers would fill in the missing is_a links; in real life I don't know if they have the resources to make that a priority.

jseager7 commented 4 years ago

BTO is a bit unusual among the OBO Foundry ontologies in lacking is_a paths to the root for some terms.

That's what I meant by 'structured', in that I presumed there was some way to reach every term by a part_of relation, but not by is_a, meaning part_of is like the backbone of the ontology. I couldn't think of any more precise wording at the time.

ValWood commented 4 years ago

Can't the config file be used to allow both part_of and is_a terms? Presumably it is cardinality 1 at the moment but could that not be extended to is_a,part_of ?

mah11 commented 4 years ago

Can't the config file be used to allow both part_of and is_a terms?

This was addressed earlier in this ticket.

mah11 commented 4 years ago

That's what I meant by 'structured', in that I presumed there was some way to reach every term by a part_of relation, but not by is_a, meaning part_of is like the backbone of the ontology.

Ah, ok, thanks for the clarification. In that case, it is indeed rare. Even BTO uses is_a to organize its most general levels; it just also has lots of more specific terms that are only connected to the basic stucture by part_of (it isn't what GO folks would call "is_a-complete"). IOW it's not that BTO uses part_of instead of is_a fundamentally.

ValWood commented 4 years ago

I re-read and I still don't understand why for BTO extensions the config file cannot specify is_a,part_of

jseager7 commented 4 years ago

I re-read and I still don't understand why for BTO extensions the config file cannot specify is_a,part_of

We'd need a second 'subset relation' column in the extension config file that applies to the range ontology (BTO). The current 'subset relation' column only tells Canto how to select terms from PHIPO. The range ontology (BTO) always defaults to is_a.

jseager7 commented 4 years ago

If we had a second 'subset relation' column for the range ontology, we could do this:

domain ID domain relation extension relation range ID range relation Canto display text Help text cardinality role
PHIPO:0000001 is_a infects_tissue BTO:0000000 is_a|part_of host tissue infected (help text) * user

But currently there is no 'range relation' column; there's no way to specify the relation for the range ontology (BTO).

ValWood commented 4 years ago

AH got it! Can we cheat? Substitute part_of for is_a on import. We don't displaying these relations or doing anything which would confuse users.

I guess it depends how big an effect it would be changing range relation to to is_a|part_of more generally, and I don't have a good feel for that at all.

jseager7 commented 4 years ago

AH got it! Can we cheat? Substitute part_of for is_a on import. We don't displaying these relations or doing anything which would confuse users.

I'll have to defer to @kimrutherford but I suspect adding a column would be as simple (or difficult) as any cheat to work around the problem, because the configuration table above is processed at the same time as the ontologies are loaded.

I guess it depends how big an effect it would be changing range relation to to is_a|part_of more generally, and I don't have a good feel for that at all.

If it would fix the term suggestion problem that this issue was originally about (see https://github.com/pombase/canto/issues/2183#issue-552906311), then that would be a pretty big effect.

However, I can't really anticipate what the child term suggestion process would be like if we're also showing part_of relations there. We'd probably want some way for the interface to indicate which child terms are related by part_of and which are related by is_a, otherwise it could get confusing.

kimrutherford commented 4 years ago

I'll have to defer to @kimrutherford but I suspect adding a column would be simpler than any cheat to work around the problem,

I think you're probably right that a new column for "range subset relation". At the moment the code uses the domain subset relation as the range subset relation. That's not a great work by me. Luckily the domain subset relation is always "is_a" so it all works fine, except for BTO.

Adding a new column and implementing a range subset relation is a bit of a pain so I'm going to have a think before going ahead. I'm hoping there's a simpler solution but I'm doubtful.

jseager7 commented 4 years ago

There's now an issue open on the BTO repository, at https://github.com/BRENDA-Enzymes/BTO/issues/10, that proposes fixing the class hierarchy in BTO itself.

jseager7 commented 4 years ago

I've tried the simple fix of forcing all of the relations to is_a, and it seems to work well, providing you view the child terms as simply more specific concepts of the parents. Shown below is how 'leaf' fits into the hierarchy now. I'm planning to test this fix on the demo server soon.

bto_coerce_is-a

kimrutherford commented 4 years ago

Thanks James. I think that's probably a good quick fix for now.

jseager7 commented 4 years ago

The modified version of BTO is loaded on the demo server now, and it looks like the suggestions work as expected again, or maybe even better than they did originally:

image

The child term suggestions are working too:

image

The child suggestions are sometimes unexpected – for example 'bile' (BTO:0000121) as a more specific term for 'liver' (BTO:0000759) – but the only bad thing about this is that it clutters up the suggestion list with terms that might be rarely used. However, it might be better for users if we show everything that could be related to the parent term, in which case this isn't a problem at all.

jseager7 commented 4 years ago

Before l load this on the production server, I'd like to get the ontology loading problems in issue #2288 fixed first, providing it's not going to take a long time to fix the problem.

jseager7 commented 4 years ago

This is fixed on both servers now, so I'm going to close this issue. If anyone has any problems with the way the ontology works now (@CuzickA in particular), feel free to re-open.