nfdi4plants / nfdi4plants_ontology

An intermediate ontology for plants used by DataPLANT to fill the ontology gap
MIT License

[Import Ontology] ncbitaxon light/min #72

Closed: Freymaurer closed this issue 9 months ago

Freymaurer commented 9 months ago

Ontology

ncbitaxon

Please state the reason to import this ontology into the SwateDB

We currently feature the full ncbitaxon ontology, which makes it harder to design performant parent-child search functions. I therefore propose replacing ncbitaxon with a light/min version. For an ontology of this sort I suggest the names:

I am currently creating this ontology along the following lines of thought:

Let me know what you think of this approach! Ideally, I would get some quick replies so that I can finish this tomorrow.

@muehlhaus @Brilator @AngelaKranz @Hannah-Doerpholz @kdumschott @StellaEggels

Freymaurer commented 9 months ago

This is a preview of the terms from highest to lowest:

[image]

Brilator commented 9 months ago

> and remove their existing is_a relationships.

What are those "is_a" relationships?

Sounds good to me, if that helps to speed up search. Organisms not listed in those 20,000 top hits could still be added by hand or to your ncbitaxonmin.obo. Just be careful with the ontology version. I am not sure how frequently, but NCBITaxon supposedly changes whenever a species is moved to a new taxon, so the approach needs to be updated and reproduced whenever NCBITaxon changes.
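A minimal sketch of how such a version check could look, assuming the upstream OBO file carries a `data-version` header tag and that the version the min file was built from is recorded somewhere. The function names and the version strings are hypothetical and only serve as illustration:

```python
def read_data_version(obo_text):
    """Return the value of the data-version header tag, or None if absent."""
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # The header ends where the first stanza (e.g. [Term]) begins.
            break
        if line.startswith("data-version:"):
            return line.split(":", 1)[1].strip()
    return None


def needs_rebuild(upstream_obo_text, built_from_version):
    """True if the upstream version differs from the one the min file was built from."""
    return read_data_version(upstream_obo_text) != built_from_version


# Placeholder header; the version string is made up for this example.
EXAMPLE_UPSTREAM = """\
format-version: 1.2
data-version: 2099-01-01

[Term]
id: EX:1
name: root
"""

if __name__ == "__main__":
    print(needs_rebuild(EXAMPLE_UPSTREAM, "2098-12-31"))
```

If the check reports a difference, the scripts that produced the min file would be rerun against the new upstream release.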

Hannah-Doerpholz commented 9 months ago

The main issue I see with this is that while you take the top entries, you are not filtering for plant organisms (as I understood it). Since those would be the most important for DataPLANT, I don't think this approach is the best. If there is a way to filter for plants, that would be great. Also, for the microbiologists who are using Swate, microorganisms might be important too. However, I also see entries for viruses and several animals in there that (in my opinion) are not relevant for DataPLANT.

Hannah-Doerpholz commented 9 months ago

If you have a look here, it might be worth filtering for some of these branches, definitely for Viridiplantae. I am no taxonomist, however, so there might be important organisms in other branches as well that we should include.

[image: screenshot from 2023-10-26 15-37-58]

Freymaurer commented 9 months ago

> Sounds good to me, if that helps to speed up search. Organisms not listed in those 20,000 top hits could still be added by hand or to your ncbitaxonmin.obo.

Correct!

> Just be careful with the ontology version. I am not sure how frequently, but NCBITaxon supposedly changes whenever a species is moved to a new taxon, so the approach needs to be updated and reproduced whenever NCBITaxon changes.

I can add the scripts I used to the repo.

> The main issue I see with this is that while you take the top entries, you are not filtering for plant organisms (as I understood it). Since those would be the most important for DataPLANT, I don't think this approach is the best. If there is a way to filter for plants, that would be great. Also, for the microbiologists who are using Swate, microorganisms might be important too. However, I also see entries for viruses and several animals in there that (in my opinion) are not relevant for DataPLANT.

Is this information we can get from ncbitaxon, maybe by running step 2 of my workflow against the subset of ncbitaxon terms that are children of plants and microorganisms? Is this something you can tell me?

Hannah-Doerpholz commented 9 months ago

> Is this information we can get from ncbitaxon, maybe by running step 2 of my workflow against the subset of ncbitaxon terms that are children of plants and microorganisms? Is this something you can tell me?

I have attached an image above; hopefully it was sent correctly. My internet is a bit unstable right now. For bacteria, I'm not sure what should be included. Sabrina or Angela probably know more about this.

StellaEggels commented 9 months ago

In addition to plants and bacteria, you might also want to include algae (including Cryptophyceae, Rhodophyta, Glaucocystophyceae) and fungi (Opisthokonta > Fungi). Maybe it is easier and sufficient to exclude animals (Opisthokonta > Metazoa)?
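Both options (keeping selected branches or excluding one) can be sketched as a filter over `is_a` relationships. This is only an illustration under stated assumptions, not the script used for ncbitaxonmin: the toy OBO text, the `EX:` ids, and the function names are all made up, and a real run would read the ncbitaxon OBO file and use the actual branch ids.

```python
from collections import defaultdict

# Toy excerpt in OBO syntax; ids and names are placeholders, not real NCBITaxon entries.
TOY_OBO = """\
[Term]
id: EX:1
name: root

[Term]
id: EX:2
name: Viridiplantae
is_a: EX:1 ! root

[Term]
id: EX:3
name: Metazoa
is_a: EX:1 ! root

[Term]
id: EX:4
name: some plant
is_a: EX:2 ! Viridiplantae

[Term]
id: EX:5
name: some animal
is_a: EX:3 ! Metazoa
"""


def parse_terms(text):
    """Parse [Term] stanzas into {id: {"name": str, "parents": [ids]}}."""
    terms = {}
    for stanza in text.split("\n\n"):
        lines = [l.strip() for l in stanza.strip().splitlines()]
        if not lines or lines[0] != "[Term]":
            continue
        term_id, term = None, {"name": "", "parents": []}
        for line in lines[1:]:
            tag, _, value = line.partition(": ")
            if tag == "id":
                term_id = value.strip()
            elif tag == "name":
                term["name"] = value.strip()
            elif tag == "is_a":
                # Drop the trailing "! comment" that OBO allows after the parent id.
                term["parents"].append(value.split(" ! ")[0].strip())
        if term_id:
            terms[term_id] = term
    return terms


def descendants(terms, root_id):
    """Return root_id plus every term reachable from it via reversed is_a edges."""
    children = defaultdict(list)
    for tid, term in terms.items():
        for parent in term["parents"]:
            children[parent].append(tid)
    seen, stack = {root_id}, [root_id]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def filter_terms(terms, include_roots=(), exclude_roots=()):
    """Keep the included branches (or everything if none given), minus excluded branches."""
    keep = set(terms) if not include_roots else set()
    for root in include_roots:
        keep |= descendants(terms, root)
    for root in exclude_roots:
        keep -= descendants(terms, root)
    return {tid: term for tid, term in terms.items() if tid in keep}


if __name__ == "__main__":
    terms = parse_terms(TOY_OBO)
    print(sorted(filter_terms(terms, include_roots=["EX:2"])))
    print(sorted(filter_terms(terms, exclude_roots=["EX:3"])))
```

Excluding a branch keeps everything outside it, including terms nobody thought to list, whereas including branches drops anything not explicitly covered.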

Freymaurer commented 9 months ago

ncbitaxonmin.zip

Brilator commented 9 months ago

This approach just threw out my favorite species "Talinum fruticosum" that I always use in trainings :(

kdumschott commented 9 months ago

Don't Panic! I'll add it for you manually.

Freymaurer commented 9 months ago

I just checked and it seems to have worked out fine! I can already find Talinum fruticosum in the search again.

[image]

kdumschott commented 9 months ago

Yes, I've added it to the ncbitaxon.min_plus.obo file you made, Kevin. Unfortunately, it lacks the full annotation because Protege keeps crashing when I try to open the ncbitaxon file to import the entire term, but it is at least there to be used for annotating metadata sheets. I'll try to sort that out when I have a spare minute.

Brilator commented 9 months ago

Thanks