monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

Gene entities without annotator available #140

Open digrigor opened 1 year ago

digrigor commented 1 year ago

Hi!

I am trying to define my own extraction model using LinkML (really just editing the existing templates), which I will then use to extract associations between genes and ontology terms from these two sources: http://purl.obolibrary.org/obo/opl.owl and http://purl.obolibrary.org/obo/wbphenotype.owl.

My issue is that there are no annotators for the gene IDs I would like to extract. These gene IDs are from parasitic worms, so they are not in Gilda, bioportal:hgnc, etc.

Is there any way I could locally set up an "annotator" for my genes (I already have a list of all of them) that I could then use in my template?

Many thanks!

caufieldjh commented 1 year ago

Thanks for the issue @digrigor ! Looking into it now.

bgyori commented 1 year ago

Hello @digrigor, we are the developers of Gilda and might be able to help with this. Gilda itself can be customized fairly easily with extended or custom ontologies. Here is a notebook showing examples of different customizations: https://github.com/indralab/gilda/blob/master/notebooks/custom_grounders.ipynb. The key would be to allow instantiating such custom grounders within OntoGPT. If we can add more custom configuration to this line in OntoGPT: https://github.com/monarch-initiative/ontogpt/blob/4a2b8b7b8bb9285970e8212f40121729061e49f0/src/ontogpt/engines/knowledge_engine.py#L529, and to the corresponding get_adapter function in OAK, that should make it possible to create custom instances.

As an alternative hack, you could manually replace Gilda's ~/.data/gilda/<version>/grounding_terms.tsv.gz file with the custom terms you want to use, and they would then be used as if they were the defaults. This isn't a nice long-term solution, though.

caufieldjh commented 1 year ago

Hi @bgyori - yes, that's definitely one way to do it, and since @digrigor already has a gene list, it does sound like a case for Gilda. I believe adding this would require passing the config details to the GildaImplementation in OAK. That isn't implemented yet, but I'll open an issue.

The other, more immediate workaround may be to represent the term list as OWL and then provide the annotator name as path/to/the_list.owl. Protege can load a CSV and export it as OWL, or the list can be hacked together in JSON and converted with ROBOT, etc. Is it kind of excessive? Probably, since the crucial detail for OntoGPT is the labels rather than the relationships between classes. This suggests that we could add a very basic annotator for string matching or regex-based search alone.
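
To make the string-matching idea concrete, here is a rough sketch of what such a label-only annotator might look like. It is not part of OntoGPT or OAK; the names annotate, build_pattern and Match are hypothetical, and the gene labels and EXAMPLE: IDs are placeholders rather than real identifiers. It uses only the Python standard library and assumes the gene list is available as a label-to-ID mapping.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    """One hit of a gene label in the input text (hypothetical structure)."""
    gene_id: str
    text: str
    start: int
    end: int


def build_pattern(labels):
    """Compile one case-insensitive regex that matches any label as a whole token."""
    # Longest labels first, so the alternation prefers the most specific label.
    ordered = sorted(labels, key=len, reverse=True)
    alternation = "|".join(re.escape(label) for label in ordered)
    # The lookarounds stop a label from matching inside a longer token,
    # e.g. "daf-1" inside "daf-16" or "daf-16" inside "daf-160".
    return re.compile(rf"(?<![\w-])(?:{alternation})(?![\w-])", re.IGNORECASE)


def annotate(text, labels_to_ids):
    """Return every whole-token occurrence of a known label, in text order."""
    if not labels_to_ids:
        return []
    lookup = {label.lower(): gene_id for label, gene_id in labels_to_ids.items()}
    pattern = build_pattern(labels_to_ids.keys())
    return [
        Match(lookup[m.group(0).lower()], m.group(0), m.start(), m.end())
        for m in pattern.finditer(text)
    ]


# Placeholder gene list: labels and IDs are made up for illustration.
GENES = {"daf-1": "EXAMPLE:0001", "daf-16": "EXAMPLE:0002"}

for hit in annotate("Knockdown of DAF-16 altered the phenotype.", GENES):
    print(hit.gene_id, hit.text, hit.start, hit.end)
```

Matching is exact apart from case, so synonyms or spelling variants would need to be added to the mapping as extra labels pointing at the same ID.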