Closed rickbeeloo closed 11 months ago
Hi @rickbeeloo, glad to hear that OntoGPT may suit your needs! We don't currently have a way to pass an arbitrary set of ontologies directly on the command line, partly because even two people using the same set of ontologies may have completely different use cases for them. Some small modifications to one of the existing schemas will help, though. Here's an example:
id: http://w3id.org/ontogpt/rickbeeloo
name: rickbeeloo
title: rickbeelooTemplate
description: >-
  A template for extracting ENVO, NCBITAXON, and FOODON
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  linkml: https://w3id.org/linkml/
  rickbeeloo: http://w3id.org/ontogpt/rickbeeloo
default_prefix: rickbeeloo
default_range: string
imports:
  - linkml:types
  - core
classes:
  EntityContainingDocument:
    tree_root: true
    is_a: NamedEntity
    attributes:
      environments:
        range: Environment
        multivalued: true
        description: >-
          A semicolon-separated list of environmental terms.
      taxa:
        range: Taxon
        multivalued: true
        description: >-
          A semicolon-separated list of taxonomic terms of living things.
      foods:
        range: Food
        multivalued: true
        description: >-
          A semicolon-separated list of foods.
  Environment:
    is_a: NamedEntity
    id_prefixes:
      - ENVO
    annotations:
      annotators: sqlite:obo:envo
      prompt: >-
        the name of an environment.
        Examples are lake, meadow, waterfall.
  Taxon:
    is_a: NamedEntity
    id_prefixes:
      - NCBITAXON
    annotations:
      annotators: sqlite:obo:ncbitaxon
      prompt: >-
        the name of a taxonomic name or species.
        Examples are Bacillus subtilis, Bos taurus, blue whale.
  Food:
    is_a: NamedEntity
    id_prefixes:
      - FOODON
    annotations:
      annotators: sqlite:obo:foodon
      prompt: >-
        the name of a food, beverage, or ingredient.
        Examples are ketchup, milkshake, grape jelly.
OK, so that's a full template, but feel free to adapt to your use case.
Thanks for the fast reply! I have a deadline tomorrow for which it would be nice to have something annotated with OntoGPT :)
I stored this as rickbeeloo.yaml in the templates folder, then ran gen-pydantic --pydantic_version 2 rickbeeloo.yaml > rickbeeloo.py. There are some warnings but it seems fine. Then I tried to run ontogpt extract -i example.txt -t rickbeeloo, which gave me the following error:
INFO:root:API KEY path = C:\Users\rbeel\AppData\Local\ontology-access-kit\ontology-access-kit\ncbi-email-apikey.txt
INFO:root:Email for NCBI API not found.
INFO:root:API KEY path = C:\Users\rbeel\AppData\Local\ontology-access-kit\ontology-access-kit\ncbi-key-apikey.txt
INFO:root:NCBI API key not found. Will use no key.
WARNING:rdflib.term:C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\linkml_runtime\linkml_model\model\schema\types does not look like a valid URI, trying to serialize this will break.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts\ontogpt.exe\__main__.py", line 7, in <module>
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\cli.py", line 316, in extract
ke = SPIRESEngine(template=template, model=model_name, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<string>", line 23, in __init__
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 155, in __post_init__
self.template_class = self._get_template_class(self.template)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 247, in _get_template_class
mod = importlib.import_module(f"ontogpt.templates.{module_name}")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1776.0_x64__qbz5n2kfra8p0\Lib\importlib\__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 936, in exec_module
File "<frozen importlib._bootstrap_external>", line 1074, in get_code
File "<frozen importlib._bootstrap_external>", line 1004, in source_to_code
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
SyntaxError: source code string cannot contain null bytes
Quite a cryptic error for me, but I hope you know what it means.
I attached the zip and py (in case that would help): rickbeeloo.zip
Aaah, apparently it generated rickbeeloo.py encoded as UTF-16. Opening the .py in an editor and saving it as UTF-8 instead solved it.
OK great!
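The null-byte SyntaxError is what Python raises when it tries to import a UTF-16 file, because every ASCII character is stored next to a zero byte. In Windows PowerShell 5.1 the > redirect writes UTF-16 by default, which is a likely cause here (an assumption, since the thread does not say which shell was used). The re-save step can also be done in Python instead of an editor. This is a minimal sketch, and the function name is made up for illustration:

```python
from pathlib import Path


def reencode_utf16_to_utf8(path: Path) -> bool:
    """Rewrite `path` as UTF-8 if it starts with a UTF-16 byte order mark.

    Returns True when the file was converted, False when it was left alone.
    """
    raw = path.read_bytes()
    # UTF-16 files written with a BOM start with FF FE (little-endian)
    # or FE FF (big-endian).
    if raw[:2] not in (b"\xff\xfe", b"\xfe\xff"):
        return False
    text = raw.decode("utf-16")  # the codec consumes the BOM
    path.write_bytes(text.encode("utf-8"))
    return True


if __name__ == "__main__":
    target = Path("rickbeeloo.py")  # the generated module from this thread
    if target.exists():
        changed = reencode_utf16_to_utf8(target)
        print(f"{target}: {'converted to UTF-8' if changed else 'left unchanged'}")
```

Files without a UTF-16 byte order mark are left untouched, so running it twice is harmless.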
It worked for one text but not for this one, "mouse gut metagenome", any idea why?
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\cli.py", line 336, in extract
results = ke.extract_from_text(text=text, cls=target_class_def, show_prompt=show_prompt)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\spires_engine.py", line 97, in extract_from_text
extracted_object = self.parse_completion_payload(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\spires_engine.py", line 559, in parse_completion_payload
return self.ground_annotation_object(raw, cls)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\spires_engine.py", line 631, in ground_annotation_object
obj = self.normalize_named_entity(val, slot.range) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 352, in normalize_named_entity
for normalized_id in self.normalize_identifier(obj_id, cls):
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 416, in normalize_identifier
for obj_id in self.map_identifier(input_id, cls):
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 449, in map_identifier
for mapping in mapper.sssom_mappings([input_id]):
File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\oaklib\implementations\translator\translator_implementation.py", line 62, in sssom_mappings
equiv_identifiers = data.get("equivalent_identifiers", [])
^^^^^^^^
AttributeError: 'str' object has no attribute 'get'
Apparently it happens here in oak.
That bug is something that has been fixed in oaklib but just hasn't made it into the most recent release.
If you open translator_implementation.py, as specified in the last part of your error stack trace, you can make the fix by changing the value of NODE_NORMALIZER_ENDPOINT as is done here: https://github.com/INCATools/ontology-access-kit/commit/75940bfa883001afb0e4aebb339fd62583c13844
The newest oak release (0.5.21) should contain the fix, too.
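To see whether the installed release already includes that change before editing files by hand, a small check like the one below may help. It is only a sketch: it assumes the package is installed under the distribution name "oaklib", and the helper names are invented here.

```python
import re
from importlib import metadata

FIXED_IN = (0, 5, 21)  # the release named in the reply above


def version_tuple(version: str) -> tuple:
    """Turn '0.5.21' into (0, 5, 21), ignoring any non-numeric suffix."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def has_fix(installed: str) -> bool:
    """True if `installed` is at least the release said to carry the fix."""
    return version_tuple(installed) >= FIXED_IN


if __name__ == "__main__":
    try:
        # Assumption: the distribution is published as "oaklib".
        installed = metadata.version("oaklib")
    except metadata.PackageNotFoundError:
        print("oaklib is not installed in this environment")
    else:
        status = "includes" if has_fix(installed) else "predates"
        print(f"oaklib {installed} {status} the 0.5.21 release")
```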
Thanks! That fixed it (I had to downgrade pydantic again as well).
For the input text "This is a mouse gut sample" I get:
input_text: |-
  This is a mouse gut sample
raw_completion_output: |-
  environments: N/A
  taxa: mouse
  foods: gut
  label: mouse gut
prompt: |+
  From the text below, extract the following entities in the following format:
  environments: <A semicolon-separated list of environmental terms.>
  taxa: <A semicolon-separated list of taxonomic terms of living things.>
  foods: <A semicolon-separated list of foods.>
  label: <The label (name) of the named thing>
  Text:
  This is a mouse gut sample
  ===
extracted_object:
  id: cb699559-d389-4b43-915f-1ae19f750bf2
  label: mouse gut
  environments:
    - AUTO:N/A
  taxa:
    - AUTO:mouse
  foods:
    - AUTO:gut
named_entities:
  - id: AUTO:N/A
    label: N/A
  - id: AUTO:mouse
    label: mouse
  - id: AUTO:gut
    label: gut
But we would expect "mouse" to be linked to "Mus musculus" and thereby to the NCBITaxon ontology entry, and this is not happening. Also, why does it see "gut" as a food?
Aha, that's my mistake - the prefix in that case should be NCBITaxon, not NCBITAXON.
So if you edit the template's id_prefixes: slot for Taxon to be NCBITaxon instead of NCBITAXON, then you should get NCBITaxon:10088 instead.
As for the rest of it, I actually got "mouse gut" as an environment when running this extraction, so there's some variability. A longer input text would likely provide the context the LLM can use to better categorize the entity.
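The prefix fix suggests that identifiers are matched against id_prefixes by exact, case-sensitive comparison. The toy sketch below illustrates that behaviour. It is not OntoGPT's code, the function names are hypothetical, and the exact-match rule is an assumption inferred from this thread.

```python
def accepted(curie: str, id_prefixes: list) -> bool:
    """True if the CURIE's prefix is listed exactly (case-sensitively)."""
    prefix, sep, local_id = curie.partition(":")
    return bool(sep and local_id) and prefix in id_prefixes


def pick_id(label: str, candidates: list, id_prefixes: list) -> str:
    """Return the first accepted candidate, else an AUTO: placeholder."""
    for curie in candidates:
        if accepted(curie, id_prefixes):
            return curie
    return f"AUTO:{label}"


# The annotator finds the right term, but the template lists the wrong case,
# so the match is discarded and the label falls back to AUTO:.
print(pick_id("mouse", ["NCBITaxon:10088"], ["NCBITAXON"]))  # AUTO:mouse
print(pick_id("mouse", ["NCBITaxon:10088"], ["NCBITaxon"]))  # NCBITaxon:10088
```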
Hey @caufieldjh, it does work fine for those now, thanks! I want to add UBERON and was reading your replies here; I suppose it is as simple as:
classes:
  EntityContainingDocument:
    tree_root: true
    is_a: NamedEntity
    attributes:
      ........
      anatomies:
        range: Anatomy
        multivalued: true
        description: >-
          A semiclon-separated list of the body structures of living things
and
  Anatomy:
    is_a: NamedEntity
    id_prefixes:
      - UBERON
    annotations:
      annotators: sqlite:obo:uberon
      prompt: >-
        the name of an anatomical structure.
        Examples are blood, arm, haemolymphatic fluid.
When I then run it for: sample from human blood and arm
It gives me:
extracted_object:
  id: a272ba36-66e1-405b-9d15-e9088d563caf
  label: human blood, arm
  taxa:
    - NCBITaxon:9606
  anatomies:
    - AUTO:blood%2C%20arm
named_entities:
  - id: NCBITaxon:9606
    label: human
  - id: AUTO:blood%2C%20arm
    label: blood, arm
I again get the "AUTO" prefix even though there is a term for blood in UBERON. So perhaps I messed something up again.
Thanks again for building all this, really excited to run this for half a million metadata fields in our next paper!
Ooow damn, my browser's spell check caught it: "semiclon-separated" instead of "semicolon-separated". That solved it... well, that was a fun day for one letter haha. Thanks anyway :)
Fantastic - happy to hear it's working!
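Why one letter mattered: with the misspelled description the model answered "blood, arm" using a comma, and that answer came back as a single AUTO: entity. The sketch below reproduces that outcome under two assumptions that are consistent with the output in this thread but not confirmed by it: multivalued answers are split on semicolons only, and ungrounded labels are percent-encoded after an AUTO: prefix. The function names are hypothetical.

```python
from urllib.parse import quote


def split_values(raw: str) -> list:
    """Split a multivalued answer on semicolons, dropping empty items."""
    return [item.strip() for item in raw.split(";") if item.strip()]


def auto_curie(label: str) -> str:
    """Build a placeholder ID for a label that matched no ontology term."""
    return "AUTO:" + quote(label)


# A comma-separated answer stays one item, so no single UBERON term matches.
print(split_values("blood, arm"))   # ['blood, arm']
print(auto_curie("blood, arm"))     # AUTO:blood%2C%20arm

# A semicolon-separated answer yields two labels that can be grounded separately.
print(split_values("blood; arm"))   # ['blood', 'arm']
```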
Hey @caufieldjh!
Awesome idea! This is probably exactly what we need (I hope :)). We have a bunch of free text fields that we want to standardize by mapping them to ontologies (ENVO, NCBITAXON, and FOODON). Reading the docs I suppose this should be possible, but I'm a bit lost in the docs.
Let's take this as an example, in test.txt: This is a sample from drinking water with E.coli
Which should give me something along the lines of:
I wonder what command we should run to do the above. I expected to be able to run something like ontogpt --inputfile test.txt -t ENVO,NCBITAXON,FOODON but looking here I should probably use one of the presets? Thanks!