monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

Some guidance :) #255

Closed rickbeeloo closed 11 months ago

rickbeeloo commented 1 year ago

Hey @caufieldjh!

Awesome idea! This is probably exactly what we need (I hope :)). We have a bunch of free text fields that we want to standardize by mapping them to ontologies (ENVO, NCBITAXON, and FOODON). Reading the docs I suppose this should be possible, but I'm a bit lost in the docs.

Let's take this as an example, in test.txt: This is a sample from drinking water with E.coli

Which should give me something along the lines of:

drinking water: ENVO:00003064
E.coli: NCBITAXON:562

I wonder what command we should run to do the above? I expected to be able to run something like ontogpt --inputfile test.txt -t ENVO,NCBITAXON,FOODON, but looking here I should probably use one of the presets?

Thanks!

caufieldjh commented 1 year ago

Hi @rickbeeloo, glad to hear that OntoGPT may suit your needs! We don't currently have a way to pass an arbitrary set of ontologies directly on the command line, partly because even two people using the same set of ontologies may have completely different use cases for them. Some small modifications to one of the existing schemas will help, though. Here's an example:

id: http://w3id.org/ontogpt/rickbeeloo
name: rickbeeloo
title: rickbeelooTemplate
description: >-
  A template for extracting ENVO, NCBITAXON, and FOODON
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  linkml: https://w3id.org/linkml/
  rickbeeloo: http://w3id.org/ontogpt/rickbeeloo

default_prefix: rickbeeloo
default_range: string

imports:
  - linkml:types
  - core

classes:
  EntityContainingDocument:
    tree_root: true
    is_a: NamedEntity
    attributes:
      environments:
        range: Environment
        multivalued: true
        description: >- 
          A semicolon-separated list of environmental terms.
      taxa:
        range: Taxon
        multivalued: true
        description: >- 
          A semicolon-separated list of taxonomic terms of living things.
      foods:
        range: Food
        multivalued: true
        description: >- 
          A semicolon-separated list of foods.

  Environment:
    is_a: NamedEntity
    id_prefixes:
      - ENVO
    annotations:
      annotators: sqlite:obo:envo
      prompt: >- 
        the name of an environment.
         Examples are lake, meadow, waterfall.

  Taxon:
    is_a: NamedEntity
    id_prefixes:
      - NCBITAXON
    annotations:
      annotators: sqlite:obo:ncbitaxon
      prompt: >- 
        the name of a taxonomic name or species.
         Examples are Bacillus subtilis, Bos taurus, blue whale.

  Food:
    is_a: NamedEntity
    id_prefixes:
      - FOODON
    annotations:
      annotators: sqlite:obo:foodon
      prompt: >- 
        the name of a food, beverage, or ingredient.
         Examples are ketchup, milkshake, grape jelly.

OK, so that's a full template, but feel free to adapt to your use case.

rickbeeloo commented 1 year ago

Thanks for the fast reply! Have a deadline tomorrow for which it would be nice to have something annotated with ontoGPT :)

I stored this as rickbeeloo.yaml in the templates folder, then ran gen-pydantic --pydantic_version 2 rickbeeloo.yaml > rickbeeloo.py. There are some warnings but it seems fine. Then I tried to run ontogpt extract -i example.txt -t rickbeeloo which gave me the following error:

INFO:root:API KEY path = C:\Users\rbeel\AppData\Local\ontology-access-kit\ontology-access-kit\ncbi-email-apikey.txt
INFO:root:Email for NCBI API not found.
INFO:root:API KEY path = C:\Users\rbeel\AppData\Local\ontology-access-kit\ontology-access-kit\ncbi-key-apikey.txt
INFO:root:NCBI API key not found. Will use no key.
WARNING:rdflib.term:C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\linkml_runtime\linkml_model\model\schema\types does not look like a valid URI, trying to serialize this will break.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts\ontogpt.exe\__main__.py", line 7, in <module>
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\cli.py", line 316, in extract
    ke = SPIRESEngine(template=template, model=model_name, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 23, in __init__
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 155, in __post_init__
    self.template_class = self._get_template_class(self.template)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 247, in _get_template_class
    mod = importlib.import_module(f"ontogpt.templates.{module_name}")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1776.0_x64__qbz5n2kfra8p0\Lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 936, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1074, in get_code
  File "<frozen importlib._bootstrap_external>", line 1004, in source_to_code
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
SyntaxError: source code string cannot contain null bytes

Quite a cryptic error to me, but I hope you know what causes it.

I attached the zip and py (in case that would help): rickbeeloo.zip

rickbeeloo commented 1 year ago

Aaah, apparently it generated rickbeeloo.py encoded as UTF-16. Opening the .py in an editor and saving it as UTF-8 instead solved it.
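For anyone hitting the same "null bytes" error: a likely cause (an assumption, since the thread does not say which shell was used) is that the `>` redirect in Windows PowerShell wrote the generated file as UTF-16. Instead of re-saving in an editor, the file can be re-encoded with a few lines of Python. This is only a sketch; `reencode_to_utf8` is a hypothetical helper name, not part of OntoGPT or LinkML.

```python
from pathlib import Path


def reencode_to_utf8(path: str, source_encoding: str = "utf-16") -> None:
    """Rewrite a text file in place as UTF-8 (hypothetical helper)."""
    file_path = Path(path)
    # Decode the raw bytes; the "utf-16" codec consumes a leading byte-order mark.
    text = file_path.read_bytes().decode(source_encoding)
    # Write the same characters back as UTF-8, which contains no null bytes
    # for ordinary source code.
    file_path.write_bytes(text.encode("utf-8"))
```

Calling `reencode_to_utf8("rickbeeloo.py")` from the templates folder would then replace the UTF-16 file with a UTF-8 one that Python can import.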

caufieldjh commented 1 year ago

OK great!

rickbeeloo commented 1 year ago

It worked for one text but not for this one: "mouse gut metagenome". Any idea why?

    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\cli.py", line 336, in extract
    results = ke.extract_from_text(text=text, cls=target_class_def, show_prompt=show_prompt)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\spires_engine.py", line 97, in extract_from_text
    extracted_object = self.parse_completion_payload(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\spires_engine.py", line 559, in parse_completion_payload
    return self.ground_annotation_object(raw, cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\spires_engine.py", line 631, in ground_annotation_object
    obj = self.normalize_named_entity(val, slot.range)  # type: ignore
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 352, in normalize_named_entity
    for normalized_id in self.normalize_identifier(obj_id, cls):
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 416, in normalize_identifier
    for obj_id in self.map_identifier(input_id, cls):
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\ontogpt\engines\knowledge_engine.py", line 449, in map_identifier
    for mapping in mapper.sssom_mappings([input_id]):
  File "C:\Users\rbeel\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\oaklib\implementations\translator\translator_implementation.py", line 62, in sssom_mappings
    equiv_identifiers = data.get("equivalent_identifiers", [])
                        ^^^^^^^^
AttributeError: 'str' object has no attribute 'get'

Apparently it happens here in oak.

caufieldjh commented 1 year ago

That bug is something that has been fixed in oaklib but just hasn't made it into the most recent release. If you open translator_implementation.py as specified in the last part of your error stack trace, you can make the fix by changing the value of NODE_NORMALIZER_ENDPOINT as is done here: https://github.com/INCATools/ontology-access-kit/commit/75940bfa883001afb0e4aebb339fd62583c13844

caufieldjh commented 1 year ago

The newest oak release (0.5.21) should contain the fix, too.
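To see whether the local environment already has that release, the installed version can be compared against 0.5.21. This is a rough sketch: it assumes the package is installed under the distribution name `oaklib` and that the version is a plain dotted number; `parse_version` and `has_fix` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as '0.5.21' into (0, 5, 21)."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed: str, fixed_in: str = "0.5.21") -> bool:
    """Return True if the installed version is at least the fixed release."""
    return parse_version(installed) >= parse_version(fixed_in)


try:
    installed = version("oaklib")  # assumed distribution name
    status = "contains the fix" if has_fix(installed) else "predates the fix"
    print(f"oaklib {installed} {status}")
except PackageNotFoundError:
    print("oaklib is not installed in this environment")
except ValueError:
    # Raised for suffixes such as 'rc1' that this simple parser does not handle.
    print("non-numeric version string; compare manually")
```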

rickbeeloo commented 1 year ago

Thanks! That fixed it (I had to downgrade pydantic again as well).

For the input text "This is a mouse gut sample" I get:

input_text: |-
  This is a mouse gut sample

raw_completion_output: |-
  environments: N/A
  taxa: mouse
  foods: gut
  label: mouse gut
prompt: |+
  From the text below, extract the following entities in the following format:

  environments: <A semicolon-separated list of environmental terms.>
  taxa: <A semicolon-separated list of taxonomic terms of living things.>
  foods: <A semicolon-separated list of foods.>
  label: <The label (name) of the named thing>

  Text:
  This is a mouse gut sample

  ===

extracted_object:
  id: cb699559-d389-4b43-915f-1ae19f750bf2
  label: mouse gut
  environments:
    - AUTO:N/A
  taxa:
    - AUTO:mouse
  foods:
    - AUTO:gut
named_entities:
  - id: AUTO:N/A
    label: N/A
  - id: AUTO:mouse
    label: mouse
  - id: AUTO:gut
    label: gut

But we would expect mouse to be linked to "Mus musculus", and thereby to the NCBITAXON ontology entry, but this is not happening. Also, does it see "gut" as a food?

caufieldjh commented 1 year ago

Aha, that's my mistake - the prefix in that case should be NCBITaxon, not NCBITAXON. So if you edit the template's id_prefixes: slot for Taxon to be NCBITaxon instead of NCBITAXON, then you should get NCBITaxon:10088 instead.

As for the rest of it, I actually got "mouse gut" as an environment when running this extraction, so there's some variability. A longer input text would likely provide the context the LLM can use to better categorize the entity.
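The prefix fix suggests that identifier prefixes are compared as exact, case-sensitive strings. The snippet below is not OntoGPT's grounding code; `has_allowed_prefix` is a hypothetical helper that only illustrates why an identifier such as `NCBITaxon:10088` would not match an `id_prefixes` entry written as `NCBITAXON`.

```python
def has_allowed_prefix(curie: str, id_prefixes: list[str]) -> bool:
    """Check whether a CURIE's prefix is in the allowed list (case-sensitive)."""
    # Split 'NCBITaxon:10088' into prefix 'NCBITaxon' and local ID '10088'.
    prefix, separator, _local_id = curie.partition(":")
    # A string without ':' is not a CURIE, so it never matches.
    return bool(separator) and prefix in id_prefixes


print(has_allowed_prefix("NCBITaxon:10088", ["NCBITAXON"]))  # False
print(has_allowed_prefix("NCBITaxon:10088", ["NCBITaxon"]))  # True
```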

rickbeeloo commented 11 months ago

Hey @caufieldjh, it does work fine for those now, thanks! I want to add UBERON and was reading your replies here. I suppose it is as simple as:

classes:
  EntityContainingDocument:
    tree_root: true
    is_a: NamedEntity
    attributes:
      ........
      anatomies:
        range: Anatomy
        multivalued: true 
        description: >-
          A semiclon-separated list of the body structures of living things

and

Anatomy:
    is_a: NamedEntity
    id_prefixes:
       - UBERON
    annotations:
      annotators: sqlite:obo:uberon
      prompt: >-
         the name of an anatomical structure. 
          Examples are blood, arm, haemolymphatic fluid.

When I then run it for "sample from human blood and arm", it gives me:

extracted_object:
  id: a272ba36-66e1-405b-9d15-e9088d563caf
  label: human blood, arm
  taxa:
    - NCBITaxon:9606
  anatomies:
    - AUTO:blood%2C%20arm
named_entities:
  - id: NCBITaxon:9606
    label: human
  - id: AUTO:blood%2C%20arm
    label: blood, arm

I again get the "AUTO" while we have a term for blood in UBERON. So perhaps I messed something up again.

Thanks again for building all this, really excited to run this for half a million metadata fields in our next paper!

rickbeeloo commented 11 months ago

Ooow damn, my browser spell check caught it: "semiclon-separated" instead of "semicolon-separated". That solved it... well, that was a fun day for one letter haha. Thanks anyway :)
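The one-letter fix plausibly matters because the description text is shown to the LLM as the expected output format (see the prompt quoted earlier in the thread), so a garbled delimiter hint can produce a comma-separated answer that is then treated as one entity. A minimal sketch, assuming the raw value is split on semicolons and that ungrounded labels are percent-encoded into `AUTO:` identifiers, which is consistent with the `AUTO:blood%2C%20arm` output above; `split_entities` and `auto_id` are hypothetical names, not OntoGPT functions.

```python
from urllib.parse import quote


def split_entities(raw_value: str) -> list[str]:
    """Split a raw completion value on semicolons and trim whitespace."""
    return [part.strip() for part in raw_value.split(";") if part.strip()]


def auto_id(label: str) -> str:
    """Build a placeholder identifier for a label that could not be grounded."""
    return "AUTO:" + quote(label)


print(split_entities("blood, arm"))  # ['blood, arm']  (one entity)
print(split_entities("blood; arm"))  # ['blood', 'arm']  (two entities)
print(auto_id("blood, arm"))         # AUTO:blood%2C%20arm
```

With a comma-separated answer, the whole string "blood, arm" is looked up as a single label, which has no match in UBERON; with semicolons, "blood" and "arm" can each be looked up separately.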

caufieldjh commented 11 months ago

Fantastic - happy to hear it's working!