monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/

Remove redundant calls to normalize cached objects #402

Open caufieldjh opened 1 week ago

caufieldjh commented 1 week ago

Following comments from @leokim-l on Slack, and my own suspicion that the oaklib normalization functions are being called more often than necessary, I tried to find out how much impact connection speed/availability has on entity grounding performance.

$ echo "nucleus" > nucleus.txt
$ poetry shell
$ time ontogpt -vvv extract -i nucleus.txt -m gpt-4o -t go_simple
...
real    0m7.334s
user    0m4.230s
sys     0m1.488s

So that's uncached; running the same command again should use the cache:

$ time ontogpt -vvv extract -i nucleus.txt -m gpt-4o -t go_simple
...
real    0m4.085s
user    0m3.859s
sys     0m1.307s

OK, so it's faster - but I'm on a pretty decent wired connection here. After manually limiting my ethernet speed to 10 Mbps:

real    0m5.963s
user    0m5.502s
sys     0m1.617s

That could be coincidental, but after restoring full speed it's back in the 4s range, so it looks like a network call is still being made somewhere. But if I have no connection at all, no errors are raised; the grounding process just doesn't happen (so if I wanted to map to some other namespace, that process wouldn't work).
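What I have in mind is roughly the following (a sketch only, not the current knowledge_engine code; normalize_identifier is a stand-in for whatever oaklib-backed call actually runs, and an in-memory lru_cache is just one way to memoize it):

from functools import lru_cache
import logging

logger = logging.getLogger(__name__)

def normalize_identifier(curie: str) -> str:
    # Stand-in for the oaklib-backed normalization call the engine makes now;
    # assume it may go over the network.
    return curie

@lru_cache(maxsize=None)
def normalize_cached(curie: str) -> str:
    # Normalize each CURIE at most once per run; repeat calls reuse the result.
    try:
        return normalize_identifier(curie)
    except Exception as exc:  # e.g. no connection at all
        logger.warning("Could not normalize %s (%s); keeping the original ID", curie, exc)
        return curie

That way only the first normalization of a given CURIE can hit the network, and an offline failure at least gets logged instead of silently dropping the grounding.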

Full debug output looks like this:

$ time ontogpt -vvv extract -i nucleus.txt -m gpt-4o -t go_simple
WARNING:ontogpt.clients:llm_gpt4all module not found. GPT4All support will be disabled.
WARNING:ontogpt.engines.knowledge_engine:GPT4All client not available. GPT4All support will be disabled.
INFO:root:Logger root set to level 10
INFO:root:Found model: MODEL_GPT_4o, provided by OpenAI.
INFO:root:Input file: nucleus.txt
INFO:root:Input text: nucleus

INFO:ontogpt.io.template_loader:Loading schema for go_simple
INFO:root:Importing core as core from source /home/harry/ontogpt/src/ontogpt/templates/go_simple.yaml; base_dir=/home/harry/ontogpt/src/ontogpt/templates
INFO:root:Importing linkml:types as /home/harry/ontogpt/.venv/lib/python3.10/site-packages/linkml_runtime/linkml_model/model/schema/types from source /home/harry/ontogpt/src/ontogpt/templates/go_simple.yaml; base_dir=None
INFO:ontogpt.io.template_loader:Getting class for template go_simple
INFO:root:Using template OntologyTermSet
INFO:root:Using mappers (currently hardcoded)
INFO:root:Model source is openai
INFO:root:API KEY path = /home/harry/.config/ontology-access-kit/openai-apikey.txt
DEBUG:httpx:load_ssl_context verify=True cert=None trust_env=True http2=False
DEBUG:httpx:load_verify_locations cafile='/home/harry/ontogpt/.venv/lib/python3.10/site-packages/certifi/cacert.pem'
INFO:root:Setting up OpenAI client API Key
INFO:root:API KEY path = /home/harry/.config/ontology-access-kit/openai-apikey.txt
INFO:root:Will use this client: <class 'ontogpt.clients.openai_client.OpenAIClient'>
INFO:ontogpt.clients.openai_client:Complete: engine=gpt-4o, prompt[210]=From the text below, extract the following entities in the following format:

terms: <A semicolon-se...
INFO:ontogpt.clients.openai_client:Caching OpenAI responses to /home/harry/ontogpt/.openai_cache.db
INFO:ontogpt.clients.openai_client:Using cached payload for prompt: From the text below, extract the following entities in the following format:\n\nte...
INFO:root:RAW TEXT: terms: GO:0005634
label: nucleus
INFO:root:PARSING LINE: terms: GO:0005634
DEBUG:root:  FIELD: terms
DEBUG:root:SLOT: terms INL: False VALS: ['GO:0005634']
INFO:root:PARSING LINE: label: nucleus
DEBUG:root:  FIELD: label
DEBUG:root:SLOT: label INL: None VALS: ['nucleus']
DEBUG:root:RAW: {'terms': ['GO:0005634'], 'label': 'nucleus'}
DEBUG:root:Grounding annotation object {'terms': ['GO:0005634'], 'label': 'nucleus', 'id': 'c0a98f49-5cb6-4d9c-8535-85b8918ac042'}
DEBUG:root:FIELD: terms SLOT: terms
DEBUG:root:   VAL: GO:0005634
INFO:ontogpt.engines.knowledge_engine:GROUNDING GO:0005634 using OntologyTerm
INFO:ontogpt.engines.knowledge_engine:Grounding GO:0005634 to GO:0005634; next step is to normalize
INFO:ontogpt.engines.knowledge_engine:Normalized GO:0005634 with GO:0005634 to GO:0005634
DEBUG:root:FIELD: label SLOT: label
DEBUG:root:   VAL: nucleus
DEBUG:root:FIELD: id SLOT: id
DEBUG:root:   VAL: c0a98f49-5cb6-4d9c-8535-85b8918ac042
DEBUG:root:Creating object from dict {'terms': ['GO:0005634'], 'label': 'nucleus', 'id': 'c0a98f49-5cb6-4d9c-8535-85b8918ac042'}
INFO:root:{'terms': ['GO:0005634'], 'label': 'nucleus', 'id': 'c0a98f49-5cb6-4d9c-8535-85b8918ac042'}
---
input_text: |
  nucleus
raw_completion_output: |-
  terms: GO:0005634
  label: nucleus
prompt: |+
  From the text below, extract the following entities in the following format:

  terms: <A semicolon-separated list of any Gene Ontology terms.>
  label: <The label (name) of the named thing>

  Text:
  nucleus

  ===

extracted_object:
  id: c0a98f49-5cb6-4d9c-8535-85b8918ac042
  label: nucleus
  terms:
    - GO:0005634
named_entities:
  - id: GO:0005634
    label: GO:0005634

The issue is cases like this, where grounding already yields a valid GO CURIE and the normalization call just maps it to itself:

INFO:ontogpt.engines.knowledge_engine:GROUNDING GO:0005634 using OntologyTerm
INFO:ontogpt.engines.knowledge_engine:Grounding GO:0005634 to GO:0005634; next step is to normalize
INFO:ontogpt.engines.knowledge_engine:Normalized GO:0005634 with GO:0005634 to GO:0005634
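In other words, grounding already produced a well-formed GO CURIE, so the follow-up normalization is a second lookup that changes nothing. A guard along these lines would skip that call (again just a sketch: ground stands in for the existing grounding step, and it reuses normalize_cached from the sketch above):

def ground(value: str) -> str:
    # Stand-in for the existing grounding/annotation step.
    return value

def ground_and_normalize(value: str, target_prefixes: set[str]) -> str:
    grounded = ground(value)
    prefix = grounded.split(":", 1)[0]
    if grounded == value and prefix in target_prefixes:
        # Already a CURIE in the target namespace; nothing left to normalize.
        return grounded
    return normalize_cached(grounded)  # memoized lookup from the sketch above

With the example in the log, ground_and_normalize("GO:0005634", {"GO"}) would return immediately without any normalization lookup. Whether the check should be on the prefix or on an exact match against the cached object is up for discussion, but either way the redundant call goes away.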