monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

Parse numbered lists correctly #133

Closed: cmungall closed this 5 months ago

cmungall commented 1 year ago

gpt-3.5-turbo-16k has a 16k context window, which is good for enrichment: there is much less need to truncate.

In theory this can be dropped in: --gpt-3.5-turbo-16k

But I have noticed that the newer models seem less inclined to follow instructions and give semicolon-separated lists; they tend to give numbered lists instead. Have you seen this, @caufieldjh?

e.g.

  Summary: The common function among these genes is the regulation of various cellular processes and signaling pathways.

  Hypothesis:  The enriched terms suggest that these genes are involved in molecular interactions and signaling networks, which are essential for numerous cellular processes and regulation of various pathways. The overlapping functions may point towards their involvement in common regulatory mechanisms and networks that contribute to cellular homeostasis and development.
term_strings:
  - |-
    1. protein binding
    2. enzyme binding
    3. dna-binding transcription factor activity
    4. rna polymerase ii-specific transcription factor activity
    5. receptor binding activity
    6. atp binding activity
    7. cytoskeletal protein binding activity
    8. growth factor activity
    9. carbohydrate binding activity
    10. heme binding activity

    mechanism: these genes play a role in cellular processes such as protein-protein interactions
  - enzymatic activities
  - transcriptional regulation
  - receptor signaling
  - and binding of various molecules. they are involved in multiple signaling pathways
  - including growth factor signaling
  - dna transcription
  - cellular metabolism

It should be easy to modify the hacky payload parser to accept numbered lists.
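A rough sketch of what that could look like, not the actual ontogpt parser: `split_list_payload` and `_MARKER` are hypothetical names, and the assumption is that the parser receives the raw text of one list-valued field. If any line starts with a list marker, only the marked lines are kept; otherwise it falls back to splitting on semicolons.

```python
import re

# Matches a leading list marker such as "1.", "2)", "-", "*" or "•".
_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def split_list_payload(text: str) -> list[str]:
    """Split a raw LLM list payload into items (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    marked = [ln for ln in lines if _MARKER.match(ln)]
    if marked:
        # Numbered or bulleted list: keep only marked lines, drop the markers.
        return [_MARKER.sub("", ln).strip() for ln in marked]
    # Fall back to the semicolon-delimited format the prompt asks for.
    return [part.strip() for part in text.split(";") if part.strip()]
```

Unmarked lines are dropped on purpose when a marked list is present, so a stray `mechanism: ...` line like the one above would not end up as a term. Whether that is the right call depends on how the surrounding field parsing works.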

Or maybe we just bite the bullet and use the JSON-structured function call responses.
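For comparison, a minimal sketch of the consuming side if the model returned a JSON object instead of free text. This only covers parsing a JSON string with the standard library; how the function call itself would be requested is left out. `parse_structured_payload` is a hypothetical name, and the `term_strings` key is assumed to mirror the field in the output above.

```python
import json


def parse_structured_payload(raw: str) -> dict:
    """Parse a JSON object payload and normalise its list field (hypothetical)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    terms = data.get("term_strings", [])
    if isinstance(terms, str):
        # Tolerate a single string where a list was expected.
        terms = [terms]
    # Strip whitespace and drop empty entries.
    data["term_strings"] = [str(t).strip() for t in terms if str(t).strip()]
    return data
```

With a structured response the list boundaries come from the JSON itself, so there is no delimiter guessing at all.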

caufieldjh commented 10 months ago

Changing issue title as using newer models has been addressed. Still need to adjust parser to accept lists in formats other than semicolon-delimited (e.g., if the LLM doesn't follow directions)