monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

issue with parsing while using `pubmed-extract` #206

Open rebeccaito opened 1 year ago

rebeccaito commented 1 year ago

`pubmed-extract` appears to create a separate prompt for each character in the title/abstract when given a PubMed ID. For example, running the following command for PMID:37666660 ("Biallelic truncating variants in VGLL2 cause syngnathia in humans"):

ontogpt pubmed-extract -t mendelian_disease.Symptom 37666660

The following output is generated (truncated here because it's quite long):


input_text: T
raw_completion_output: |-
  characteristic: unknown
  affects: the named thing
  severity: unknown
  onset_of_symptom: unknown
  label: T
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  T

  ===

extracted_object:
  id: 0e9680b9-b01f-40c1-89b6-3bf9b5422514
  label: T
  characteristic: unknown
  affects: the named thing
  severity: unknown
  onset_of_symptom: unknown
input_text: i
raw_completion_output: |-
  characteristic: 
  affects: 
  severity: 
  onset_of_symptom: 
  label: i
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  i

  ===

extracted_object:
  id: b651c9ac-f9d7-4bf7-a127-60af4473e404
  label: i
input_text: t
raw_completion_output: |-
  characteristic: None
  affects: None
  severity: None
  onset_of_symptom: None
  label: None
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  t

  ===

extracted_object:
  id: 24576c17-1492-4af3-a373-9db116da60f5
  label: None
  characteristic: None
  affects: None
  severity: None
  onset_of_symptom: None
input_text: l
raw_completion_output: |-
  characteristic: 
  affects: 
  severity: 
  onset_of_symptom: 
  label:
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  l

  ===

extracted_object:
  id: 23c09c09-bbe6-4bab-a01d-3065f64beb48
input_text: e
raw_completion_output: |-
  characteristic: N/A
  affects: N/A
  severity: N/A
  onset_of_symptom: N/A
  label: e
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  e

  ===

extracted_object:
  id: d99bd59f-5d69-470e-a1bf-2df8064775d6
  label: e
  characteristic: N/A
  affects: N/A
  severity: N/A
  onset_of_symptom: N/A
input_text: ':'
raw_completion_output: |-
  characteristic: None
  affects: None
  severity: None
  onset_of_symptom: None
  label: None
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  :

  ===

extracted_object:
  id: c2586c4b-0d49-403a-9a80-9dfd7e55a1af
  label: None
  characteristic: None
  affects: None
  severity: None
  onset_of_symptom: None
input_text: ' '
raw_completion_output: |-
  characteristic: 
  affects: 
  severity: 
  onset_of_symptom: 
  label:
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:

  ===

extracted_object:
  id: 0405b469-ce58-4e58-9fbd-42fdddf91a32
input_text: B
raw_completion_output: 'label: B'
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  B

  ===

extracted_object:
  id: 41e68f16-56bb-4e1a-aadc-36c213aee383
  label: B
input_text: i
raw_completion_output: |-
  characteristic: 
  affects: 
  severity: 
  onset_of_symptom: 
  label: i
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  i

  ===

extracted_object:
  id: 06a46f88-6240-4632-b26b-570291d958b0
  label: i
input_text: a
raw_completion_output: |-
  characteristic: 
  affects: 
  severity: 
  onset_of_symptom: 
  label: a
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  a

  ===

extracted_object:
  id: bc94e4af-3911-4160-bbbf-f3263e817d7d
  label: a
input_text: l
raw_completion_output: |-
  characteristic: 
  affects: 
  severity: 
  onset_of_symptom: 
  label:
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  l

  ===

extracted_object:
  id: ca8a2845-3a2c-4eab-a9e1-e240bff174d5
input_text: l
raw_completion_output: |-
  characteristic: 
  affects: 
  severity: 
  onset_of_symptom: 
  label:
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  l

  ===

extracted_object:
  id: 3ee1fc11-0749-42fb-888d-8a72c6c370c9
input_text: e
raw_completion_output: |-
  characteristic: N/A
  affects: N/A
  severity: N/A
  onset_of_symptom: N/A
  label: e
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  e

  ===

extracted_object:
  id: 34808115-ff0a-40e1-82d1-732e54b26b78
  label: e
  characteristic: N/A
  affects: N/A
  severity: N/A
  onset_of_symptom: N/A
input_text: l
raw_completion_output: |-
  characteristic: 
  affects: 
  severity: 
  onset_of_symptom: 
  label:
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  l

  ===

extracted_object:
  id: 12de0380-1f41-4ed4-8321-ae121dd144a0
input_text: i
raw_completion_output: |-
  characteristic: 
  affects: 
  severity: 
  onset_of_symptom: 
  label: i
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  i

  ===

extracted_object:
  id: f636f898-e01b-4e9a-b4ab-2dea0b91fbd6
  label: i
input_text: c
raw_completion_output: |-
  characteristic: Unknown
  affects: Unknown
  severity: Unknown
  onset_of_symptom: Unknown
  label: Unknown
prompt: |+
  Split the following piece of text into fields in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  c

  ===

As you can see, the Text fields spell "Title: Biallelic..." when you stitch them back together, so the correct PubMed record is being retrieved, but the text appears to be split into individual characters before prompt generation.
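One common way to get exactly this pattern in Python is iterating over a single string where a list of strings is expected. The sketch below only illustrates that failure mode; it is not taken from the ontogpt source, and `make_prompts`, `as_text_list`, and the shortened prompt format are hypothetical names made up for the example. The sample `document` string assumes the input is the title prefixed with "Title: ".

```python
from typing import Iterable, List, Union


def make_prompts(texts: Iterable[str]) -> List[str]:
    """Build one prompt per item in `texts` (hypothetical stand-in for prompt generation)."""
    return [f"Text:\n{t}\n\n===\n" for t in texts]


def as_text_list(texts: Union[str, Iterable[str]]) -> List[str]:
    """Wrap a bare string in a list so it is treated as one document, not as characters."""
    if isinstance(texts, str):
        return [texts]
    return list(texts)


# Assumed input: a single document string for the requested PMID.
document = "Title: Biallelic truncating variants in VGLL2 cause syngnathia in humans"

# A str is itself iterable, so this yields one prompt per character.
buggy = make_prompts(document)

# Normalizing the input first yields a single prompt for the whole text.
fixed = make_prompts(as_text_list(document))

print(len(buggy), len(fixed))
```

If something like this is the cause, the first "buggy" prompts would carry `T`, `i`, `t`, `l`, `e`, `:` as their text, matching the output above.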

Is there a preferred way to extract information from PubMed articles? Thanks.

caufieldjh commented 1 year ago

Hi @rebeccaito - try the `pubmed-annotate` command instead. Its inputs are handled a bit differently because it can accept a full set of PubMed search results, but the input can also be a PMID. This should work, for example:

$ ontogpt pubmed-annotate -t mendelian_disease.MendelianDisease --limit 1 37666660

It will perform the extraction on this publication's title+abstract alone.
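If you want to run this for several PMIDs from a script, a small wrapper along these lines may help. It is only a sketch: it uses just the flags shown in the command above (`-t`, `--limit`), assumes `ontogpt` is on your PATH and writes its results to stdout, and the helper names `build_annotate_command` and `run_annotate` are made up for this example.

```python
import subprocess
from typing import List


def build_annotate_command(pmid: str, template: str, limit: int = 1) -> List[str]:
    """Assemble the argument list for the pubmed-annotate call shown above."""
    return ["ontogpt", "pubmed-annotate", "-t", template, "--limit", str(limit), pmid]


def run_annotate(pmid: str, template: str, limit: int = 1) -> str:
    """Run the command and return its captured stdout (assumes ontogpt is installed)."""
    result = subprocess.run(
        build_annotate_command(pmid, template, limit),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command exits non-zero
    )
    return result.stdout


# Building the command does not require ontogpt; only run_annotate() invokes it.
cmd = build_annotate_command("37666660", "mendelian_disease.MendelianDisease")
print(" ".join(cmd))
```

Passing the arguments as a list rather than one shell string avoids quoting problems if a search query containing spaces is used in place of a PMID.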