monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

unable to extract HPO entities #205

Closed by rebeccaito 12 months ago

rebeccaito commented 1 year ago

Many thanks for creating ontogpt. Is it correct that if I use the mendelian_disease.Symptom template, I should be able to extract HPO phenotype terms from text? I attempted this by creating a file phenotagger_example.txt which contains the test example found here:

The clinical features of Angelman syndrome (AS) comprise severe mental retardation, postnatal microcephaly, macrostomia and prognathia, absence of speech, ataxia, and a happy disposition. We report on seven patients who lack most of these features, but presented with obesity, muscular hypotonia and mild mental retardation. Based on the latter findings, the patients were initially suspected of having Prader-Willi syndrome. DNA methylation analysis of SNRPN and D15S63, however, revealed an AS pattern, ie the maternal band was faint or absent. Cytogenetic studies and microsatellite analysis demonstrated apparently normal chromosomes 15 of biparental inheritance. We conclude that these patients have an imprinting defect and a previously unrecognised form of AS. The mild phenotype may be explained by an incomplete imprinting defect or by cellular mosaicism.

I then called ontogpt with:

ontogpt extract -t mendelian_disease.Symptom -i phenotagger_example.txt --show-prompt

And got the following output:

input_text:
  The clinical features of Angelman syndrome (AS) comprise severe mental retardation, postnatal microcephaly, macrostomia and prognathia, absence of speech, ataxia, and a happy disposition. We report on seven patients who lack most of these features, but presented with obesity, muscular hypotonia and mild mental retardation. Based on the latter findings, the patients were initially suspected of having Prader-Willi syndrome. DNA methylation analysis of SNRPN and D15S63, however, revealed an AS pattern, ie the maternal band was faint or absent. Cytogenetic studies and microsatellite analysis demonstrated apparently normal chromosomes 15 of biparental inheritance. We conclude that these patients have an imprinting defect and a previously unrecognised form of AS. The mild phenotype may be explained by an incomplete imprinting defect or by cellular mosaicism.
raw_completion_output: |-
  characteristic: severe mental retardation, postnatal microcephaly, macrostomia, prognathia, absence of speech, ataxia, happy disposition
  affects: patients (the seven patients)
  severity: severe (for mental retardation)
  onset_of_symptom: postnatal
  label: Angelman syndrome
prompt: |+
  From the text below, extract the following entities in the following format:

  characteristic: <the value for characteristic>
  affects: <the value for affects>
  severity: <the value for severity>
  onset_of_symptom: <the value for onset_of_symptom>
  label: <The label (name) of the named thing>

  Text:
  The clinical features of Angelman syndrome (AS) comprise severe mental retardation, postnatal microcephaly, macrostomia and prognathia, absence of speech, ataxia, and a happy disposition. We report on seven patients who lack most of these features, but presented with obesity, muscular hypotonia and mild mental retardation. Based on the latter findings, the patients were initially suspected of having Prader-Willi syndrome. DNA methylation analysis of SNRPN and D15S63, however, revealed an AS pattern, ie the maternal band was faint or absent. Cytogenetic studies and microsatellite analysis demonstrated apparently normal chromosomes 15 of biparental inheritance. We conclude that these patients have an imprinting defect and a previously unrecognised form of AS. The mild phenotype may be explained by an incomplete imprinting defect or by cellular mosaicism.

  ===

extracted_object:
  id: ca264bce-1003-4899-af85-e00eb5d8fca2
  label: Angelman syndrome
  characteristic: severe mental retardation, postnatal microcephaly, macrostomia,
    prognathia, absence of speech, ataxia, happy disposition
  affects: patients (the seven patients)
  severity: severe (for mental retardation)
  onset_of_symptom: AUTO:postnatal
named_entities:
  - id: AUTO:postnatal
    label: postnatal
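
One way to read the result above: every field in the completion was treated as a single value, so the whole comma-separated list of features ended up in one characteristic string on one object. The sketch below is only an illustration of that effect, not OntoGPT's actual parsing code; the function name parse_single_valued is made up for this example.

```python
# Hypothetical illustration; this is NOT how OntoGPT parses completions.
# It only shows what happens when every field is treated as single-valued.
RAW_COMPLETION = """\
characteristic: severe mental retardation, postnatal microcephaly, macrostomia, prognathia, absence of speech, ataxia, happy disposition
affects: patients (the seven patients)
severity: severe (for mental retardation)
onset_of_symptom: postnatal
label: Angelman syndrome
"""


def parse_single_valued(raw: str) -> dict[str, str]:
    """Parse 'key: value' lines, keeping each value as one plain string."""
    fields: dict[str, str] = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


symptom = parse_single_valued(RAW_COMPLETION)
# All seven features sit in one string on one object.
print(symptom["characteristic"])
```

Nothing in this flow splits the features into separate entities, which is consistent with getting one extracted object rather than one per phenotype.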

I was expecting output that would create an extracted object for each phenotype term. For example:

extracted_objects:
1. 
  id: HP:0001249
  label: Intellectual disability
2. 
  id: HP:0000252
  label: microcephaly
3. 
  id: HP:0000154
  label: wide mouth
4. ...

Am I using ontogpt correctly? Is there a better way to use this tool to extract phenotype terms from text? Thanks.

rebeccaito commented 12 months ago

It looks like using -t mendelian_disease.MendelianDisease instead of -t mendelian_disease.Symptom extracts HPO terms as expected. Are there plans to create more detailed documentation for different use cases and for the provided templates? Many thanks.
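
For anyone scripting this, here is a minimal sketch of assembling that invocation from Python. It uses only the flags that appear in this thread (-t, -i, --show-prompt); the helper name build_extract_command is hypothetical, and actually running the command assumes ontogpt is installed and configured with whatever model credentials it needs.

```python
import shlex


def build_extract_command(template: str, input_path: str,
                          show_prompt: bool = False) -> list[str]:
    """Assemble an `ontogpt extract` argument list (hypothetical helper)."""
    cmd = ["ontogpt", "extract", "-t", template, "-i", input_path]
    if show_prompt:
        cmd.append("--show-prompt")
    return cmd


cmd = build_extract_command(
    "mendelian_disease.MendelianDisease",  # the class that worked in this thread
    "phenotagger_example.txt",
    show_prompt=True,
)
print(shlex.join(cmd))

# To actually run it (assumes ontogpt is on PATH and configured):
# import subprocess
# subprocess.run(cmd, check=True)
```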

caufieldjh commented 12 months ago

Hi @rebeccaito, and thanks for your patience! You're absolutely correct that using the MendelianDisease class from this template should deliver results more closely aligned with your expectations, and this reveals part of how OntoGPT parses the template. In this case, specifying the Symptom class alone means it's only using this class:

  Symptom:
    is_a: NamedEntity
    id_prefixes:
      - HP
    annotations:
      annotators: sqlite:obo:hp, sqlite:obo:mondo
    attributes:
      characteristic:
      affects:
      severity:
      onset_of_symptom:
        range: Onset

That's probably fine for modeling data, but it's missing the examples and the multivalued property from the parent MendelianDisease class. As a result, OntoGPT doesn't generate a prompt explicitly stating that you're looking for symptoms, and it doesn't expect to find more than one of them.
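
To make the difference concrete, here is a rough sketch of how a prompt like the one in the original report could be assembled from a class's attributes. This is an illustration under stated assumptions, not OntoGPT's implementation: the Attribute dataclass and build_prompt function are hypothetical, the "semicolon-separated list of" wording for multivalued attributes is an assumption, and the symptoms attribute with its description is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str
    description: str = ""      # optional hint shown inside the <...> placeholder
    multivalued: bool = False  # True if more than one value is expected


def build_prompt(attributes: list[Attribute], text: str) -> str:
    """Build an extraction prompt from attribute definitions (hypothetical)."""
    lines = [
        "From the text below, extract the following entities in the following format:",
        "",
    ]
    for attr in attributes:
        hint = attr.description or f"the value for {attr.name}"
        if attr.multivalued:
            # Assumed wording; it signals that several values are wanted.
            hint = f"semicolon-separated list of {hint}"
        lines.append(f"{attr.name}: <{hint}>")
    lines += ["", "Text:", text, "", "==="]
    return "\n".join(lines)


# Mirrors the Symptom class: bare attributes, nothing marked multivalued.
symptom_attrs = [
    Attribute("characteristic"),
    Attribute("affects"),
    Attribute("severity"),
    Attribute("onset_of_symptom"),
    Attribute("label", description="The label (name) of the named thing"),
]

# Hypothetical disease-level attribute; name and description are made up.
disease_attrs = [
    Attribute("symptoms", description="symptoms of the disease", multivalued=True),
]

print(build_prompt(symptom_attrs, "..."))
print(build_prompt(disease_attrs, "..."))
```

With only the bare Symptom attributes, the prompt never mentions symptoms as a list, so the model has no cue to return one entry per phenotype.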

We certainly do need more detailed documentation about this and about modifying/creating templates. Expect it soon!