monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

HTTP Error When Downloading Ontology Database #346

Closed: yingxuepanyaleedu closed this issue 6 months ago

yingxuepanyaleedu commented 6 months ago

Hi @caufieldjh, I encountered a bug with my custom schema. I am trying to use SNOMEDCT as my annotator, and when I run the extract command with my template and input text, I get the following error. I have checked that my BioPortal API key is set correctly, and the URL for the SNOMEDCT prefix seems correct too. I have no idea how to resolve this error. Can you please provide some guidance? I would really appreciate any help.

WARNING:rdflib.term:c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\linkml_runtime\linkml_model\model\schema\types does not look like a valid URI, trying to serialize this will break.

Downloading SNOMEDCT.db.gz: 0.00B [00:00, ?B/s]

Traceback (most recent call last):
  File "C:\Users\yingx\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\yingx\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\yingx\Capstone2023\individual_capstone\Scripts\ontogpt.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\ontogpt\cli.py", line 348, in extract
    results = ke.extract_from_text(
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\ontogpt\engines\spires_engine.py", line 95, in extract_from_text
    extracted_object = self.parse_completion_payload(
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\ontogpt\engines\spires_engine.py", line 558, in parse_completion_payload
    return self.ground_annotation_object(raw, cls)
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\ontogpt\engines\spires_engine.py", line 630, in ground_annotation_object
    obj = self.normalize_named_entity(val, slot.range)  # type: ignore
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\ontogpt\engines\knowledge_engine.py", line 341, in normalize_named_entity
    for normalized_id in self.normalize_identifier(obj_id, cls):
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\ontogpt\engines\knowledge_engine.py", line 403, in normalize_identifier
    if self.is_valid_identifier(input_id, cls):
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\ontogpt\engines\knowledge_engine.py", line 390, in is_valid_identifier
    valid_ids = [pv.text for pv in pvs]
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\ontogpt\engines\knowledge_engine.py", line 390, in <listcomp>
    valid_ids = [pv.text for pv in pvs]
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\oaklib\utilities\subsets\value_set_expander.py", line 110, in expand_value_set
    oi = self._get_handle(rq.source_ontology)
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\oaklib\utilities\subsets\value_set_expander.py", line 175, in _get_handle
    return get_adapter(shorthand)
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\oaklib\selector.py", line 152, in get_adapter
    return res.implementation_class(res, **kwargs)
  File "<string>", line 26, in __init__
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\oaklib\implementations\sqldb\sql_implementation.py", line 329, in __post_init__
    db_path = OAKLIB_MODULE.ensure_gunzip(url=url, autoclean=False)
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\pystow\impl.py", line 297, in ensure_gunzip
    path = self.ensure(
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\pystow\impl.py", line 171, in ensure
    utils.download(
  File "c:\users\yingx\capstone2023\individual_capstone\lib\site-packages\pystow\utils.py", line 360, in download
    urlretrieve(url, path, reporthook=t.update_to, **kwargs)  # noqa:S310
  File "C:\Users\yingx\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 239, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\yingx\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\yingx\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 523, in open
    response = meth(req, response)
  File "C:\Users\yingx\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 632, in http_response
    response = self.parent.error(
  File "C:\Users\yingx\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 561, in error
    return self._call_chain(*args)
  File "C:\Users\yingx\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
    result = func(*args)
  File "C:\Users\yingx\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

My schema looks like this:

id: http://w3id.org/ontogpt/condition
name: condition
title: condition FHIR Template
description: >-
  A FHIR-compliant template for conditions mentioned in a clinical note
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  linkml: https://w3id.org/linkml/
  condition: http://w3id.org/ontogpt/condition/
  SNOMEDCT: https://purl.bioontology.org/ontology/SNOMEDCT

default_prefix: UNKNOWN

imports:
  - linkml:types
  - core

classes:
  Condition:
    tree_root: true
    is_a: NamedEntity
    attributes:
      label:
        description: The concise name of the condition, problem, or diagnosis.
      clinical_status:
        description: The clinical status of the condition.
        range: ConditionClinicalStatus
        ifabsent: string("unknown")
        required: true
      severity:
        description: Subjective severity of condition.
        range: ConditionDiagnosisSeverity
      code:
        description: The condition, problem, or diagnosis found in this note.
        range: ConditionProblemDiagnosis
        required: true

  ConditionProblemDiagnosis:
    is_a: NamedEntity
    id_prefixes: 
      - SNOMEDCT
    annotations:
      annotators: bioportal:SNOMEDCT
    slot_usage:
      id:
        values_from: 
          - ConditionProblemDiagnosisIdentifier

enums:
  ConditionClinicalStatus:
    permissible_values:
      active:
        description: The subject is currently experiencing the condition or situation, there is evidence of the condition or situation, or considered to be a significant risk.
      recurrence:
        description: The subject is experiencing a reoccurrence or repeating of a previously resolved condition or situation, e.g. urinary tract infection, food insecurity.
      relapse:
        description: The subject is experiencing a return of a condition or situation after a period of improvement or remission, e.g. relapse of cancer, alcoholism.
      inactive:
        description: The subject is no longer experiencing the condition or situation and there is no longer evidence or appreciable risk of the condition or situation.
      remission:
        description: The subject is not presently experiencing the condition or situation, but there is a risk of the condition or situation returning.
      resolved:
        description: The subject is not presently experiencing the condition or situation and there is a negligible perceived risk of the condition or situation returning.
      unknown:
        description: The authoring/source system does not know which of the status values currently applies for this condition.

  ConditionDiagnosisSeverity:
    permissible_values:
      severe:
        meaning: SNOMEDCT:24484000
      moderate:
        meaning: SNOMEDCT:6736007
      mild:
        meaning: SNOMEDCT:255604002

  ConditionProblemDiagnosisIdentifier:
    reachable_from:
      source_ontology: bioportal:SNOMEDCT
      source_nodes: 
        - SNOMEDCT:404684003 ## clinical finding

And here is one of my input texts:

Sample Type / Medical Specialty: Discharge Summary
Sample Name: Death Summary - 1
Description: Death summary of patient with advanced non-small cell lung carcinoma with left malignant pleural effusion status post chest tube insertion status post chemical pleurodesis.
(Medical Transcription Sample Report)
The patient pronounced expired at 01:40 hours.
DISCHARGE DIAGNOSES:
1. Advanced non-small cell lung carcinoma with left malignant pleural effusion status post chest tube insertion status post chemical pleurodesis.
2. Respiratory failure secondary to above.
3. Likely postobstructive pneumonia.
4. Gastrointestinal bleed.
5. Thrombocytopenia.
6. Acute renal failure.
7. Hyponatremia.
8. Hypercalcemia, likely secondary to paraneoplastic syndrome from the non-small cell lung CA, possible metastases to the bones.
9. Leukemoid reaction, likely secondary to malignancy.
10. Elevated liver function tests.
HOSPITAL COURSE: This is a 53-year-old African American male patient of Dr. X who was admitted through the emergency room. He has been having some right hip pain and cough. The patient had a CT scan of the chest, which revealed a left pleural effusion, extensive mediastinal mass, left hilar adenopathy, causing complete obstruction of the left lower lobe and the lingula and the left pulmonary vein, and the multiple nodules on the right side of his chest. These were all consistent with metastatic disease. He was thus also a suspicion for osseous metastatic disease involving the right scapula with a left large pleural effusion. The patient had severe shortness of breath, chest pain, a left-sided chest tube was inserted, and pleural effusion was positive for malignant cells. The history of right hip pain could be secondary to metastatic disease. The patient underwent bronchoscopy, which is positive for non-small cell lung CA. The patient was seen by various consultants. The patient underwent respiratory failure, requiring intubation, mechanical ventilatory support. He was extubated, but had to be re-intubated because of respiratory failure. Had a long discussion with the patient's wife and other family members. The patient was seen by Dr. Y. The patient was not in a condition to undergo any kind of chemotherapy, being on the ventilator. The patient progressively got deteriorated. The patient's family requested for DNR, withdrawal of the life support. The patient was extubated, and he was pronounced expired on 08/21/08 at 01:40 hours.
I appreciate all consultants' input.
Keywords:
discharge summary, dnr, pronounced expired, extubated, death summary, lung carcinoma, pleural effusion,

caufieldjh commented 6 months ago

Hi @yingxuepanyaleedu - I was able to reproduce this error. It looks like grounding happens part of the way (e.g., with your input, I saw that Elevated liver function tests gets grounded to SNOMEDCT:75540009, which just corresponds to the elevated part and not the rest of that phrase) but then uses an incorrect normalizer. Looking into it.

caufieldjh commented 6 months ago

Here's a workaround for now: comment out or remove the slot_usage section from ConditionProblemDiagnosis. The problem is caused by OntoGPT trying to expand the range of potential values defined in ConditionProblemDiagnosisIdentifier, but right now it doesn't know how to do that for ontologies from BioPortal. This workaround likely isn't ideal, since I see you want only SNOMEDCT terms under clinical finding, and I suspect there's still another way to get this result (it may involve a bugfix in the oaklib package).
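
The same workaround can be applied programmatically instead of by hand. This is a minimal sketch, assuming the schema has already been loaded into a plain Python dict (for example, with a YAML parser). The function name drop_bioportal_value_sets is hypothetical and is not part of OntoGPT or oaklib.

```python
from copy import deepcopy


def drop_bioportal_value_sets(schema: dict) -> dict:
    """Return a copy of the schema with values_from references removed
    when they point at enums that are reachable_from a bioportal: source.

    Hypothetical helper for illustration; not an OntoGPT API.
    """
    result = deepcopy(schema)
    enums = result.get("enums") or {}

    # Collect enums whose reachable_from source is a BioPortal ontology.
    blocked = set()
    for name, enum in enums.items():
        reachable = (enum or {}).get("reachable_from") or {}
        if str(reachable.get("source_ontology", "")).startswith("bioportal:"):
            blocked.add(name)

    for cls in (result.get("classes") or {}).values():
        if not cls or "slot_usage" not in cls:
            continue
        slot_usage = cls["slot_usage"] or {}
        for slot_name in list(slot_usage):
            slot = slot_usage[slot_name] or {}
            kept = [v for v in slot.get("values_from", []) if v not in blocked]
            if kept:
                slot["values_from"] = kept
            else:
                slot.pop("values_from", None)
            # Drop slot entries that are now empty.
            if not slot:
                del slot_usage[slot_name]
        # Drop slot_usage itself if nothing is left in it.
        if not slot_usage:
            del cls["slot_usage"]
    return result
```

The enum definition itself is left in place, so the reference can be restored once value set expansion works for BioPortal sources.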

You may also want to capture multiple conditions for a note - with the current schema, the extracted object will only include one. Modified example schema in the next comment:

caufieldjh commented 6 months ago

id: http://w3id.org/ontogpt/condition
name: condition
title: condition FHIR Template
description: >-
  A FHIR-compliant template for conditions mentioned in a clinical note
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  linkml: https://w3id.org/linkml/
  condition: http://w3id.org/ontogpt/condition/
  SNOMEDCT: http://purl.bioontology.org/ontology/SNOMEDCT

default_prefix: UNKNOWN

imports:
  - linkml:types
  - core

classes:
  ConditionSet:
    tree_root: true
    attributes:
      conditions:
        range: Condition
        multivalued: true
        inlined_as_list: true

  Condition:
    #is_a: NamedEntity
    attributes:
      label:
        description: The concise name of the condition, problem, or diagnosis.
      clinical_status:
        description: The clinical status of the condition.
        range: ConditionClinicalStatus
        ifabsent: string("unknown")  
      severity:
        description: Subjective severity of condition.
        range: ConditionDiagnosisSeverity
        ifabsent: string("unknown")
      code:
        description: The condition, problem, or diagnosis found in this note.
        range: ConditionProblemDiagnosis
        ifabsent: string("unknown")

  ConditionProblemDiagnosis:
    is_a: NamedEntity
    id_prefixes:
      - SNOMEDCT
    annotations:
      annotators: bioportal:SNOMEDCT
#    slot_usage:
#      id:
#        values_from: 
#          - ConditionProblemDiagnosisIdentifier

enums:
  ConditionClinicalStatus:
    permissible_values:
      active:
        description: The subject is currently experiencing the condition or situation, there is evidence of the condition or situation, or considered to be a significant risk.
      recurrence:
        description: The subject is experiencing a reoccurrence or repeating of a previously resolved condition or situation, e.g. urinary tract infection, food insecurity.
      relapse:
        description: The subject is experiencing a return of a condition or situation after a period of improvement or remission, e.g. relapse of cancer, alcoholism.
      inactive:
        description: The subject is no longer experiencing the condition or situation and there is no longer evidence or appreciable risk of the condition or situation.
      remission:
        description: The subject is not presently experiencing the condition or situation, but there is a risk of the condition or situation returning.
      resolved:
        description: The subject is not presently experiencing the condition or situation and there is a negligible perceived risk of the condition or situation returning.
      unknown:
        description: The authoring/source system does not know which of the status values currently applies for this condition.

  ConditionDiagnosisSeverity:
    permissible_values:
      severe:
        meaning: SNOMEDCT:24484000
      moderate:
        meaning: SNOMEDCT:6736007
      mild:
        meaning: SNOMEDCT:255604002

  ConditionProblemDiagnosisIdentifier:
    reachable_from:
      source_ontology: bioportal:SNOMEDCT
      source_nodes:
        - SNOMEDCT:404684003 ## clinical finding

caufieldjh commented 6 months ago

With your input in discharge1.txt and this command:

ontogpt -vvv extract -i discharge1.txt -t condition -m MODEL_GPT_4_0125_PREVIEW

I get:

extracted_object:
  conditions:
    - label: Left malignant pleural effusion
      clinical_status: unknown
      severity: severe
      code: SNOMEDCT:7771000
    - label: Respiratory Failure
      clinical_status: unknown
      code: SNOMEDCT:409622000
    - label: Postobstructive Pneumonia
      clinical_status: unknown
      code: SNOMEDCT:371072008
    - label: Gastrointestinal bleed
      clinical_status: unknown
      code: AUTO:Gastrointestinal%20bleed
    - label: Thrombocytopenia
      clinical_status: unknown
      code: SNOMEDCT:302215000
    - label: Acute Renal Failure
      clinical_status: unknown
      code: AUTO:Acute%20Renal%20Failure
    - label: Hyponatremia
      clinical_status: unknown
      code: SNOMEDCT:89627008
    - label: Non-small cell lung CA
      clinical_status: active
      code: SNOMEDCT:264885008
    - label: Leukemoid Reaction
      clinical_status: unknown
      code: SNOMEDCT:56478004
    - label: Elevated Liver Function Tests
      clinical_status: unknown
      code: SNOMEDCT:75540009
named_entities:
  - id: SNOMEDCT:7771000
    label: Left malignant pleural effusion
  - id: SNOMEDCT:409622000
    label: Respiratory failure secondary to above
  - id: SNOMEDCT:371072008
    label: Likely postobstructive pneumonia
  - id: AUTO:Gastrointestinal%20bleed
    label: Gastrointestinal bleed
  - id: SNOMEDCT:302215000
    label: Thrombocytopenia
  - id: AUTO:Acute%20Renal%20Failure
    label: Acute Renal Failure
  - id: SNOMEDCT:89627008
    label: Hyponatremia
  - id: SNOMEDCT:264885008
    label: Non-small cell lung CA, possible metastases to the bones
  - id: SNOMEDCT:56478004
    label: Leukemoid reaction, likely secondary to malignancy
  - id: SNOMEDCT:75540009
    label: Elevated liver function tests

Some of the code assignments, like SNOMEDCT:7771000, are not quite right, and the clinical statuses may not quite align with what you're looking for, but the extraction is more comprehensive. Great use of ifabsent in the schema, by the way!
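
To review a result like the one above, it can help to separate conditions that received a SNOMEDCT code from those that fell back to an AUTO: identifier, which appears to mark labels that were not grounded. This is a rough sketch, assuming the extracted_object has been loaded into Python as a dict. The function name split_by_grounding is hypothetical and is not part of OntoGPT.

```python
def split_by_grounding(conditions: list, prefix: str = "SNOMEDCT:") -> tuple:
    """Split condition dicts into (grounded, ungrounded) lists.

    A condition counts as grounded when its code starts with the given
    prefix; everything else (e.g. AUTO: codes or a missing code) is
    treated as ungrounded. Hypothetical helper for illustration.
    """
    grounded, ungrounded = [], []
    for condition in conditions:
        code = str(condition.get("code", ""))
        if code.startswith(prefix):
            grounded.append(condition)
        else:
            ungrounded.append(condition)
    return grounded, ungrounded


# A few entries copied from the output above.
conditions = [
    {"label": "Thrombocytopenia", "code": "SNOMEDCT:302215000"},
    {"label": "Gastrointestinal bleed", "code": "AUTO:Gastrointestinal%20bleed"},
    {"label": "Acute Renal Failure", "code": "AUTO:Acute%20Renal%20Failure"},
]
grounded, ungrounded = split_by_grounding(conditions)
print([c["label"] for c in ungrounded])
```

A grounded code can still be wrong, as with SNOMEDCT:7771000, so this only flags the entries that need a lookup from scratch.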

caufieldjh commented 6 months ago

I've narrowed this down to a specific bug and will close this issue in favor of a new one.