monarch-initiative / monarch-app

Monarch Initiative website and API
https://monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License
17 stars, 4 forks

Text annotator: modifiers parsed as entities #564

Closed caufieldjh closed 7 months ago

caufieldjh commented 8 months ago

When using the text annotator on the abstract of this case report: https://pubmed.ncbi.nlm.nih.gov/38130915/ some modifiers, such as "persistent" or "probable", are incorrectly parsed as disease or gene entities.

As an additional example, the phrase "Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2)" looks like this in the output:

  {
    "text": "Severe",
    "tokens": [
      {
        "id": "MONDO:0600009",
        "category": "biolink:Disease",
        "name": "severe hypophosphatasia",
        "full_name": null,
        "deprecated": null,
        "description": "Severe hypophosphatasia is a rare, severe form of hypophosphatasia characterized by infantile rickets without elevated serum alkaline phosphatase (ALP) activity and a wide range of clinical manifestations due to hypomineralization. Individuals often present with these features in infancy or in the perinatal period.",
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [],
        "uri": null
      },
      {
        "id": "MONDO:0001641",
        "category": "biolink:Disease",
        "name": "severe pre-eclampsia",
        "full_name": null,
        "deprecated": null,
        "description": "Preeclampsia with a systolic blood pressure of 160 mmHg or higher, or a diastolic blood pressure of 110 mmHg or higher on two occasions at least 4 hours apart while on bedrest. It is associated with thrombocytopenia (platelets less than 100,000 per microliter), impaired liver function (twice normal elevation of hepatic transaminases; severe, persistent right upper quadrant or epigastric pain), progressive renal insufficiency (serum creatinine greater than 1.1 mg/dL or doubling of baseline in the absence of other renal disease), pulmonary edema, or new-onset cerebral or visual disturbances.",
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [
          "Preeclampsia with severe features",
          "antepartum severe pre-eclampsia",
          "postpartum severe pre-eclampsia",
          "severe pre-eclampsia, with delivery",
          "severe preeclampsia"
        ],
        "uri": null
      },
      {
        "id": "MONDO:0008819",
        "category": "biolink:Disease",
        "name": "arteriosclerosis, severe juvenile",
        "full_name": null,
        "deprecated": null,
        "description": null,
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [
          "arteriosclerosis, severe juvenile"
        ],
        "uri": null
      }
    ],
    "start": 653,
    "end": 659
  },
  {
    "text": "Acute Respiratory Syndrome",
    "tokens": [
      {
        "id": "MONDO:0005091",
        "category": "biolink:Disease",
        "name": "severe acute respiratory syndrome",
        "full_name": null,
        "deprecated": null,
        "description": "A viral respiratory infection caused by the SARS coronavirus. It is transmitted through close person-to-person contact. It is manifested with high fever, headache, dry cough and myalgias. It may progress to pneumonia and cause death.",
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [
          "SARS",
          "SARS coronavirus caused disease or disorder",
          "SARS coronavirus disease or disorder",
          "SARS coronavirus infectious disease",
          "SARS-CoV infection",
          "acute respiratory coronavirus infection"
        ],
        "uri": null
      },
      {
        "id": "MONDO:0006502",
        "category": "biolink:Disease",
        "name": "acute respiratory distress syndrome",
        "full_name": null,
        "deprecated": null,
        "description": "Progressive and life-threatening pulmonary distress in the absence of an underlying pulmonary condition, usually following major trauma or surgery. Cases of neonatal respiratory distress syndrome are not included in this definition.",
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [
          "ALI",
          "ARDS",
          "Stiff lung",
          "acute lung injury",
          "acute respiratory distress syndrome",
          "increased-permeability pulmonary edema",
          "increased-permeability pulmonary oedema",
          "non-cardiogenic pulmonary edema",
          "non-cardiogenic pulmonary oedema",
          "shock lung"
        ],
        "uri": null
      },
      {
        "id": "MONDO:0100130",
        "category": "biolink:Disease",
        "name": "adult acute respiratory distress syndrome",
        "full_name": null,
        "deprecated": null,
        "description": "A very severe form of acute pulmonary failure secondary to capillary permeability impairment. The symptoms include dyspnea, hypotension and multivisceral failure. The disease is characterized by bilateral pulmonary infiltrates and severe hypoxemia due to increased alveolar-capillary permeability. The severity depends on the degree of alveolar epithelial injury, with a mortality rate of 30-50%.",
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [
          "ARDS",
          "adult ARDS",
          "adult RDS",
          "adult acute respiratory distress syndrome",
          "adult respiratory distress syndrome",
          "adult respiratory distress syndrome, ARDS",
          "respiratory distress syndrome, adult"
        ],
        "uri": null
      }
    ],
    "start": 660,
    "end": 686
  },
  {
    "text": "Coronavirus 2",
    "tokens": [
      {
        "id": "MONDO:0100096",
        "category": "biolink:Disease",
        "name": "COVID-19",
        "full_name": null,
        "deprecated": null,
        "description": "A disease caused by infection with severe acute respiratory syndrome coronavirus 2.",
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [
          "2019 novel coronavirus",
          "2019 novel coronavirus infection",
          "2019-nCoV",
          "2019-nCoV infection",
          "SARS-CoV-2",
          "SARS-coronavirus 2",
          "beta-CoV",
          "beta-CoVs",
          "betacoronavirus",
          "coronavirus disease 2019",
          "severe acute respiratory syndrome coronavirus 2",
          "severe acute respiratory syndrome coronavirus 2 infectious disease",
          "β-CoV",
          "β-CoVs",
          "β-coronavirus"
        ],
        "uri": null
      },
      {
        "id": "MONDO:0100163",
        "category": "biolink:Disease",
        "name": "COVID-19–associated multisystem inflammatory syndrome in children",
        "full_name": null,
        "deprecated": null,
        "description": "A inflammatory syndrome in children infected by the SARS-CoV-2 with similarities to Kawasaki disease. Clinical manifestations range from fever and inflammation to myocardial injury, shock, and development of coronary artery aneurysms.",
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [
          "COVID-19 -related paediatric inflammatory multisystem syndrome",
          "COVID-19 -related pediatric inflammatory multisystem syndrome",
          "COVID-19 Kawasaki-like syndrome",
          "COVID-19 associated multisystem inflammatory syndrome in children",
          "MIS-C",
          "PIMS",
          "PIMS-TS",
          "PMIS",
          "SARS-CoV-2 Kawasaki-like syndrome",
          "multisystem inflammatory syndrome in children",
          "multisystem inflammatory syndrome in children associated with COVID-19",
          "multisystem inflammatory syndrome in children associated with coronavirus disease 2019",
          "paediatric inflammatory multisystem syndrome",
          "paediatric inflammatory multisystem syndrome temporally associated with SARS-CoV-2",
          "paediatric inflammatory multisystem syndrome: temporally associated with SARS-CoV-2",
          "paediatric multi-system inflammatory syndrome potentially associated with COVID-19",
          "paediatric multisystem inflammatory syndrome",
          "pediatric inflammatory multisystem syndrome",
          "pediatric inflammatory multisystem syndrome temporally associated with SARS-CoV-2",
          "pediatric inflammatory multisystem syndrome: temporally associated with SARS-CoV-2",
          "pediatric multi-system inflammatory syndrome potentially associated with COVID-19",
          "pediatric multisystem inflammatory syndrome"
        ],
        "uri": null
      },
      {
        "id": "HP:0005396",
        "category": "biolink:PhenotypicFeature",
        "name": "Susceptibility to coronavirus 229e",
        "full_name": null,
        "deprecated": null,
        "description": "Increased susceptibility to coronavirus 229e, as manifested by recurrent episodes of coronavirus 229e.",
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [],
        "uri": null
      }
    ],
    "start": 687,
    "end": 700
  },

That's a tricky one, because "severe" is still a modifier but is also part of the name of the disease. Overall, I wouldn't expect "severe" alone to be an entity; in an ideal world it would be linked to the disease it modifies (and in some cases there may even be a more appropriate entity that way).
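One way to picture "linking the modifier to the disease" is a post-processing pass that merges neighbouring spans whenever their joined text exactly equals the name or a synonym of one of their candidate tokens. The following is only a minimal sketch under that assumption: `merge_adjacent_spans` and `_labels` are hypothetical helpers, not part of monarch-app, and `SPANS` is a trimmed copy of the output above that keeps only the fields the sketch reads.

```python
from typing import Dict, List, Set

# Trimmed copy of the three spans above (only the fields used here).
SPANS: List[Dict] = [
    {"text": "Severe", "start": 653, "end": 659, "tokens": [
        {"id": "MONDO:0600009", "name": "severe hypophosphatasia", "synonym": []},
    ]},
    {"text": "Acute Respiratory Syndrome", "start": 660, "end": 686, "tokens": [
        {"id": "MONDO:0005091", "name": "severe acute respiratory syndrome",
         "synonym": ["SARS"]},
    ]},
    {"text": "Coronavirus 2", "start": 687, "end": 700, "tokens": [
        {"id": "MONDO:0100096", "name": "COVID-19",
         "synonym": ["severe acute respiratory syndrome coronavirus 2"]},
    ]},
]


def _labels(token: Dict) -> Set[str]:
    """Lower-cased name and synonyms of one candidate token."""
    synonyms = token.get("synonym") or []
    return {token["name"].lower(), *(s.lower() for s in synonyms)}


def merge_adjacent_spans(spans: List[Dict]) -> List[Dict]:
    """Hypothetical helper: merge neighbouring spans when the joined text
    exactly matches a label of a candidate token on either span."""
    merged: List[Dict] = []
    for span in spans:
        if merged:
            prev = merged[-1]
            # Assumption: a gap of exactly one character is a single space.
            if span["start"] - prev["end"] == 1:
                joined_text = f'{prev["text"]} {span["text"]}'
                hits = [
                    tok
                    for tok in prev["tokens"] + span["tokens"]
                    if joined_text.lower() in _labels(tok)
                ]
                if hits:
                    # Replace the previous span with the wider, merged one.
                    merged[-1] = {
                        "text": joined_text,
                        "tokens": hits,
                        "start": prev["start"],
                        "end": span["end"],
                    }
                    continue
        merged.append(span)
    return merged


if __name__ == "__main__":
    for span in merge_adjacent_spans(SPANS):
        print(span["start"], span["end"], span["text"],
              [tok["id"] for tok in span["tokens"]])
```

On the trimmed data, "Severe" first merges with "Acute Respiratory Syndrome" via the name of MONDO:0005091, and that wider span then merges with "Coronavirus 2" via a synonym of MONDO:0100096, leaving a single span from 653 to 700. Whether the annotator should resolve the full phrase to COVID-19 this way is a design question for the maintainers; the sketch only shows that the information needed to avoid a stand-alone "Severe" is already present in the candidate tokens.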

madanucd commented 7 months ago

The following PR resolves this issue: recognized entities are now queried for an exact match in Solr. "Persistent" and "probable" are no longer annotated. However, "severe" was identified as a PhenotypicFeature, as shown below:

  {
    "text": "Severe",
    "tokens": [
      {
        "id": "HP:0012828",
        "category": "biolink:PhenotypicFeature",
        "name": "Severe",
        "full_name": null,
        "deprecated": null,
        "description": "Having a high degree of severity. For quantitative traits, a deviation of between four and five standard deviations from the appropriate population mean.",
        "xref": [],
        "provided_by": "phenio_nodes",
        "in_taxon": null,
        "in_taxon_label": null,
        "symbol": null,
        "synonym": [
          "Severe"
        ],
        "uri": null
      }
    ],
    "start": 653,
    "end": 659
  },
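
The effect of exact matching on the spans in this thread can be illustrated in plain Python. This is an in-memory stand-in under stated assumptions, not the Solr query used in the PR: `exact_matches` and `_labels` are hypothetical names, and the candidate lists are trimmed copies of the tokens quoted in this issue.

```python
from typing import Dict, List, Optional, Set


def _labels(token: Dict) -> Set[str]:
    """Case-folded name and synonyms of one candidate token."""
    synonyms = token.get("synonym") or []
    return {token["name"].casefold(), *(s.casefold() for s in synonyms)}


def exact_matches(span: Dict) -> Optional[Dict]:
    """Hypothetical helper: keep only candidates whose name or synonym
    equals the span text (case-insensitive); drop the span if none do."""
    text = span["text"].casefold()
    kept: List[Dict] = [tok for tok in span["tokens"] if text in _labels(tok)]
    if not kept:
        return None
    return {**span, "tokens": kept}


# Candidates from the original report: partial matches only.
PARTIAL_ONLY = {
    "text": "Severe", "start": 653, "end": 659,
    "tokens": [
        {"id": "MONDO:0600009", "name": "severe hypophosphatasia", "synonym": []},
        {"id": "MONDO:0001641", "name": "severe pre-eclampsia",
         "synonym": ["severe preeclampsia"]},
        {"id": "MONDO:0008819", "name": "arteriosclerosis, severe juvenile",
         "synonym": ["arteriosclerosis, severe juvenile"]},
    ],
}

# Assumption for illustration: the HP term is among the candidates.
WITH_HP_TERM = {
    **PARTIAL_ONLY,
    "tokens": PARTIAL_ONLY["tokens"] + [
        {"id": "HP:0012828", "name": "Severe", "synonym": ["Severe"]},
    ],
}

if __name__ == "__main__":
    print(exact_matches(PARTIAL_ONLY))
    kept = exact_matches(WITH_HP_TERM)
    print([tok["id"] for tok in kept["tokens"]])
```

Under exact matching, the disease candidates that merely contain the word "severe" fall away, which is consistent with "persistent" and "probable" no longer being annotated. "Severe" survives because HP:0012828 carries that exact label, so suppressing it would need a further rule, for example treating HPO modifier terms differently from other phenotypic features.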