monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

`named_entities` in `output.txt` contains all entities from previous documents when run on a directory #351

Closed serenalotreck closed 5 months ago

serenalotreck commented 8 months ago

Related to the change introduced in #304.

For each new YAML output document appended to the output.txt file, the extracted_object item is correct (it only contains information from the current input doc), but the named_entities list carries over everything from the previous documents, so it accumulates entities that aren't in the input doc in question.

EDIT: Expected behavior: the named_entities item should only contain entities from the current doc.
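To make the symptom concrete, here is a minimal sketch of the kind of pattern that would produce it. This is not OntoGPT's actual code: the class and method names are hypothetical, and the real cause may differ. It only contrasts a list that lives as long as the extractor with one that is reset for each document.

```python
class AccumulatingExtractor:
    """Hypothetical extractor that keeps one entity list for its whole lifetime."""

    def __init__(self):
        self.named_entities = []

    def extract(self, entities):
        # Bug pattern: the list is never cleared, so every call adds to
        # whatever earlier documents already put there.
        self.named_entities.extend(entities)
        return {"named_entities": list(self.named_entities)}


class ResettingExtractor(AccumulatingExtractor):
    """Same hypothetical extractor, but starting fresh for every document."""

    def extract(self, entities):
        self.named_entities = []  # drop entities left over from earlier documents
        return super().extract(entities)


buggy = AccumulatingExtractor()
buggy.extract(["AUTO:SIPK"])
second_buggy = buggy.extract(["AUTO:NtPat1"])

fixed = ResettingExtractor()
fixed.extract(["AUTO:SIPK"])
second_fixed = fixed.extract(["AUTO:NtPat1"])
```

With the accumulating version, the second document's named_entities still contains AUTO:SIPK from the first; with the resetting version it contains only AUTO:NtPat1, which is the expected behavior described above.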

A full example:

---
input_text: In tobacco, two mitogen-activated protein (MAP) kinases, designated salicylic
  acid (SA)-induced protein kinase (SIPK) and wounding-induced protein kinase (WIPK)
  are activated in a disease resistance-specific manner following pathogen infection
  or elicitor treatment. To investigate whether nitric oxide (NO), SA, ethylene, or
  jasmonic acid (JA) are involved in this phenomenon, the ability of these defense
  signals to activate these kinases was assessed. Both NO and SA activated SIPK; however,
  they did not activate WIPK. Additional analyses with transgenic NahG tobacco revealed
  that SA is required for the NO-mediated induction of SIPK. Neither JA nor ethylene
  activated SIPK or WIPK. Thus, SIPK may function downstream of SA in the NO signaling
  pathway for defense responses, while the signals responsible for resistance-associated
  WIPK activation have yet to be determined.
raw_completion_output: |-
  genes: MAPK; SIPK; WIPK; NahG
  proteins: salicylic acid-induced protein kinase; wounding-induced protein kinase
  molecules: nitric oxide; salicylic acid; ethylene; jasmonic acid
  organisms: tobacco
  gene_gene_interactions: 
  gene_protein_interactions: 
  gene_organism_relationships: 
  protein_protein_interactions: 
  protein_organism_relationships: 
  gene_molecule_interactions: 
  protein_molecule_interactions: 
  label: mitogen-activated protein (MAP) kinases
prompt: |+
  From the text below, extract the following entities in the following format:

  genes: <A semicolon-separated list of genes.>
  proteins: <A semicolon-separated list of proteins.>
  molecules: <A semicolon-separated list of molecules.>
  organisms: <A semicolon-separated list of taxonomic terms of living things.>
  gene_gene_interactions: <A semicolon-separated list of gene-gene interactions.>
  gene_protein_interactions: <A semicolon-separated list of gene-protein interactions.>
  gene_organism_relationships: <A semicolon-separated list of gene-organism relationships.>
  protein_protein_interactions: <A semicolon-separated list of protein-protein interactions.>
  protein_organism_relationships: <A semicolon-separated list of protein-organism relationships.>
  gene_molecule_interactions: <A semicolon-separated list of gene-molecule interactions.>
  protein_molecule_interactions: <A semicolon-separated list of protein-molecule interactions.>
  label: <The label (name) of the named thing>

  Text:
  In tobacco, two mitogen-activated protein (MAP) kinases, designated salicylic acid (SA)-induced protein kinase (SIPK) and wounding-induced protein kinase (WIPK) are activated in a disease resistance-specific manner following pathogen infection or elicitor treatment. To investigate whether nitric oxide (NO), SA, ethylene, or jasmonic acid (JA) are involved in this phenomenon, the ability of these defense signals to activate these kinases was assessed. Both NO and SA activated SIPK; however, they did not activate WIPK. Additional analyses with transgenic NahG tobacco revealed that SA is required for the NO-mediated induction of SIPK. Neither JA nor ethylene activated SIPK or WIPK. Thus, SIPK may function downstream of SA in the NO signaling pathway for defense responses, while the signals responsible for resistance-associated WIPK activation have yet to be determined.

  ===

extracted_object:
  id: 6a86d066-3c07-4b2a-ae25-a1d62a587dda
  label: mitogen-activated protein (MAP) kinases
  genes:
    - GO:0004707
    - AUTO:SIPK
    - AUTO:WIPK
    - AUTO:NahG
  proteins:
    - AUTO:salicylic%20acid-induced%20protein%20kinase
    - AUTO:wounding-induced%20protein%20kinase
  molecules:
    - CHEBI:16480
    - CHEBI:16914
    - CHEBI:18153
    - CHEBI:18292
  organisms:
    - NCBITaxon:4097
named_entities:
  - id: GO:0004707
    label: MAPK
  - id: AUTO:SIPK
    label: SIPK
  - id: AUTO:WIPK
    label: WIPK
  - id: AUTO:NahG
    label: NahG
  - id: AUTO:salicylic%20acid-induced%20protein%20kinase
    label: salicylic acid-induced protein kinase
  - id: AUTO:wounding-induced%20protein%20kinase
    label: wounding-induced protein kinase
  - id: CHEBI:16480
    label: nitric oxide
  - id: CHEBI:16914
    label: salicylic acid
  - id: CHEBI:18153
    label: ethylene
  - id: CHEBI:18292
    label: jasmonic acid
  - id: NCBITaxon:4097
    label: tobacco
---
input_text: Recent evidence suggests that oxidized lipid-derived molecules play significant
  roles in inducible plant defence responses against microbial pathogens, either by
  directly deterring parasite multiplication, or as signals involved in the induction
  of sets of defence genes. The synthesis of these oxylipins was hypothesized to be
  initiated by the phospholipase A2-mediated release of unsaturated fatty acids from
  membrane lipids. Here, we demonstrate that, in tobacco leaves reacting hypersensitively
  to tobacco mosaic virus, a strong increase in soluble phospholipase A2 (PLA2) activity
  occurs at the onset of necrotic lesion appearance. This rapid PLA2 activation occurred
  before the accumulation of 12-oxophytodienoic and jasmonic acids, two fatty acid-derived
  defence signals. Three PLA2 isoforms were separated and the most active enzyme was
  partially purified, its N-terminal sequence displaying similarity with patatin,
  the major storage protein in potato tubers. Three related tobacco patatin-like cDNAs,
  called NtPat1, NtPat2 and NtPat3, were cloned, with NtPat2 encoding the PLA2 isolated
  from infected leaves. RT-PCR experiments showed a rapid transcriptional activation
  of the three NtPat genes in virus-infected leaves, preceding the increase in PLA2
  activity. Recombinant NtPat1 and NtPat3 enzymes were active in an assay using labelled
  bacterial membranes, and also displayed high bona fide PLA2 activity on phosphatidylcholine
  substrate. These results point to a possible new role of patatin-like phospholipases
  in inducible plant defence responses. The induction kinetics together with the enzymatic
  activity data indicate that the NtPat proteins may provide precursors for oxylipin
  synthesis during the hypersensitive response to pathogens.
raw_completion_output: |-
  genes: NtPat1; NtPat2; NtPat3
  proteins: phospholipase A2 (PLA2); patatin
  molecules: 12-oxophytodienoic acid; jasmonic acid
  organisms: tobacco; tobacco mosaic virus
  gene_gene_interactions: 
  gene_protein_interactions: NtPat2 encodes the PLA2 isolated from infected leaves
  gene_organism_relationships: rapid transcriptional activation of NtPat genes in virus-infected leaves
  protein_protein_interactions: 
  protein_organism_relationships: 
  gene_molecule_interactions: 
  protein_molecule_interactions: 
  label: oxidized lipid-derived molecules
prompt: |+
  From the text below, extract the following entities in the following format:

  gene: <the value for gene>
  organism: <the value for organism>

  Text:
  rapid transcriptional activation of NtPat genes in virus-infected leaves

  ===

extracted_object:
  id: 8ea1b738-89ed-4b2b-b03d-92df6792a2c7
  label: oxidized lipid-derived molecules
  genes:
    - AUTO:NtPat1
    - AUTO:NtPat2
    - AUTO:NtPat3
  proteins:
    - PR:000012798
    - AUTO:patatin
  molecules:
    - CHEBI:15560
    - CHEBI:18292
  organisms:
    - NCBITaxon:4097
    - NCBITaxon:12242
  gene_protein_interactions:
    - gene: AUTO:NtPat2
      protein: PR:000012798
  gene_organism_relationships:
    - gene: AUTO:NtPat
      organism: AUTO:virus-infected%20leaves
named_entities:
  - id: GO:0004707
    label: MAPK
  - id: AUTO:SIPK
    label: SIPK
  - id: AUTO:WIPK
    label: WIPK
  - id: AUTO:NahG
    label: NahG
  - id: AUTO:salicylic%20acid-induced%20protein%20kinase
    label: salicylic acid-induced protein kinase
  - id: AUTO:wounding-induced%20protein%20kinase
    label: wounding-induced protein kinase
  - id: CHEBI:16480
    label: nitric oxide
  - id: CHEBI:16914
    label: salicylic acid
  - id: CHEBI:18153
    label: ethylene
  - id: CHEBI:18292
    label: jasmonic acid
  - id: NCBITaxon:4097
    label: tobacco
  - id: AUTO:NtPat1
    label: NtPat1
  - id: AUTO:NtPat2
    label: NtPat2
  - id: AUTO:NtPat3
    label: NtPat3
  - id: PR:000012798
    label: phospholipase A2 (PLA2)
  - id: AUTO:patatin
    label: patatin
  - id: CHEBI:15560
    label: 12-oxophytodienoic acid
  - id: NCBITaxon:12242
    label: tobacco mosaic virus
  - id: AUTO:NtPat
    label: NtPat
  - id: AUTO:virus-infected%20leaves
    label: virus-infected leaves
---
input_text: We conducted a study of the patterns and dynamics of oxidized fatty acid
  derivatives (oxylipins) in potato leaves infected with the late-blight pathogen
  Phytophthora infestans. Two 18-carbon divinyl ether fatty acids, colneleic acid
  and colnelenic acid, accumulated during disease development. To date, there are
  no reports that such compounds have been detected in higher plants. The divinyl
  ether fatty acids accumulate more rapidly in potato cultivar Matilda (a cultivar
  with increased resistance to late blight) than in cultivar Bintje, a susceptible
  cultivar. Colnelenic acid reached levels of up to approximately 24 nmol (7 microgram)
  per g fresh weight of tissue in infected leaves. By contrast, levels of members
  of the jasmonic acid family did not change significantly during pathogenesis. The
  divinyl ethers also accumulated during the incompatible interaction of tobacco with
  tobacco mosaic virus. Colneleic and colnelenic acids were found to be inhibitory
  to P. infestans, suggesting a function in plant defense for divinyl ethers, which
  are unstable compounds rarely encountered in biological systems.
raw_completion_output: |-
  genes: N/A
  proteins: N/A
  molecules: oxylipins; colneleic acid; colnelenic acid; jasmonic acid
  organisms: Phytophthora infestans; tobacco mosaic virus
  gene_gene_interactions: N/A
  gene_protein_interactions: N/A
  gene_organism_relationships: N/A
  protein_protein_interactions: N/A
  protein_organism_relationships: N/A
  gene_molecule_interactions: N/A
  protein_molecule_interactions: N/A
  label: divinyl ether fatty acids
prompt: |+
  Split the following piece of text into fields in the following format:

  protein: <the name of the protein.>
  molecule: <the name of the molecule.>

  Text:
  N/A

  ===

extracted_object:
  id: a5add351-be1d-47b3-84ca-4c35cbf80c31
  label: divinyl ether fatty acids
  genes:
    - AUTO:N/A
  proteins:
    - AUTO:N/A
  molecules:
    - CHEBI:61121
    - CHEBI:60956
    - CHEBI:60959
    - CHEBI:18292
  organisms:
    - NCBITaxon:4787
    - NCBITaxon:12242
  gene_gene_interactions:
    - gene1: AUTO:N/A
      gene2: AUTO:N/A
  gene_protein_interactions:
    - gene: AUTO:N/A
      protein: AUTO:N/A
  gene_organism_relationships:
    - gene: AUTO:N/A
      organism: AUTO:N/A
  protein_protein_interactions:
    - protein1: AUTO:N/A
      protein2: AUTO:N/A
  protein_organism_relationships:
    - gene: AUTO:N/A
      organism: AUTO:N/A
  gene_molecule_interactions:
    - gene: AUTO:N/A
      molecule: AUTO:N/A
  protein_molecule_interactions:
    - protein: AUTO:N/A
      molecule: AUTO:N/A
named_entities:
  - id: GO:0004707
    label: MAPK
  - id: AUTO:SIPK
    label: SIPK
  - id: AUTO:WIPK
    label: WIPK
  - id: AUTO:NahG
    label: NahG
  - id: AUTO:salicylic%20acid-induced%20protein%20kinase
    label: salicylic acid-induced protein kinase
  - id: AUTO:wounding-induced%20protein%20kinase
    label: wounding-induced protein kinase
  - id: CHEBI:16480
    label: nitric oxide
  - id: CHEBI:16914
    label: salicylic acid
  - id: CHEBI:18153
    label: ethylene
  - id: CHEBI:18292
    label: jasmonic acid
  - id: NCBITaxon:4097
    label: tobacco
  - id: AUTO:NtPat1
    label: NtPat1
  - id: AUTO:NtPat2
    label: NtPat2
  - id: AUTO:NtPat3
    label: NtPat3
  - id: PR:000012798
    label: phospholipase A2 (PLA2)
  - id: AUTO:patatin
    label: patatin
  - id: CHEBI:15560
    label: 12-oxophytodienoic acid
  - id: NCBITaxon:12242
    label: tobacco mosaic virus
  - id: AUTO:NtPat
    label: NtPat
  - id: AUTO:virus-infected%20leaves
    label: virus-infected leaves
  - id: AUTO:N/A
    label: N/A
  - id: CHEBI:61121
    label: oxylipins
  - id: CHEBI:60956
    label: colneleic acid
  - id: CHEBI:60959
    label: colnelenic acid
  - id: NCBITaxon:4787
    label: Phytophthora infestans
---
input_text: The plant-signaling molecules salicylic acid (SA) and jasmonic acid (JA)
  play an important role in induced disease resistance pathways. Cross-talk between
  SA- and JA-dependent pathways can result in inhibition of JA-mediated defense responses.
  We investigated possible antagonistic interactions between the SA-dependent systemic
  acquired resistance (SAR) pathway, which is induced upon pathogen infection, and
  the JA-dependent induced systemic resistance (ISR) pathway, which is triggered by
  nonpathogenic Pseudomonas rhizobacteria. In Arabidopsis thaliana, SAR and ISR are
  effective against a broad spectrum of pathogens, including the foliar pathogen Pseudomonas
  syringae pv. tomato (Pst). Simultaneous activation of SAR and ISR resulted in an
  additive effect on the level of induced protection against Pst. In Arabidopsis genotypes
  that are blocked in either SAR or ISR, this additive effect was not evident. Moreover,
  induction of ISR did not affect the expression of the SAR marker gene PR-1 in plants
  expressing SAR. Together, these observations demonstrate that the SAR and the ISR
  pathway are compatible and that there is no significant cross-talk between these
  pathways. SAR and ISR both require the key regulatory protein NPR1. Plants expressing
  both types of induced resistance did not show elevated Npr1 transcript levels, indicating
  that the constitutive level of NPR1 is sufficient to facilitate simultaneous expression
  of SAR and ISR. These results suggest that the enhanced level of protection is established
  through parallel activation of complementary, NPR1-dependent defense responses that
  are both active against Pst. Therefore, combining SAR and ISR provides an attractive
  tool for the improvement of disease control.
raw_completion_output: |-
  genes: NPR1; PR-1
  proteins: NPR1
  molecules: salicylic acid (SA); jasmonic acid (JA)
  organisms: Arabidopsis thaliana; Pseudomonas rhizobacteria; Pseudomonas syringae pv. tomato (Pst)
  gene_gene_interactions: 
  gene_protein_interactions: NPR1-PR-1
  gene_organism_relationships: 
  protein_protein_interactions: 
  protein_organism_relationships: 
  gene_molecule_interactions: 
  protein_molecule_interactions:
  label: salicylic acid; jasmonic acid; systemic acquired resistance; induced systemic resistance; NPR1; Pseudomonas syringae pv. tomato; PR-1; Arabidopsis thaliana
prompt: |+
  Split the following piece of text into fields in the following format:

  gene: <the name of the gene.>
  protein: <the name of the protein.>

  Text:
  NPR1-PR-1

  ===

extracted_object:
  id: 30dce43d-87c4-401a-b0b2-fe6c8d8092dd
  label: salicylic acid; jasmonic acid; systemic acquired resistance; induced systemic
    resistance; NPR1; Pseudomonas syringae pv. tomato; PR-1; Arabidopsis thaliana
  genes:
    - AUTO:NPR1
    - AUTO:PR-1
  proteins:
    - PR:000011377
  molecules:
    - CHEBI:35962
    - CHEBI:18292
  organisms:
    - NCBITaxon:3702
    - AUTO:Pseudomonas%20rhizobacteria
    - NCBITaxon:323
  gene_protein_interactions:
    - gene: AUTO:NPR1
      protein: AUTO:PR-1
named_entities:
  - id: GO:0004707
    label: MAPK
  - id: AUTO:SIPK
    label: SIPK
  - id: AUTO:WIPK
    label: WIPK
  - id: AUTO:NahG
    label: NahG
  - id: AUTO:salicylic%20acid-induced%20protein%20kinase
    label: salicylic acid-induced protein kinase
  - id: AUTO:wounding-induced%20protein%20kinase
    label: wounding-induced protein kinase
  - id: CHEBI:16480
    label: nitric oxide
  - id: CHEBI:16914
    label: salicylic acid
  - id: CHEBI:18153
    label: ethylene
  - id: CHEBI:18292
    label: jasmonic acid
  - id: NCBITaxon:4097
    label: tobacco
  - id: AUTO:NtPat1
    label: NtPat1
  - id: AUTO:NtPat2
    label: NtPat2
  - id: AUTO:NtPat3
    label: NtPat3
  - id: PR:000012798
    label: phospholipase A2 (PLA2)
  - id: AUTO:patatin
    label: patatin
  - id: CHEBI:15560
    label: 12-oxophytodienoic acid
  - id: NCBITaxon:12242
    label: tobacco mosaic virus
  - id: AUTO:NtPat
    label: NtPat
  - id: AUTO:virus-infected%20leaves
    label: virus-infected leaves
  - id: AUTO:N/A
    label: N/A
  - id: CHEBI:61121
    label: oxylipins
  - id: CHEBI:60956
    label: colneleic acid
  - id: CHEBI:60959
    label: colnelenic acid
  - id: NCBITaxon:4787
    label: Phytophthora infestans
  - id: AUTO:NPR1
    label: NPR1
  - id: AUTO:PR-1
    label: PR-1
  - id: PR:000011377
    label: NPR1
  - id: CHEBI:35962
    label: salicylic acid (SA)
  - id: NCBITaxon:3702
    label: Arabidopsis thaliana
  - id: AUTO:Pseudomonas%20rhizobacteria
    label: Pseudomonas rhizobacteria
  - id: NCBITaxon:323
    label: Pseudomonas syringae pv. tomato (Pst)
---
input_text: 'The phytoalexin-deficient Arabidopsis mutant pad3-1, which is affected
  in the production of the indole-type phytoalexin camalexin, has previously been
  shown not to display altered susceptibility to either the bacterium Pseudomonas
  syringae (Glazebrook & Ausubel 1994; Proc. Natl. Acad. Sci. USA, 91: 8955-8959)
  or the biotrophic fungi Peronospora parasitica (Glazebrook et al. 1997; Genetics,
  146: 381-392) and Erysiphe orontii (Reuber et al. 1998; Plant J. 16: 473-485). We
  now show that this mutant is markedly more susceptible than its wild-type parental
  line to infection by the necrotrophic fungus Alternaria brassicicola, but not to
  Botrytis cinerea. A strong camalexin response was elicited in wild-type plants inoculated
  with either Alternaria brassicicola or Botrytis cinerea, whereas no camalexin could
  be detected in pad3-1 challenged with these fungi. Hence, PAD3 appears to be a key
  determinant in resistance to at least A. brassicicola. The induction of salicylate-dependent
  and jasmonate/ethylene-dependent defense genes was not reduced in Alternaria-challenged
  pad3-1 plants compared to similarly treated wild-type plants. Camalexin production
  could not be triggered by exogenous application of either salicylate, ethylene or
  jasmonate and was not, or not strongly, reduced in mutants with defects in perception
  of these defense-related signal molecules. Camalexin-production appears to be controlled
  by a pathway that exhibits little cross-talk with salicylate-, ethylene- and jasmonate-dependent
  signalling events.'
raw_completion_output: |-
  genes: pad3-1; Pseudomonas syringae; Peronospora parasitica; Erysiphe orontii; Alternaria brassicicola; Botrytis cinerea

  proteins: PAD3

  molecules: camalexin; salicylate; ethylene; jasmonate

  organisms: Arabidopsis; Pseudomonas syringae; Peronospora parasitica; Erysiphe orontii; Alternaria brassicicola; Botrytis cinerea

  gene_gene_interactions: 

  gene_protein_interactions: 

  gene_organism_relationships: 

  protein_protein_interactions: 

  protein_organism_relationships: 

  gene_molecule_interactions: 

  protein_molecule_interactions: 

  label: phytoalexin-deficient Arabidopsis mutant
prompt: |+
  From the text below, extract the following entities in the following format:

  genes: <A semicolon-separated list of genes.>
  proteins: <A semicolon-separated list of proteins.>
  molecules: <A semicolon-separated list of molecules.>
  organisms: <A semicolon-separated list of taxonomic terms of living things.>
  gene_gene_interactions: <A semicolon-separated list of gene-gene interactions.>
  gene_protein_interactions: <A semicolon-separated list of gene-protein interactions.>
  gene_organism_relationships: <A semicolon-separated list of gene-organism relationships.>
  protein_protein_interactions: <A semicolon-separated list of protein-protein interactions.>
  protein_organism_relationships: <A semicolon-separated list of protein-organism relationships.>
  gene_molecule_interactions: <A semicolon-separated list of gene-molecule interactions.>
  protein_molecule_interactions: <A semicolon-separated list of protein-molecule interactions.>
  label: <The label (name) of the named thing>

  Text:
  The phytoalexin-deficient Arabidopsis mutant pad3-1, which is affected in the production of the indole-type phytoalexin camalexin, has previously been shown not to display altered susceptibility to either the bacterium Pseudomonas syringae (Glazebrook & Ausubel 1994; Proc. Natl. Acad. Sci. USA, 91: 8955-8959) or the biotrophic fungi Peronospora parasitica (Glazebrook et al. 1997; Genetics, 146: 381-392) and Erysiphe orontii (Reuber et al. 1998; Plant J. 16: 473-485). We now show that this mutant is markedly more susceptible than its wild-type parental line to infection by the necrotrophic fungus Alternaria brassicicola, but not to Botrytis cinerea. A strong camalexin response was elicited in wild-type plants inoculated with either Alternaria brassicicola or Botrytis cinerea, whereas no camalexin could be detected in pad3-1 challenged with these fungi. Hence, PAD3 appears to be a key determinant in resistance to at least A. brassicicola. The induction of salicylate-dependent and jasmonate/ethylene-dependent defense genes was not reduced in Alternaria-challenged pad3-1 plants compared to similarly treated wild-type plants. Camalexin production could not be triggered by exogenous application of either salicylate, ethylene or jasmonate and was not, or not strongly, reduced in mutants with defects in perception of these defense-related signal molecules. Camalexin-production appears to be controlled by a pathway that exhibits little cross-talk with salicylate-, ethylene- and jasmonate-dependent signalling events.

  ===

extracted_object:
  id: 57280cfa-7fba-4016-9e6a-8682b107702d
  label: phytoalexin-deficient Arabidopsis mutant
  genes:
    - AUTO:pad3-1
    - AUTO:Pseudomonas%20syringae
    - AUTO:Peronospora%20parasitica
    - AUTO:Erysiphe%20orontii
    - AUTO:Alternaria%20brassicicola
    - AUTO:Botrytis%20cinerea
  proteins:
    - PR:000012221
  molecules:
    - CHEBI:22990
    - CHEBI:30762
    - CHEBI:18153
    - CHEBI:58431
  organisms:
    - NCBITaxon:3701
    - NCBITaxon:317
    - NCBITaxon:123356
    - NCBITaxon:62715
    - NCBITaxon:29001
    - NCBITaxon:40559
named_entities:
  - id: GO:0004707
    label: MAPK
  - id: AUTO:SIPK
    label: SIPK
  - id: AUTO:WIPK
    label: WIPK
  - id: AUTO:NahG
    label: NahG
  - id: AUTO:salicylic%20acid-induced%20protein%20kinase
    label: salicylic acid-induced protein kinase
  - id: AUTO:wounding-induced%20protein%20kinase
    label: wounding-induced protein kinase
  - id: CHEBI:16480
    label: nitric oxide
  - id: CHEBI:16914
    label: salicylic acid
  - id: CHEBI:18153
    label: ethylene
  - id: CHEBI:18292
    label: jasmonic acid
  - id: NCBITaxon:4097
    label: tobacco
  - id: AUTO:NtPat1
    label: NtPat1
  - id: AUTO:NtPat2
    label: NtPat2
  - id: AUTO:NtPat3
    label: NtPat3
  - id: PR:000012798
    label: phospholipase A2 (PLA2)
  - id: AUTO:patatin
    label: patatin
  - id: CHEBI:15560
    label: 12-oxophytodienoic acid
  - id: NCBITaxon:12242
    label: tobacco mosaic virus
  - id: AUTO:NtPat
    label: NtPat
  - id: AUTO:virus-infected%20leaves
    label: virus-infected leaves
  - id: AUTO:N/A
    label: N/A
  - id: CHEBI:61121
    label: oxylipins
  - id: CHEBI:60956
    label: colneleic acid
  - id: CHEBI:60959
    label: colnelenic acid
  - id: NCBITaxon:4787
    label: Phytophthora infestans
  - id: AUTO:NPR1
    label: NPR1
  - id: AUTO:PR-1
    label: PR-1
  - id: PR:000011377
    label: NPR1
  - id: CHEBI:35962
    label: salicylic acid (SA)
  - id: NCBITaxon:3702
    label: Arabidopsis thaliana
  - id: AUTO:Pseudomonas%20rhizobacteria
    label: Pseudomonas rhizobacteria
  - id: NCBITaxon:323
    label: Pseudomonas syringae pv. tomato (Pst)
  - id: AUTO:pad3-1
    label: pad3-1
  - id: AUTO:Pseudomonas%20syringae
    label: Pseudomonas syringae
  - id: AUTO:Peronospora%20parasitica
    label: Peronospora parasitica
  - id: AUTO:Erysiphe%20orontii
    label: Erysiphe orontii
  - id: AUTO:Alternaria%20brassicicola
    label: Alternaria brassicicola
  - id: AUTO:Botrytis%20cinerea
    label: Botrytis cinerea
  - id: PR:000012221
    label: PAD3
  - id: CHEBI:22990
    label: camalexin
  - id: CHEBI:30762
    label: salicylate
  - id: CHEBI:58431
    label: jasmonate
  - id: NCBITaxon:3701
    label: Arabidopsis
  - id: NCBITaxon:317
    label: Pseudomonas syringae
  - id: NCBITaxon:123356
    label: Peronospora parasitica
  - id: NCBITaxon:62715
    label: Erysiphe orontii
  - id: NCBITaxon:29001
    label: Alternaria brassicicola
  - id: NCBITaxon:40559
    label: Botrytis cinerea
caufieldjh commented 8 months ago

Thanks for pointing this out @serenalotreck - should be a quick fix.

serenalotreck commented 5 months ago

@caufieldjh wondering if this has been fixed? I'm running OntoGPT on a large number of documents and the ballooning size of the YAML is severely slowing down my ability to parse it into KGX format -- it takes several hours just to read in the YAML file.
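For output that has already been generated, one possible workaround is to prune each document's named_entities down to the ids that its own extracted_object actually references. The sketch below assumes each document has already been parsed into a plain dict with the extracted_object and named_entities keys shown in the example above; the function names are hypothetical and are not part of OntoGPT.

```python
from typing import Any, Iterator


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string found anywhere inside nested dicts and lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def prune_named_entities(doc: dict) -> dict:
    """Return a shallow copy of one output document whose named_entities
    keeps only entries whose id occurs in that document's extracted_object."""
    used_ids = set(iter_strings(doc.get("extracted_object", {})))
    pruned = dict(doc)
    pruned["named_entities"] = [
        entity
        for entity in doc.get("named_entities") or []
        if entity.get("id") in used_ids
    ]
    return pruned


# Cut-down document in the shape of the output above; the first and last
# named_entities entries are leftovers from an earlier document.
example = {
    "extracted_object": {
        "id": "8ea1b738-89ed-4b2b-b03d-92df6792a2c7",
        "label": "oxidized lipid-derived molecules",
        "genes": ["AUTO:NtPat1"],
        "molecules": ["CHEBI:18292"],
        "gene_organism_relationships": [
            {"gene": "AUTO:NtPat", "organism": "NCBITaxon:4097"}
        ],
    },
    "named_entities": [
        {"id": "GO:0004707", "label": "MAPK"},
        {"id": "CHEBI:18292", "label": "jasmonic acid"},
        {"id": "AUTO:NtPat1", "label": "NtPat1"},
        {"id": "AUTO:NtPat", "label": "NtPat"},
        {"id": "NCBITaxon:4097", "label": "tobacco"},
        {"id": "AUTO:SIPK", "label": "SIPK"},
    ],
}

pruned = prune_named_entities(example)
```

This only shrinks the data after it has been loaded, so it does not speed up the initial read. Iterating over the file one document at a time (for instance with PyYAML's `yaml.safe_load_all`) and pruning as you go at least avoids holding every inflated list in memory at once.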

Not directly related, but the slow import compounds another problem: ChatGPT seems to be inserting disallowed Unicode characters in random places, which breaks YAML safe_load, and it takes several hours to locate each one, since I only find it by reading the file until it breaks again. I'm currently trying to find and remove them all preemptively before reading in the YAML file, but it seems like something that shouldn't be happening in the first place. I haven't tried making a small reproducible example (am under a deadline), so I won't open a new issue yet, but wondered if you'd experienced anything similar.

caufieldjh commented 5 months ago

Hi @serenalotreck - going to attempt a fix for this today.

I haven't explicitly seen any issues with GPT emitting weird Unicode characters, but it seems bound to happen in any sufficiently large collection of extractions, and we've seen something potentially related when extracting from many PubMed entries. I'm going to treat this issue as related to #323, as there should be preprocessing to handle it.

caufieldjh commented 5 months ago

OK, please try pulling the most recent repo version and let me know if you're still seeing redundant named entities.

serenalotreck commented 5 months ago

Looks like that fixed it, thanks!