monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

Using `csv_wrapper` #343

Closed: serenalotreck closed this issue 3 months ago

serenalotreck commented 4 months ago

I'm applying OntoGPT over a large (~100,000) set of documents. I have the extraction working fine, and have the output for each document saved to an individual file.

But I need to transform the YAML/plain text output into a CSV file of triples that I can use downstream. I've found src/ontogpt/io/csv_wrapper, but it's not documented and I'm having trouble figuring out how to use it.

Both functions in the wrapper take an object and a filepath. From the code it looks like the filepath points to a .yaml file that's an output from OntoGPT, but I have no clue what the object is supposed to be. Additionally, I'm not sure if these functions are written for a specific extraction schema, since phrases like "No gene to molecular activity relationships mentioned in the text." appear in NULL_VALS.

Any help here is appreciated!

caufieldjh commented 4 months ago

Hi @serenalotreck - the csv_wrapper is unfortunately an incomplete set of functions right now, though its goal was to do exactly what you're describing: turn output into a set of triples.

The linkml-transformer tools may help you - check out https://github.com/linkml/linkml-transformer and this tutorial notebook: https://github.com/linkml/linkml-transformer/blob/main/notebooks/Tutorial.ipynb. In this case, since you already have all the YAML output, linkml-transformer should be able to convert it to a shape that is easier to write out as CSV/TSV (i.e., it will do everything but that last file conversion). It may require you to write a small target schema like the one in the tutorial.

Hopefully we can get this functionality added to OntoGPT soon, too.

serenalotreck commented 3 months ago

Thanks for the direction! I'm starting to work on implementing this now, do you think it makes sense for me to add it directly to csv_wrapper.py and open a PR when I'm done? Or is there someplace else it should go?

caufieldjh commented 3 months ago

Sure, go for it! All PRs are appreciated.

serenalotreck commented 3 months ago

Curious for your thoughts here. I read in the YAML output with pyyaml, and this is what one sample doc looks like:

{'input_text': 'In tobacco, two mitogen-activated protein (MAP) kinases, designated salicylic acid (SA)-induced protein kinase (SIPK) and wounding-induced protein kinase (WIPK) are activated in a disease resistance-specific manner following pathogen infection or elicitor treatment. To investigate whether nitric oxide (NO), SA, ethylene, or jasmonic acid (JA) are involved in this phenomenon, the ability of these defense signals to activate these kinases was assessed. Both NO and SA activated SIPK; however, they did not activate WIPK. Additional analyses with transgenic NahG tobacco revealed that SA is required for the NO-mediated induction of SIPK. Neither JA nor ethylene activated SIPK or WIPK. Thus, SIPK may function downstream of SA in the NO signaling pathway for defense responses, while the signals responsible for resistance-associated WIPK activation have yet to be determined.',
 'raw_completion_output': 'genes: MAPK; SIPK; WIPK; NahG\nproteins: salicylic acid-induced protein kinase; wounding-induced protein kinase\nmolecules: nitric oxide; salicylic acid; ethylene; jasmonic acid\norganisms: tobacco\ngene_gene_interactions: \ngene_protein_interactions: \ngene_organism_relationships: \nprotein_protein_interactions: \nprotein_organism_relationships: \ngene_molecule_interactions: \nprotein_molecule_interactions: \nlabel: mitogen-activated protein (MAP) kinases',
 'prompt': 'From the text below, extract the following entities in the following format:\n\ngenes: <A semicolon-separated list of genes.>\nproteins: <A semicolon-separated list of proteins.>\nmolecules: <A semicolon-separated list of molecules.>\norganisms: <A semicolon-separated list of taxonomic terms of living things.>\ngene_gene_interactions: <A semicolon-separated list of gene-gene interactions.>\ngene_protein_interactions: <A semicolon-separated list of gene-protein interactions.>\ngene_organism_relationships: <A semicolon-separated list of gene-organism relationships.>\nprotein_protein_interactions: <A semicolon-separated list of protein-protein interactions.>\nprotein_organism_relationships: <A semicolon-separated list of protein-organism relationships.>\ngene_molecule_interactions: <A semicolon-separated list of gene-molecule interactions.>\nprotein_molecule_interactions: <A semicolon-separated list of protein-molecule interactions.>\nlabel: <The label (name) of the named thing>\n\n\nText:\nIn tobacco, two mitogen-activated protein (MAP) kinases, designated salicylic acid (SA)-induced protein kinase (SIPK) and wounding-induced protein kinase (WIPK) are activated in a disease resistance-specific manner following pathogen infection or elicitor treatment. To investigate whether nitric oxide (NO), SA, ethylene, or jasmonic acid (JA) are involved in this phenomenon, the ability of these defense signals to activate these kinases was assessed. Both NO and SA activated SIPK; however, they did not activate WIPK. Additional analyses with transgenic NahG tobacco revealed that SA is required for the NO-mediated induction of SIPK. Neither JA nor ethylene activated SIPK or WIPK. Thus, SIPK may function downstream of SA in the NO signaling pathway for defense responses, while the signals responsible for resistance-associated WIPK activation have yet to be determined.\n\n===\n\n',
 'extracted_object': {'id': '6a86d066-3c07-4b2a-ae25-a1d62a587dda',
  'label': 'mitogen-activated protein (MAP) kinases',
  'genes': ['GO:0004707', 'AUTO:SIPK', 'AUTO:WIPK', 'AUTO:NahG'],
  'proteins': ['AUTO:salicylic%20acid-induced%20protein%20kinase',
   'AUTO:wounding-induced%20protein%20kinase'],
  'molecules': ['CHEBI:16480', 'CHEBI:16914', 'CHEBI:18153', 'CHEBI:18292'],
  'organisms': ['NCBITaxon:4097']},
 'named_entities': [{'id': 'GO:0004707', 'label': 'MAPK'},
  {'id': 'AUTO:SIPK', 'label': 'SIPK'},
  {'id': 'AUTO:WIPK', 'label': 'WIPK'},
  {'id': 'AUTO:NahG', 'label': 'NahG'},
  {'id': 'AUTO:salicylic%20acid-induced%20protein%20kinase',
   'label': 'salicylic acid-induced protein kinase'},
  {'id': 'AUTO:wounding-induced%20protein%20kinase',
   'label': 'wounding-induced protein kinase'},
  {'id': 'CHEBI:16480', 'label': 'nitric oxide'},
  {'id': 'CHEBI:16914', 'label': 'salicylic acid'},
  {'id': 'CHEBI:18153', 'label': 'ethylene'},
  {'id': 'CHEBI:18292', 'label': 'jasmonic acid'},
  {'id': 'NCBITaxon:4097', 'label': 'tobacco'}]}

I feel like I might not need linkml-transformer since the output here is already structured as a dictionary; thoughts?
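
For illustration, working with that dictionary directly might look like the sketch below. It is not OntoGPT code: `index_entities` is a hypothetical helper, the sample is trimmed from the output above, and it assumes entity slots in `extracted_object` are lists of ID strings.

```python
# Trimmed copy of the parsed output shown above.
doc = {
    "extracted_object": {
        "id": "6a86d066-3c07-4b2a-ae25-a1d62a587dda",
        "label": "mitogen-activated protein (MAP) kinases",
        "genes": ["GO:0004707", "AUTO:SIPK"],
        "molecules": ["CHEBI:16480"],
        "organisms": ["NCBITaxon:4097"],
    },
    "named_entities": [
        {"id": "GO:0004707", "label": "MAPK"},
        {"id": "AUTO:SIPK", "label": "SIPK"},
        {"id": "CHEBI:16480", "label": "nitric oxide"},
        {"id": "NCBITaxon:4097", "label": "tobacco"},
    ],
}


def index_entities(doc):
    """Map each entity ID to its label and the slot it was listed under."""
    labels = {ent["id"]: ent["label"] for ent in doc.get("named_entities", [])}
    index = {}
    for slot, values in doc["extracted_object"].items():
        # Entity slots are lists; scalars such as 'id' and 'label' are skipped.
        if not isinstance(values, list):
            continue
        for value in values:
            # Only plain ID strings are entities; relation dicts are ignored here.
            if isinstance(value, str):
                index[value] = {"label": labels.get(value), "type": slot}
    return index


entities = index_entities(doc)
```

The slot name (`genes`, `molecules`, ...) is used as a stand-in for the entity type, which is an assumption of this sketch rather than something the output states explicitly.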

serenalotreck commented 3 months ago

Something else I'd appreciate a thought on -- would it be better to have a separate dataframe for just entities, and then one for relations? I can imagine that some schemas will only extract entities, so it would be excessive to have columns in a dataframe for non-existent relations, and for schemas that do extract relations, there are also loose entities that don't belong to any relation. It seems suboptimal to return two dataframes, so I had drafted this potential set of columns:

entity1_label, entity1_id, entity1_type, rel_type, entity2_label, entity2_id, entity2_type, original_rel_text, extracted_obj_id

and then I would just put NaN in all columns besides those for entity 1 for loose entities.
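
A minimal sketch of that single-table layout, using only the standard library. Everything here is illustrative: the sample document and `flat_rows` are hypothetical, relations are assumed to be two-entry dicts keyed by entity type, empty CSV cells stand in for NaN, and `original_rel_text` is left empty because the parsed object does not carry it.

```python
import csv
import io

COLUMNS = [
    "entity1_label", "entity1_id", "entity1_type", "rel_type",
    "entity2_label", "entity2_id", "entity2_type",
    "original_rel_text", "extracted_obj_id",
]

# Hypothetical parsed output: one relation plus one entity outside any relation.
doc = {
    "extracted_object": {
        "id": "doc-1",
        "genes": ["AUTO:geneA"],
        "proteins": ["AUTO:proteinB"],
        "molecules": ["CHEBI:16480"],
        "gene_protein_interactions": [
            {"gene": "AUTO:geneA", "protein": "AUTO:proteinB"}
        ],
    },
    "named_entities": [
        {"id": "AUTO:geneA", "label": "geneA"},
        {"id": "AUTO:proteinB", "label": "proteinB"},
        {"id": "CHEBI:16480", "label": "nitric oxide"},
    ],
}


def flat_rows(doc):
    obj = doc["extracted_object"]
    labels = {e["id"]: e["label"] for e in doc.get("named_entities", [])}
    rows, in_relation = [], set()
    # Pass 1: relation slots hold dicts of {entity_type: entity_id}.
    for slot, values in obj.items():
        if not isinstance(values, list):
            continue
        for rel in values:
            if not isinstance(rel, dict) or len(rel) != 2:
                continue
            (type1, id1), (type2, id2) = rel.items()
            in_relation.update([id1, id2])
            rows.append({
                "entity1_label": labels.get(id1), "entity1_id": id1,
                "entity1_type": type1, "rel_type": slot,
                "entity2_label": labels.get(id2), "entity2_id": id2,
                "entity2_type": type2, "extracted_obj_id": obj["id"],
            })
    # Pass 2: loose entities get a row with only the entity1 columns filled.
    for slot, values in obj.items():
        if not isinstance(values, list):
            continue
        for value in values:
            if isinstance(value, str) and value not in in_relation:
                rows.append({"entity1_label": labels.get(value),
                             "entity1_id": value, "entity1_type": slot})
    return rows


rows = flat_rows(doc)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Note that in this sketch the type of an entity inside a relation comes from the relation's key (`gene`), while a loose entity takes its slot name (`genes`); a real implementation would want to normalize the two.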

caufieldjh commented 3 months ago

> I feel like I might not need linkml-transformer since the output here is already structured as a dictionary; thoughts?

True, you could just output the dictionary as-is, but even in this case it isn't clear what a triple should look like. Using a schema as part of the transformation process can help.

> would it be better to have a separate dataframe for just entities, and then one for relations?

This would align closely with the KGX standard, which was our original goal with this function. Should be easier to do it that way, too, since the named_entities are all in their own section of the YAML.

> it would be excessive to have columns in a dataframe for non-existent relations

That depends on what you consider a relation, of course! The LinkML schemas don't really know or care about whether you're following a property graph model or not, so some users may actually want to express something like {'id': 'CHEBI:16480', 'label': 'nitric oxide'} as CHEBI:16480 rdfs:label 'nitric oxide'. But if we assume that we're just going to use the KGX standard linked above, then yes, a given extraction may or may not have defined relations but should have entities.

These are some format-compliant headers:

For nodes/entities -

id  category    name    description provided_by

For edges/relations -

id  subject predicate   object  category

These are minimal; other fields can go in each but these should cover the basics. Some fields like category and description can be looked up through OAK so they don't even really need to be populated as part of the transform.
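
A rough sketch of producing those two tables from one parsed document, using only the standard library. The function names are hypothetical and several choices are assumptions: the slot name is used as a placeholder for both node `category` and edge `predicate`, `provided_by` is filled with the extracted object's id, `description` and edge `category` are left blank, and edge ids are a running integer. The IDs come from a sample in this thread; the label for `AUTO:NtPat2` is assumed, since that document's named_entities are not shown.

```python
import csv
import io

NODE_COLUMNS = ["id", "category", "name", "description", "provided_by"]
EDGE_COLUMNS = ["id", "subject", "predicate", "object", "category"]

doc = {
    "extracted_object": {
        "id": "8ea1b738-89ed-4b2b-b03d-92df6792a2c7",
        "genes": ["AUTO:NtPat2"],
        "proteins": ["PR:000012798"],
        "gene_protein_interactions": [
            {"gene": "AUTO:NtPat2", "protein": "PR:000012798"}
        ],
    },
    # Assumed label, for illustration only.
    "named_entities": [{"id": "AUTO:NtPat2", "label": "NtPat2"}],
}


def to_nodes_and_edges(doc):
    obj = doc["extracted_object"]
    labels = {e["id"]: e["label"] for e in doc.get("named_entities", [])}
    nodes, edges = {}, []
    for slot, values in obj.items():
        if not isinstance(values, list):
            continue
        for value in values:
            if isinstance(value, str):
                # Entity slot: keep the first slot an ID appears under.
                nodes.setdefault(value, {
                    "id": value, "category": slot,
                    "name": labels.get(value, ""), "provided_by": obj["id"],
                })
            elif isinstance(value, dict) and len(value) == 2:
                # Relation slot: first member is the subject, second the object.
                subject, target = value.values()
                edges.append({"id": len(edges) + 1, "subject": subject,
                              "predicate": slot, "object": target})
    return list(nodes.values()), edges


def to_tsv(rows, columns):
    """Serialize rows as tab-separated text; missing columns become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


nodes, edges = to_nodes_and_edges(doc)
nodes_tsv = to_tsv(nodes, NODE_COLUMNS)
edges_tsv = to_tsv(edges, EDGE_COLUMNS)
```

Keeping nodes in a dict keyed by ID means an entity mentioned in several slots or relations is written once, which is the main practical difference from the single-table layout.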

Anyway, please let me know how that works with your use case!

serenalotreck commented 3 months ago

Thanks for the feedback! I implemented a version with the column names I previously suggested & just one df, but should be straightforward to do it this way instead. The one potential issue I see is that as far as I can tell, OntoGPT doesn't assign an ID to relations the same way it does for the entities -- should I just assign an arbitrary integer ID for the relations?

For example, the extracted_object property of the output of one of my docs that has relations looks like this:

{'id': '8ea1b738-89ed-4b2b-b03d-92df6792a2c7',
 'label': 'oxidized lipid-derived molecules',
 'genes': ['AUTO:NtPat1', 'AUTO:NtPat2', 'AUTO:NtPat3'],
 'proteins': ['PR:000012798', 'AUTO:patatin'],
 'molecules': ['CHEBI:15560', 'CHEBI:18292'],
 'organisms': ['NCBITaxon:4097', 'NCBITaxon:12242'],
 'gene_protein_interactions': [{'gene': 'AUTO:NtPat2',
   'protein': 'PR:000012798'}],
 'gene_organism_relationships': [{'gene': 'AUTO:NtPat',
   'organism': 'AUTO:virus-infected%20leaves'}]}

So the strings used to list the entities are their IDs, but there's no ID for relations.

caufieldjh commented 3 months ago

Good point! Elsewhere in ontogpt (like here: https://github.com/monarch-initiative/ontogpt/blob/bba969258dbe30436a5df880f5a016f4409d89ee/src/ontogpt/engines/spires_engine.py#L568) we autogenerate ids where they're missing, and that should work here too. An arbitrary integer would also work well.
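
As a sketch of the two ID options, here is one way to attach either a running integer or a reproducible UUID to each relation. This does not mirror how spires_engine.py generates ids; `relation_key` and `add_relation_ids` are hypothetical helpers, and the sample is trimmed from the extracted_object shown earlier in the thread.

```python
import uuid

extracted_object = {
    "id": "8ea1b738-89ed-4b2b-b03d-92df6792a2c7",
    "genes": ["AUTO:NtPat1", "AUTO:NtPat2", "AUTO:NtPat3"],
    "gene_protein_interactions": [
        {"gene": "AUTO:NtPat2", "protein": "PR:000012798"}
    ],
    "gene_organism_relationships": [
        {"gene": "AUTO:NtPat", "organism": "AUTO:virus-infected%20leaves"}
    ],
}


def relation_key(slot, relation):
    """Canonical string for a relation: slot name plus sorted role=ID pairs."""
    parts = [f"{role}={entity_id}" for role, entity_id in sorted(relation.items())]
    return slot + "|" + "|".join(parts)


def add_relation_ids(extracted_object, stable=False):
    """Return (id, slot, relation) tuples for every relation in the object."""
    out = []
    for slot, values in extracted_object.items():
        if not isinstance(values, list):
            continue
        for relation in values:
            # Entity slots hold strings; only dict entries are relations.
            if not isinstance(relation, dict):
                continue
            if stable:
                # Same slot and members always produce the same ID.
                rel_id = str(uuid.uuid5(uuid.NAMESPACE_URL,
                                        relation_key(slot, relation)))
            else:
                # Arbitrary running integer, unique within this object only.
                rel_id = len(out) + 1
            out.append((rel_id, slot, relation))
    return out


numbered = add_relation_ids(extracted_object)
hashed = add_relation_ids(extracted_object, stable=True)
```

With `stable=True` the same relation found in two documents receives the same ID; if per-document IDs are wanted instead, the extracted object's own `id` could be added to the key.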

I should also mention the existence of linkml-convert - it may be a bit closer to what you're looking for, with the caveat that I haven't been able to get it to work consistently with ontogpt extract outputs.

serenalotreck commented 3 months ago

Thanks for the suggestion! I've managed a relatively simple implementation without using either linkml-transformer or linkml-convert, but I'm not sure if I've relied on assumptions that aren't generalizable. I'll open a PR when I'm done refactoring to make it KGX compliant, curious to see what you think!