monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

Best way to apply on large quantities of documents? #352

Open serenalotreck opened 5 months ago

serenalotreck commented 5 months ago

I have a corpus of ~75,000 abstracts that I want to make a KG out of using OntoGPT. After 4 hours, it only got through 50 documents -- not super promising!

I took a look through the docs to see if there was a parallelization option, but didn't find anything -- is there a better way to run OntoGPT over tons of documents besides making a bunch of separate small directories and submitting a bunch of different jobs?

If you have a thought about where in the code it would make sense to add parallelization capabilities I'm happy to give a shot at opening a PR!
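
As an aside, one hedged workaround along the lines of "a bunch of different jobs" is to shard the corpus and run independent ontogpt processes, e.g. with GNU parallel; the template name and paths here are hypothetical:

ls abstracts/*.txt | parallel -j 8 \
  'ontogpt extract -t my_template -i {} -o out/{/.}.yaml'

Each shard pays its own startup and annotator-loading cost, so this trades memory and redundant setup for wall-clock time.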

serenalotreck commented 5 months ago

So I did some digging, and it looks like OpenAI natively supports batching: all you have to do is pass a list of prompts in a single create call. However, they give the warning that "the response object may not return completions in the order of the prompts, so always remember to match responses back to prompts using the index field."
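
For reference, a minimal sketch of that batching pattern, assuming the legacy (pre-1.0) openai Python package; the prompts are placeholders, and the index field on each choice is what maps completions back to prompts:

import openai

prompts = [
    "Extract entities from abstract 1 ...",
    "Extract entities from abstract 2 ...",
]

# The legacy Completions endpoint accepts a list of prompts in one call.
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompts,
)

# Completions may come back out of order, so match them to their prompts
# via choice.index, as the OpenAI docs advise.
results = [None] * len(prompts)
for choice in response.choices:
    results[choice.index] = choice.text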

I took a look through the code to see where I could change things to allow for batching. The OpenAI client's complete function only takes a single prompt, but I think you could pass a list of prompts without changing that function's signature -- though the response handling would have to change.

I then looked at where that function is called, to see where the input would have to change. The complete function in the CLI module also only has access to one doc at a time.

I'm having some trouble following the code in cli.py to find where I would need to make changes to allow batching -- wondering what your thoughts are?

caufieldjh commented 5 months ago

Sure! Batching would be a useful feature to support.

Is there additional slowdown elsewhere in the system, like in the grounding steps? 4 hours for 50 documents does sound slower than I've seen, but I think it's a factor of the total number of queries being passed to the OpenAI API, and that depends on how many classes there are in your schema, how much text each document contains, etc.

So to add the batching:

serenalotreck commented 5 months ago

Is there additional slowdown elsewhere in the system, like in the grounding steps?

That is a fantastic question -- I hadn't thought about it, but in another NER algorithm I used a while ago, the classification was super fast while the grounding took a prohibitively long time. I ended up doing all the classification first and then all the grounding at once (as opposed to document-by-document) to speed it up. Is there a quick way for me to check whether it's the grounding or OpenAI before I go ahead and try to implement batching, or should I just go through the code and add time statements?

caufieldjh commented 5 months ago

Yes - the easiest way is to repeat exactly the same extract command, since OntoGPT will cache the OpenAI results. The amount of time the command takes to complete the second time is almost entirely time spent on grounding (aside from retrieving results from the local cache, which should be very fast).
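
Concretely, the check looks something like this (hedged example; the template and input names are placeholders):

# First run: OpenAI calls + grounding
time ontogpt extract -t my_template -i abstract.txt

# Second run: OpenAI responses come from the local cache,
# so elapsed time is almost entirely grounding
time ontogpt extract -t my_template -i abstract.txt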

serenalotreck commented 5 months ago

Ok, so it definitely looks like it's the grounding. Here are the times from running it on a single doc with the bash time command:

real    6m6.517s
user    0m57.297s
sys     4m40.834s

real    6m41.741s
user    1m3.146s
sys     5m17.148s

real    5m45.977s
user    0m55.273s
sys     4m28.963s

real    5m54.585s
user    0m58.343s
sys     4m43.914s

The output for this doc looks like:

extracted_object:
  id: 157a2bef-ec52-495b-b53d-0ee3c4dbcf64
  label: Drought-resistance mechanisms of seven warm-season turfgrasses under surface
    soil drying
  genes:
    - AUTO:-
  proteins:
    - AUTO:-
  molecules:
    - AUTO:-
  organisms:
    - NCBITaxon:28909
    - NCBITaxon:866555
    - NCBITaxon:158149
    - NCBITaxon:309978
    - NCBITaxon:262758
  gene_gene_interactions:
    - gene1: AUTO:BRCA1
      gene2: AUTO:TP53
  gene_protein_interactions:
    - gene: AUTO:TP53
      protein: PR:000004803
  gene_organism_relationships:
    - gene: AUTO:Notch
      organism: NCBITaxon:9606
  protein_protein_interactions:
    - protein1: AUTO:protein%20A
      protein2: PR:000009761
  protein_organism_relationships:
    - protein: AUTO:Hemoglobin
      organism: NCBITaxon:9606
  protein_molecule_interactions:
    - protein: AUTO:enzyme
      molecule: AUTO:allosteric%20inhibitor
named_entities:
  - id: AUTO:-
    label: '-'
  - id: NCBITaxon:28909
    label: Cynodon dactylon
  - id: NCBITaxon:866555
    label: Eremochloa ophiuroides
  - id: NCBITaxon:158149
    label: Paspalum vaginatum
  - id: NCBITaxon:309978
    label: Zoysia japonica
  - id: NCBITaxon:262758
    label: Zoysia tenuifolia
  - id: AUTO:BRCA1
    label: BRCA1
  - id: AUTO:TP53
    label: TP53
  - id: PR:000004803
    label: BRCA1
  - id: AUTO:Notch
    label: Notch
  - id: NCBITaxon:9606
    label: Homo sapiens
  - id: AUTO:protein%20A
    label: protein A
  - id: PR:000009761
    label: protein B
  - id: AUTO:Hemoglobin
    label: Hemoglobin
  - id: AUTO:enzyme
    label: enzyme
  - id: AUTO:allosteric%20inhibitor
    label: allosteric inhibitor

EDIT: Let me know what your thoughts are about the best way to tackle this!

I'm also getting this error with -O kgx -- not sure if you know off the top of your head what might be going on, or if I should open another issue about it:

Traceback (most recent call last):
  File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/bin/ontogpt", line 8, in <module>
    sys.exit(main())
  File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/mnt/ufs18/home-118/lotrecks/haai_vanburen_lab/ontogpt/src/ontogpt/cli.py", line 393, in extract
    write_extraction(results, output, output_format, ke, template)
  File "/mnt/ufs18/home-118/lotrecks/haai_vanburen_lab/ontogpt/src/ontogpt/cli.py", line 115, in write_extraction
    output.seek(0)
io.UnsupportedOperation: underlying stream is not seekable

caufieldjh commented 5 months ago

Wow, that time is much longer than I would expect. It may be because NCBITaxon and PR are two very large ontologies, and they're used as annotators in multiple classes in your schema. A quick check for this may be to temporarily disable annotation for a specific class, or change the domain of the corresponding slot to string so it doesn't ground - then run again and see if the process is faster. There are some ways to further tune the schema, plus alternative strategies for annotation, and most of them come down to using smaller sets of potential identifiers (e.g., you may be able to use a slim version of NCBITaxon if you're primarily expecting taxons to be grasses).
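
As a concrete sketch of that quick check, using an organisms slot as an example (hypothetical; the point is just that a string range skips grounding entirely):

organisms:
  range: string   # was a NamedEntity subclass, so these values are no longer grounded
  multivalued: true
  description: A semicolon-separated list of taxonomic terms of living things.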

caufieldjh commented 5 months ago

The kgx error is probably a separate issue, maybe one that happens when the output stream isn't the expected type.

cmungall commented 5 months ago

Also depends on how you're grounding: semsql, bioportal, Gilda…

I believe there may be an embarrassingly simple optimization for grounding: the naive strategy may be reindexing on each invocation.


serenalotreck commented 5 months ago

A quick check for this may be to temporarily disable annotation for a specific class, or change the domain of the corresponding slot to string so it doesn't ground - then run again and see if the process is faster.

I'll give this a shot when I get the chance!

There are some ways to further tune the schema, plus alternative strategies for annotation, and most of them come down to using smaller sets of potential identifiers (e.g., you may be able to use a slim version of NCBITaxon if you're primarily expecting taxons to be grasses)

This is also a great suggestion, I am expecting them basically all to be plant species.

I haven't taken an in-depth look at the code for grounding yet -- is there a way to save grounding for last and then only do it on unique entities? In the past, with other algorithms, I've disabled grounding until I had all the entities, then removed duplicates to reduce the number of times the grounding process has to run.
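
A rough sketch of that classify-first, ground-once idea (hypothetical helper; ground_fn stands in for whatever annotator call does the grounding):

def ground_unique(entities_per_doc, ground_fn):
    """entities_per_doc: dict of doc_id -> list of entity strings.
    ground_fn: grounds a single string, e.g. via an OAK annotator."""
    unique = {e for ents in entities_per_doc.values() for e in ents}
    # Each unique surface form is grounded exactly once
    grounded = {e: ground_fn(e) for e in unique}
    return {doc: [grounded[e] for e in ents]
            for doc, ents in entities_per_doc.items()}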

serenalotreck commented 5 months ago

Also depends on how you're grounding: semsql, bioportal, Gilda…

@cmungall what grounding approach am I using if my annotators are all prefixed with sqlite:obo:, and where should I go to read more about it? I'm relatively new to the ontology stuff.

I believe there may be an embarrassingly simple optimization for grounding: the naive strategy may be reindexing on each invocation

Could you elaborate on this?

caufieldjh commented 5 months ago

Annotators prefixed with sqlite: mean they use pre-built Semantic SQL versions of an ontology - more details here: https://github.com/INCATools/semantic-sql. Those are already one of the faster options.

As for the other grounding strategies, they're described in the OAK docs here: https://incatools.github.io/ontology-access-kit/packages/implementations/index.html and in some detail in the ontogpt docs here: https://monarch-initiative.github.io/ontogpt/custom/ (see the section "The classes"). OntoGPT can essentially use any of the OAK adapters, though in the context of this issue, some will be much slower than others.
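
For orientation, a small hedged example of what those sqlite:obo: selectors resolve to via OAK; get_adapter and annotate_text are part of the OAK interface, though exact matching behavior depends on the adapter:

from oaklib import get_adapter

# Loads the pre-built Semantic SQL build of NCBITaxon (downloaded on first use)
adapter = get_adapter("sqlite:obo:ncbitaxon")

for ann in adapter.annotate_text("Cynodon dactylon is a warm-season turfgrass"):
    print(ann.subject_start, ann.subject_end, ann.object_id, ann.object_label)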

serenalotreck commented 5 months ago

Ok, so I did a closer read of the paper, looked at the grounding implementation, and took a cursory skim through the semantic-sql docs.

Annotators prefixed with sqlite: mean they use pre-built Semantic SQL versions of an ontology - more details here: https://github.com/INCATools/semantic-sql Those are already one of the faster options.

In this case, since they're already faster and provide the ontologies I want, I'm planning to stick with the Semantic SQL annotators.

I've got a few clarifying questions that I couldn't find answers for.

  1. Does grounding involve any LLM calls? In the paper it's not clear to me whether that's the case or if you just compared LLM-based grounding to your approach, but in the code I don't see any obvious LLM calls, so I wanted to make sure my understanding was correct.
  2. I wanted to ask about this line from the paper: “[using LinkML] allows for a full representation of the necessary schema elements while incorporating LinkML’s powerful mechanism for specifying static and dynamic value sets. For example, a value set can be constructed as a declarative query of the form “include branches A, B and C from ontology O1, excluding sub-branch D, and include all of ontology O2”.” You had mentioned earlier that paring down the ontologies I'm using would speed things up. Is this value set construction how I would do that? I'm not totally sure if that's the case, since the schema I'm using is something I designed, and I'm just calling these other ontologies as annotators.

In any case, I'd super appreciate some practical pointers on where to begin optimizing, whether that be changes to the code (is there a place for parallelization here?) or semantic-sql-related ontology changes.

EDIT: Forgot to mention that I made an equivalent schema with no annotators, and it was many, many times faster -- but, as noted in the paper, it has absolutely atrocious performance. So it's definitely the grounding side of things causing the issue.

caufieldjh commented 5 months ago

Does grounding involve any LLM calls? In the paper it's not clear to me whether that's the case or if you just compared LLM-based grounding to your approach, but in the code I don't see any obvious LLM calls, so I wanted to make sure my understanding was correct.

Nope! The grounding process does not involve the LLM at all and the LLM is never aware of the parameters you use for grounding unless you explicitly tell it (e.g., if the description for a class says "these should be Gene Ontology terms" or something). This is intentional since LLMs are prone to hallucinating completely nonexistent IDs and misrepresenting the connections between terms and IDs.

I wanted to ask about this line from the paper: “[using LinkML] allows for a full representation of the necessary schema elements while incorporating LinkML’s powerful mechanism for specifying static and dynamic value sets. For example, a value set can be constructed as a declarative query of the form “include branches A, B and C from ontology O1, excluding sub-branch D, and include all of ontology O2”.” You had mentioned earlier that paring down the ontologies I'm using would speed things up. Is this value set construction how I would do that? I'm not totally sure if that's the case, since the schema I'm using is something I designed, and I'm just calling these other ontologies as annotators.

I'm not certain if this will provide major performance boosts, but it's worth a try. In the class you want to restrict, use slot_usage on the id slot and set its values_from to an enum. The enum can use the reachable_from slot to define the value set as all child terms of the listed identifiers. There are some examples in the ctd_ner schema:

  Disease:
    is_a: NamedEntity
    annotations:
      annotators: "sqlite:obo:mesh, sqlite:obo:mondo, sqlite:obo:hp, sqlite:obo:ncit, sqlite:obo:doid, bioportal:meddra"
      prompt.examples: cardiac asystole, COVID-19, Headache, cancer
    # For the purposes of evaluating against BC5CDR, we force normalization to MESH
    id_prefixes:
      - MESH
    slot_usage:
      id:
        pattern: "^MESH:[CD][0-9]{6}$"
        values_from:
          - MeshDiseaseIdentifier

enums:
...
  MeshDiseaseIdentifier:
    reachable_from:
      source_ontology: obo:mesh
      source_nodes:
        - MESH:D001423 ## Bacterial Infections and Mycoses
        - MESH:D001523 ## Mental Disorders
        - MESH:D002318 ## Cardiovascular Diseases
        - MESH:D002943 ## Circulatory and Respiratory Physiological Phenomena
        - MESH:D004066 ## Digestive System Diseases
        - MESH:D004700 ## Endocrine System Diseases
        - MESH:D005128 ## Eye Diseases
        - MESH:D005261 ## Female Urogenital Diseases and Pregnancy Complications
        - MESH:D006425 ## Hemic and Lymphatic Diseases
        - MESH:D007154 ## Immune System Diseases
        - MESH:D007280 ## Disorders of Environmental Origin
        - MESH:D009057 ## Stomatognathic Diseases
        - MESH:D009140 ## Musculoskeletal Diseases
        - MESH:D009358 ## Congenital, Hereditary, and Neonatal Diseases and Abnormalities
        - MESH:D009369 ## Neoplasms
        - MESH:D009422 ## Nervous System Diseases
        - MESH:D009750 ## Nutritional and Metabolic Diseases
        - MESH:D009784 ## Occupational Diseases
        - MESH:D010038 ## Otorhinolaryngologic Diseases
        - MESH:D010272 ## Parasitic Diseases
        - MESH:D012140 ## Respiratory Tract Diseases
        - MESH:D013568 ## Pathological Conditions, Signs and Symptoms
        - MESH:D014777 ## Virus Diseases
        - MESH:D014947 ## Wounds and Injuries
        - MESH:D017437 ## Skin and Connective Tissue Diseases
        - MESH:D052801 ## Male Urogenital Diseases
        - MESH:D064419 ## Chemically-Induced Disorders

There are some good examples in the cell_type and gocam templates, too.

I suspect the size of the NCBITaxon annotator isn't helping. You could try swapping out the taxon annotator for the general NCBI taxon slim (https://raw.githubusercontent.com/obophenotype/ncbitaxon/master/subsets/taxslim.obo) - use pronto:taxslim.obo instead of sqlite:obo:ncbitaxon. The taxslim.obo file should go in the root of wherever you are running ontogpt from. This may unfortunately be even slower, since it is not a sqlite database. It is, however, easier to edit, so you could cut it down to just the species/taxons you want to match and end up with something like:

format-version: 1.2
synonymtypedef: anamorph "anamorph"
synonymtypedef: blast_name "blast name"
synonymtypedef: equivalent_name "equivalent name"
synonymtypedef: genbank_acronym "genbank acronym"
synonymtypedef: genbank_anamorph "genbank anamorph"
synonymtypedef: genbank_common_name "genbank common name"
synonymtypedef: genbank_synonym "genbank synonym"
synonymtypedef: in_part "in-part"
synonymtypedef: OMO:0003003 "layperson synonym"
synonymtypedef: OMO:0003006 "misspelling"
synonymtypedef: OMO:0003007 "misnomer"
synonymtypedef: OMO:0003012 "acronym"
synonymtypedef: scientific_name "scientific name"
synonymtypedef: synonym "synonym"
synonymtypedef: teleomorph "teleomorph"
ontology: ncbitaxon/subsets/taxslim

[Term]
id: NCBITaxon:1
name: root
namespace: ncbi_taxonomy
synonym: "all" RELATED synonym []
xref: GC_ID:1
xref: PMID:30365038
xref: PMID:32761142

[Term]
id: NCBITaxon:28909
name: Cynodon dactylon
namespace: ncbi_taxonomy
synonym: "Bermuda grass" EXACT genbank_common_name []
xref: GC_ID:1
is_a: NCBITaxon:1
property_value: has_rank NCBITaxon:species
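
For reference, the schema change this implies is a sketch like the following, assuming taxslim.obo sits in the working directory:

  Organism:
    is_a: NamedEntity
    id_prefixes:
      - NCBITaxon
    annotations:
      annotators: pronto:taxslim.obo  # was: sqlite:obo:ncbitaxon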

Please post your full schema and we'll see if there are some other areas to optimize.

serenalotreck commented 5 months ago

This is great, thank you!

I'll hold off on adding specific identifiers to the schema until I've exhausted other options since it may not be a huge boost anyway. I'll try paring down the NCBI taxonomy early next week and let you know how it goes!

This is my current schema:

---
id: http://w3id.org/ontogpt/desiccation
name: desiccation
title: desiccationTemplate
description: A template for extracting desiccation related molecular entities and relations
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  linkml: https://w3id.org/linkml/
  desiccation: http://w3id.org/ontogpt/desiccation
default_prefix: desiccation
default_range: string
imports:
  - linkml:types
  - core
classes:
  EntityContainingDocument:
    tree_root: true
    is_a: NamedEntity
    attributes:
      genes:
        range: Gene
        multivalued: true
        description: A semicolon-separated list of genes.
      proteins:
        range: Protein
        multivalued: true
        description: A semicolon-separated list of proteins.
      molecules:
        range: Molecule
        multivalued: true
        description: A semicolon-separated list of molecules.
      organisms:
        range: Organism
        multivalued: true
        description: A semicolon-separated list of taxonomic terms of living things.
      gene_gene_interactions:
        range: GeneGeneInteraction
        multivalued: true
        description: A semicolon-separated list of gene-gene interactions.
      gene_protein_interactions:
        range: GeneProteinInteraction
        multivalued: true
        description: A semicolon-separated list of gene-protein interactions.
      gene_organism_relationships:
        range: GeneOrganismRelationship
        multivalued: true
        description: A semicolon-separated list of gene-organism relationships.
      protein_protein_interactions:
        range: ProteinProteinInteraction
        multivalued: true
        description: A semicolon-separated list of protein-protein interactions.
      protein_organism_relationships:
        range: ProteinOrganismRelationship
        multivalued: true
        description: A semicolon-separated list of protein-organism relationships.
      gene_molecule_interactions:
        range: GeneMoleculeInteraction
        multivalued: true
        description: A semicolon-separated list of gene-molecule interactions.
      protein_molecule_interactions:
        range: ProteinMoleculeInteraction
        multivalued: true
        description: A semicolon-separated list of protein-molecule interactions.
  Gene:
    is_a: NamedEntity
    id_prefixes:
      - GO
    annotations:
      annotators: sqlite:obo:go
      prompt: |-
        the name of a gene.
         Examples are Oropetium_20150105_12014, AT2G21490.
  Protein:
    is_a: NamedEntity
    id_prefixes:
      - PR
    annotations:
      annotators: sqlite:obo:pr
  Molecule:
    is_a: NamedEntity
    id_prefixes:
      - CHEBI
    annotations:
      annotators: gilda:, sqlite:obo:chebi
  Organism:
    is_a: NamedEntity
    id_prefixes:
      - NCBITaxon
    annotations:
      annotators: sqlite:obo:ncbitaxon
      prompt: |-
        the name of a taxonomic name or species.
         Examples are Bacillus subtilus, Bos taurus, blue whale.
  GeneGeneInteraction:
    is_a: CompoundExpression
    attributes:
      gene1:
        range: Gene
        annotations:
          prompt: the name of a gene.
      gene2:
        range: Gene
        annotations:
          prompt: the name of a gene that interacts with gene1. Interactions
           can include genetic interactions, or genes whose expression has an
           effect on another gene.
  GeneProteinInteraction:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
        annotations:
          prompt: the name of a gene.
      protein:
        range: Protein
        annotations:
          prompt: the name of a protein that interacts with the gene.
           Interactions can include physical interactions, like a transcription
           factor that binds to a gene promoter to affect expression, or indirect
           interactions, like when the action of a protein somewhere else in the
           cell impacts the expression of a gene.
  GeneOrganismRelationship:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
        annotations:
          prompt: the name of a gene.
      organism:
        range: Organism
        annotations:
          prompt: the name of an organism to which the gene belongs.
  ProteinProteinInteraction:
    is_a: CompoundExpression
    attributes:
      protein1:
        range: Protein
        annotations:
          prompt: the name of a protein.
      protein2:
        range: Protein
        annotations:
          prompt: the name of a protein that interacts with protein1. An example
           of an interaction is one protein binding another, or a protein
           phosphorylating another.
  ProteinOrganismRelationship:
    is_a: CompoundExpression
    attributes:
      protein:
        range: Protein
        annotations:
          prompt: the name of a protein.
      organism:
        range: Organism
        annotations:
          prompt: the name of an organism to which the protein belongs.
  GeneMoleculeInteraction:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
        annotations:
          prompt: the name of a gene.
      molecule:
        range: Molecule
        annotations:
          prompt: the name of a molecule that interacts with a gene. Examples
           include a methyl group being added to a segment of DNA as part of
           methylation.
  ProteinMoleculeInteraction:
    is_a: CompoundExpression
    attributes:
      protein:
        range: Protein
        annotations:
          prompt: the name of a protein.
      molecule:
        range: Molecule
        annotations:
          prompt: the name of a molecule that interacts with the protein. An
           example of a protein-molecule interaction is when an allosteric
           inhibitor binds an enzyme's allosteric site.

I'm not actually sure that I need gilda in the Molecule class, but I was following an example from another template so I left it in there.

Any other suggestions appreciated!

serenalotreck commented 5 months ago

I tried just running the verbatim version of taxslim.obo that you provided, and damn what a speedup!!

real    1m7.614s
user    0m57.557s
sys     0m27.794s

real    1m6.706s
user    0m57.438s
sys     0m27.624s

real    1m6.617s
user    0m57.798s
sys     0m28.672s

real    1m6.654s
user    0m57.600s
sys     0m23.323s

Anecdotally too it looks like the performance is basically the same for extracting the species in the example abstract I've been using.

While that is great news, I do still have an issue related to timing: even running at 1min per abstract, it would take 55 days to run this code. Having looked at the grounding code, I feel like there is definitely a way to speed it up internally, in terms of parallelization. Have you all thought about parallelizing this section of code and decided against it because of some kind of barrier, or is it an open problem that I could try my hand at a PR for?

EDIT: I realize that I haven't tried paring down the taxslim.obo file even further; however, I'm not sure how much performance gain to expect there, since it seems like it's already pretty slim.
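
If paring it further is worth trying, here is a naive hedged sketch: OBO stanzas are blank-line separated, so simple text filtering works for a quick experiment (the species list is hypothetical, and is_a references to removed terms would dangle, so a proper subsetting tool is safer for real use):

keep = {"Cynodon dactylon", "Eremochloa ophiuroides", "Paspalum vaginatum"}

with open("taxslim.obo") as fh:
    header, *stanzas = fh.read().split("\n\n")

# Keep non-Term stanzas plus any Term stanza naming a kept species
kept = [s for s in stanzas
        if not s.startswith("[Term]")
        or any(f"name: {name}" in s for name in keep)]

with open("taxslim_pared.obo", "w") as fh:
    fh.write("\n\n".join([header, *kept]))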

caufieldjh commented 4 months ago

Great! Thanks for providing the schema.

I'm 110% sure there are ways to speed up the grounding, so if you feel inspired, a PR is welcome!

Using a slim version of NCBI taxon in its OBO form may not be the fastest option - the sqlite form should be quicker, but to build that you would have to convert the OBO to OWL format and then to a semantic-sql database (like here: https://github.com/INCATools/semantic-sql?tab=readme-ov-file#creating-a-sqlite-database-from-an-owl-file).
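
A sketch of that conversion path, assuming robot and the semsql CLI are installed per the linked README:

# OBO -> OWL
robot convert -i taxslim.obo -o taxslim.owl

# OWL -> Semantic SQL sqlite database; semsql looks for taxslim.owl
# in the working directory
semsql make taxslim.db

The annotator line can then point at the local database rather than a registered ontology.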

PR and CHEBI are also larger ontologies, so there may be some speedup to be had in using other or smaller versions of those annotators.

serenalotreck commented 4 months ago

It definitely got faster after turning the slim Taxonomy into an sqlite database! Cut off about 10 seconds.

ChEBI has a Lite version of the ontology, which I made into an sqlite database and used. However, there was no speedup -- I think "Lite" in the ChEBI case refers to the amount of data associated with each term, but there are just as many terms, so matching takes just as long.

I think I'm going to turn my attention toward optimizing the code itself, as opposed to the databases I'm using for normalization -- it seems generally advantageous to do that in any case.

Thanks again for all your help, I'll open a PR when I have something to show!

serenalotreck commented 4 months ago

Hi all,

Just wanted to update on this and ask some questions.

Before spending time optimizing, I decided to make sure that switching to the slim ontology didn't affect performance too badly. Unfortunately, on a sample of 1,000 docs from my dataset, switching to slim results in a loss of about 50% of groundings, as well as dropping 20% of entities entirely. So for my use case, optimizing performance with slim ontologies doesn't seem to be sufficient. I noticed that you opened #363, which might help, but since optimizing the schema with the slim taxonomy helped so drastically in terms of time, I'm not sure that I'd ever be able to use the full taxonomy, which may be a dealbreaker for using the tool. So for the moment, I'm going to hold off on optimizing the grounding code itself.

I also noticed, while quantifying the outputs of the two graphs, that the relation extraction performance is absolutely abysmal regardless of which taxonomy DB I used. I don't have a gold standard for this dataset, but anecdotally speaking, for a dataset of 1,000 docs, only ~700 relations were extracted. I added specific prompts to each relation in the schema before running this analysis, so I'm not sure what else I can do to improve relation extraction. Wondering if you have any thoughts -- I looked for similar issues but didn't find any that specifically discussed engineering the relation prompts within the schema, so let me know if I should open a separate issue for this.

caufieldjh commented 3 months ago

Hi @serenalotreck - thanks for your patience, and thanks for looking into some areas for performance improvements! NCBITaxon is just so huge that you may see some benefit from removing the parts you definitely aren't interested in - or merging some into the Plant Ontology (it already has a small chunk of NCBITaxon but I'm not sure why: https://bioportal.bioontology.org/ontologies/PO/?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCBITaxon_131567)

As for the relation extraction results, there are a couple of things you could try. One is writing much more detailed slot descriptions, like this example from another template:

      medications:
        description: >-
          A semicolon-separated list of the patient's medications.
          This should include the medication name, dosage, frequency,
          and route of administration. Relevant acronyms: PO: per os/by mouth,
          PRN: pro re nata/as needed. 'Not provided' if not provided.
        range: DrugTherapy
        multivalued: true

Then this is the entity definition:

  DrugTherapy:
    is_a: CompoundExpression
    annotations:
      owl: IntersectionOf
    attributes:
      drug:
        description: >-
          The name of a specific drug for a patient's preventative
          or therapeutic treatment.
        range: Drug
      amount:
        description: >-
          The quantity or dosage of the drug, if provided.
          May include a frequency.
          N/A if not provided.
        range: QuantitativeValueWithFrequency
      dosage_by_unit:
        description: >-
          The unit of a patient's properties used to determine drug
          dosage. Often "kilogram". N/A if not provided.
        range: Unit
      duration:
        description: >-
          The duration of the drug therapy, if provided.
          N/A if not provided.
        range: QuantitativeValue
      route_of_administration:
        description: >-
          The route of administration for the drug therapy, if provided.
          N/A if not provided.
        range: string

To be fair, these details are usually provided adjacent to each other in the text, unlike the elements of many relations such as protein-protein interactions (except in those ideal cases like "protein A interacts with protein B"). But this kind of prompt engineering appears to help with relation extraction.