serenalotreck opened this issue 8 months ago
So I did some digging, and it looks like OpenAI natively supports batching: all you have to do is pass a list of prompts to the create method. However, they give the warning that "the response object may not return completions in the order of the prompts, so always remember to match responses back to prompts using the index field."
I took a look through the code to see where things would need to change to allow batching. The OpenAI client's complete function only takes a single prompt, but I think you could just pass a list of prompts without changing that function itself -- besides having to change how the response is processed.
I then looked to see where that function is called, to see where you'd have to change the input. Where complete is called in the CLI module, it also only has access to one doc at a time.
I'm having some trouble following the code in cli.py to find where I would need to make changes to allow batching -- wondering what your thoughts are?
Sure! Batching would be a useful feature to support.
Is there additional slowdown elsewhere in the system, like in the grounding steps? 4 hours for 50 documents does sound slower than I've seen, but I think it's a factor of the total number of queries being passed to the OpenAI API, and that depends on how many classes there are in your schema, how much text each document contains, etc.
So to add the batching:
- a new option would be needed, something like --use_batch - see the azure_select option, as that's another option that gets passed to the OpenAI client (https://github.com/monarch-initiative/ontogpt/blob/6f7dccb5d41796ee6089c6ac89ffcc4f7808bb1c/src/ontogpt/cli.py#L223)
- in the client, can probably just pass prompt a list instead of that single messages value
- each response payload object would then need to be handled (and written to the cache) individually, like here (https://github.com/monarch-initiative/ontogpt/blob/6f7dccb5d41796ee6089c6ac89ffcc4f7808bb1c/src/ontogpt/clients/openai_client.py#L116); a rough sketch of this pattern is below
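To make the batching idea concrete, here's a minimal sketch of the general pattern (not OntoGPT code; the function name and parameters are placeholders). It assumes the legacy completions endpoint, since that's the one that accepts a list of prompts, and it matches each completion back to its prompt using the index field as the OpenAI docs advise:

```python
# Sketch only: send several prompts in one request and pair the completions
# back up with their prompts by index, since order is not guaranteed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete_batch(prompts, model="gpt-3.5-turbo-instruct", max_tokens=512):
    """Return completion texts in the same order as the input prompts."""
    response = client.completions.create(
        model=model,
        prompt=prompts,      # a list of prompt strings, not a single prompt
        max_tokens=max_tokens,
    )
    ordered = sorted(response.choices, key=lambda choice: choice.index)
    return [choice.text for choice in ordered]
```

The caching and output-writing steps would still need to iterate over the individual completions, as noted above.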
Is there additional slowdown elsewhere in the system, like in the grounding steps?

That is a fantastic question -- I hadn't thought about it, but in another NER algorithm I used a while ago, the classification was super fast but the grounding took a prohibitively long time, so I ended up doing all the classification first and then all the grounding at once to speed it up (as opposed to on a document-by-document basis). Is there a quick way for me to check whether it's the grounding vs. OpenAI before I go ahead and try to implement batching, or should I just go through the code and add time statements?
Yes - the easiest way is to repeat exactly the same extract command, since OntoGPT will cache the OpenAI results. The amount of time the command takes to complete the second time is then almost entirely time spent on grounding (except for the time spent retrieving results from the local cache, which should be very fast).
Ok, so it definitely looks like it's the grounding. Here are the times from running it on a single doc with the time command in bash:
real 6m6.517s
user 0m57.297s
sys 4m40.834s
real 6m41.741s
user 1m3.146s
sys 5m17.148s
real 5m45.977s
user 0m55.273s
sys 4m28.963s
real 5m54.585s
user 0m58.343s
sys 4m43.914s
The output for this doc looks like:
extracted_object:
  id: 157a2bef-ec52-495b-b53d-0ee3c4dbcf64
  label: Drought-resistance mechanisms of seven warm-season turfgrasses under surface soil drying
  genes:
    - AUTO:-
  proteins:
    - AUTO:-
  molecules:
    - AUTO:-
  organisms:
    - NCBITaxon:28909
    - NCBITaxon:866555
    - NCBITaxon:158149
    - NCBITaxon:309978
    - NCBITaxon:262758
  gene_gene_interactions:
    - gene1: AUTO:BRCA1
      gene2: AUTO:TP53
  gene_protein_interactions:
    - gene: AUTO:TP53
      protein: PR:000004803
  gene_organism_relationships:
    - gene: AUTO:Notch
      organism: NCBITaxon:9606
  protein_protein_interactions:
    - protein1: AUTO:protein%20A
      protein2: PR:000009761
  protein_organism_relationships:
    - protein: AUTO:Hemoglobin
      organism: NCBITaxon:9606
  protein_molecule_interactions:
    - protein: AUTO:enzyme
      molecule: AUTO:allosteric%20inhibitor
named_entities:
  - id: AUTO:-
    label: '-'
  - id: NCBITaxon:28909
    label: Cynodon dactylon
  - id: NCBITaxon:866555
    label: Eremochloa ophiuroides
  - id: NCBITaxon:158149
    label: Paspalum vaginatum
  - id: NCBITaxon:309978
    label: Zoysia japonica
  - id: NCBITaxon:262758
    label: Zoysia tenuifolia
  - id: AUTO:BRCA1
    label: BRCA1
  - id: AUTO:TP53
    label: TP53
  - id: PR:000004803
    label: BRCA1
  - id: AUTO:Notch
    label: Notch
  - id: NCBITaxon:9606
    label: Homo sapiens
  - id: AUTO:protein%20A
    label: protein A
  - id: PR:000009761
    label: protein B
  - id: AUTO:Hemoglobin
    label: Hemoglobin
  - id: AUTO:enzyme
    label: enzyme
  - id: AUTO:allosteric%20inhibitor
    label: allosteric inhibitor
EDIT: Let me know what your thoughts are about the best way to tackle this!
I'm also getting this error with -O kgx; not sure if you know off the top of your head what might be going on here, or if I should open another issue about it:
Traceback (most recent call last):
File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/bin/ontogpt", line 8, in <module>
sys.exit(main())
File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/mnt/home/lotrecks/anaconda3/envs/ontogpt/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/mnt/ufs18/home-118/lotrecks/haai_vanburen_lab/ontogpt/src/ontogpt/cli.py", line 393, in extract
write_extraction(results, output, output_format, ke, template)
File "/mnt/ufs18/home-118/lotrecks/haai_vanburen_lab/ontogpt/src/ontogpt/cli.py", line 115, in write_extraction
output.seek(0)
io.UnsupportedOperation: underlying stream is not seekable
Wow, that time is much longer than I would expect.
It may be because NCBITaxon and PR are two very large ontologies, and they're used as annotators in multiple classes in your schema.
A quick check for this may be to temporarily disable annotation for a specific class, or change the domain of the corresponding slot to string so it doesn't ground - then run again and see if the process is faster.
There are some ways to further tune the schema, plus alternative strategies for annotation, and most of them come down to using smaller sets of potential identifiers (e.g., you may be able to use a slim version of NCBITaxon if you're primarily expecting taxons to be grasses).
The other error is probably a separate issue, maybe one that happens when the output stream isn't the expected type.
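For what it's worth, the traceback suggests output.seek(0) is being called on a stream that can't be rewound (e.g. stdout); a guard along these lines, shown purely as a hypothetical sketch rather than the actual fix, would avoid the exception:

```python
# Hypothetical guard for the io.UnsupportedOperation above: only rewind
# streams that actually support seeking (pipes and stdout do not).
def rewind_if_possible(stream) -> None:
    if stream.seekable():
        stream.seek(0)
```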
Also depends on how you're grounding: semsql, bioportal, Gilda…
I believe there may be an embarrassingly simple optimization for grounding - the naive strategy may be reindexing on each invocation.
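If reindexing per invocation is the culprit, one direction (purely a sketch, since it depends on how OntoGPT constructs its annotators internally) would be to build each OAK adapter once and reuse it across documents, e.g. by memoizing on the annotator descriptor:

```python
# Sketch only: reuse one OAK adapter per descriptor instead of rebuilding it
# for every document. get_adapter() is the standard oaklib entry point.
from functools import lru_cache

from oaklib import get_adapter

@lru_cache(maxsize=None)
def cached_adapter(descriptor: str):
    # e.g. descriptor = "sqlite:obo:ncbitaxon"
    return get_adapter(descriptor)
```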
A quick check for this may be to temporarily disable annotation for a specific class, or change the domain of the corresponding slot to string so it doesn't ground - then run again and see if the process is faster.
I'll give this a shot when I get the chance!
There are some ways to further tune the schema, plus alternative strategies for annotation, and most of them come down to using smaller sets of potential identifiers (e.g., you may be able to use a slim version of NCBITaxon if you're primarily expecting taxons to be grasses)
This is also a great suggestion; I'm expecting basically all of them to be plant species.
I haven't taken an in-depth look at the grounding code yet -- is there a way to save grounding for last and then only do it on unique entities? In the past, with other algorithms, I've disabled grounding until I had all the entities, then removed any duplicates to reduce the number of times the grounding process has to run.
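For illustration, the "ground each unique mention once" idea could look roughly like this; ground_mention() is a stand-in for whatever annotator lookup is actually used, not a real OntoGPT function:

```python
# Toy sketch: ground only unique entity mentions and reuse the results.
def ground_unique(docs_entities, ground_mention):
    """docs_entities: one list of entity mentions (strings) per document.
    Returns the same structure with each mention grounded, computing the
    grounding for each unique mention only once."""
    cache = {}
    grounded_docs = []
    for doc in docs_entities:
        grounded = []
        for mention in doc:
            if mention not in cache:
                cache[mention] = ground_mention(mention)
            grounded.append(cache[mention])
        grounded_docs.append(grounded)
    return grounded_docs
```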
Also depends on how you're grounding: semsql, bioportal, Gilda…
@cmungall what grounding approach am I using if my annotators are all prefixed with sqlite:obo:, and where should I go to read more about it? I'm relatively new to the ontology stuff.
I believe there may be an embarrassingly simple optimization for grounding - the naive strategy may be reindexing on each invocation.
Could you elaborate on this?
Annotators prefixed with sqlite: mean they use pre-built Semantic SQL versions of an ontology - more details here: https://github.com/INCATools/semantic-sql
Those are already one of the faster options.
As for the other grounding strategies, they're described in the OAK docs here: https://incatools.github.io/ontology-access-kit/packages/implementations/index.html and in some detail in the OntoGPT docs here: https://monarch-initiative.github.io/ontogpt/custom/ (see the section "The classes"). OntoGPT can essentially use any of the OAK adapters, though in the context of this issue, some will be much slower than others.
Ok, so I did a closer read of the paper, took a look at the implementation of grounding, and took a cursory skim through the semantic-sql docs.
Annotators prefixed with sqlite: mean they use pre-built Semantic SQL versions of an ontology - more details here: https://github.com/INCATools/semantic-sql Those are already one of the faster options.
In this case, since they're already faster and provide the ontologies I want, I'm planning to stick with the Semantic SQL annotators.
I've got a few clarifying questions that I couldn't find answers for.
In any case, I'd super appreciate some practical pointers on where to begin optimizing, whether that be changes to the code (is there a place for parallelization here?) or semantic-sql-related ontology changes.
EDIT: Forgot to mention that I made an equivalent schema with no annotators, and it was many many times faster, but as noted in the paper, has absolutely atrocious performance. So it's definitely the grounding side of things causing the issue.
Does grounding involve any LLM calls? In the paper it's not clear to me whether that's the case or if you just compared LLM-based grounding to your approach, but in the code I don't see any obvious LLM calls, so I wanted to make sure my understanding was correct.
Nope! The grounding process does not involve the LLM at all and the LLM is never aware of the parameters you use for grounding unless you explicitly tell it (e.g., if the description for a class says "these should be Gene Ontology terms" or something). This is intentional since LLMs are prone to hallucinating completely nonexistent IDs and misrepresenting the connections between terms and IDs.
I wanted to ask about this line from the paper: “[using LinkML] allows for a full representation of the necessary schema elements while incorporating LinkML’s powerful mechanism for specifying static and dynamic value sets. For example, a value set can be constructed as a declarative query of the form “include branches A, B and C from ontology O1, excluding sub-branch D, and include all of ontology O2”.” You had mentioned earlier that paring down the ontologies I'm using would speed things up. Is this value set construction how I would do that? I'm not totally sure if that's the case, since the schema I'm using is something I designed, and I'm just calling these other ontologies as annotators.
I'm not certain if this will provide major performance boosts, but it's worth a try.
For the class to use the value set with, use the slot_usage slot on id with values_from, then set values_from to an enum. The enum can use the reachable_from slot to define the value set as including all child terms of the listed identifiers.
There are some examples in the ctd_ner schema:
Disease:
  is_a: NamedEntity
  annotations:
    annotators: "sqlite:obo:mesh, sqlite:obo:mondo, sqlite:obo:hp, sqlite:obo:ncit, sqlite:obo:doid, bioportal:meddra"
    prompt.examples: cardiac asystole, COVID-19, Headache, cancer
  # For the purposes of evaluating against BC5CDR, we force normalization to MESH
  id_prefixes:
    - MESH
  slot_usage:
    id:
      pattern: "^MESH:[CD][0-9]{6}$"
      values_from:
        - MeshDiseaseIdentifier
enums:
  ...
  MeshDiseaseIdentifier:
    reachable_from:
      source_ontology: obo:mesh
      source_nodes:
        - MESH:D001423 ## Bacterial Infections and Mycoses
        - MESH:D001523 ## Mental Disorders
        - MESH:D002318 ## Cardiovascular Diseases
        - MESH:D002943 ## Circulatory and Respiratory Physiological Phenomena
        - MESH:D004066 ## Digestive System Diseases
        - MESH:D004700 ## Endocrine System Diseases
        - MESH:D005128 ## Eye Diseases
        - MESH:D005261 ## Female Urogenital Diseases and Pregnancy Complications
        - MESH:D006425 ## Hemic and Lymphatic Diseases
        - MESH:D007154 ## Immune System Diseases
        - MESH:D007280 ## Disorders of Environmental Origin
        - MESH:D009057 ## Stomatognathic Diseases
        - MESH:D009140 ## Musculoskeletal Diseases
        - MESH:D009358 ## Congenital, Hereditary, and Neonatal Diseases and Abnormalities
        - MESH:D009369 ## Neoplasms
        - MESH:D009422 ## Nervous System Diseases
        - MESH:D009750 ## Nutritional and Metabolic Diseases
        - MESH:D009784 ## Occupational Diseases
        - MESH:D010038 ## Otorhinolaryngologic Diseases
        - MESH:D010272 ## Parasitic Diseases
        - MESH:D012140 ## Respiratory Tract Diseases
        - MESH:D013568 ## Pathological Conditions, Signs and Symptoms
        - MESH:D014777 ## Virus Diseases
        - MESH:D014947 ## Wounds and Injuries
        - MESH:D017437 ## Skin and Connective Tissue Diseases
        - MESH:D052801 ## Male Urogenital Diseases
        - MESH:D064419 ## Chemically-Induced Disorders
There are some good examples in the cell_type and gocam templates, too.
I suspect the size of the NCBITaxon annotator isn't helping.
You could try swapping out the taxon annotator for the general NCBI taxon slim (https://raw.githubusercontent.com/obophenotype/ncbitaxon/master/subsets/taxslim.obo) - use pronto:taxslim.obo instead of sqlite:obo:ncbitaxon. The taxslim.obo file should go in the root of wherever you are running ontogpt from.
This may unfortunately be even slower, since it is not a sqlite database. It is, however, easier to edit, so you could cut it down to just the species/taxons you want to match (a rough trimming sketch follows the example below) and end up with something like:
format-version: 1.2
synonymtypedef: anamorph "anamorph"
synonymtypedef: blast_name "blast name"
synonymtypedef: equivalent_name "equivalent name"
synonymtypedef: genbank_acronym "genbank acronym"
synonymtypedef: genbank_anamorph "genbank anamorph"
synonymtypedef: genbank_common_name "genbank common name"
synonymtypedef: genbank_synonym "genbank synonym"
synonymtypedef: in_part "in-part"
synonymtypedef: OMO:0003003 "layperson synonym"
synonymtypedef: OMO:0003006 "misspelling"
synonymtypedef: OMO:0003007 "misnomer"
synonymtypedef: OMO:0003012 "acronym"
synonymtypedef: scientific_name "scientific name"
synonymtypedef: synonym "synonym"
synonymtypedef: teleomorph "teleomorph"
ontology: ncbitaxon/subsets/taxslim
[Term]
id: NCBITaxon:1
name: root
namespace: ncbi_taxonomy
synonym: "all" RELATED synonym []
xref: GC_ID:1
xref: PMID:30365038
xref: PMID:32761142
[Term]
id: NCBITaxon:28909
name: Cynodon dactylon
namespace: ncbi_taxonomy
synonym: "Bermuda grass" EXACT genbank_common_name []
xref: GC_ID:1
is_a: NCBITaxon:1
property_value: has_rank NCBITaxon:species
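If it's useful, here is a rough way to do that trimming with plain text processing (nothing OntoGPT-specific; the kept IDs are just examples, and parents referenced by is_a, like NCBITaxon:1, should stay in the set so references don't dangle):

```python
# Rough sketch: keep only the [Term] stanzas whose id is in a chosen set,
# then write the trimmed ontology back out.
KEEP_IDS = {"NCBITaxon:1", "NCBITaxon:28909", "NCBITaxon:866555"}  # example taxa

def trim_obo(in_path="taxslim.obo", out_path="taxslim_trimmed.obo"):
    with open(in_path) as fh:
        header, *stanzas = fh.read().split("\n[Term]\n")
    kept = [
        stanza for stanza in stanzas
        if any(line.startswith("id: ") and line[4:].strip() in KEEP_IDS
               for line in stanza.splitlines())
    ]
    with open(out_path, "w") as out:
        out.write(header)
        for stanza in kept:
            out.write("\n[Term]\n" + stanza)
```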
Please post your full schema and we'll see if there are some other areas to optimize.
This is great, thank you!
I'll hold off on adding specific identifiers to the schema until I've exhausted other options since it may not be a huge boost anyway. I'll try paring down the NCBI taxonomy early next week and let you know how it goes!
This is my current schema:
---
id: http://w3id.org/ontogpt/desiccation
name: desiccation
title: desiccationTemplate
description: A template for extracting desiccation related molecular entities and relations
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  linkml: https://w3id.org/linkml/
  desiccation: http://w3id.org/ontogpt/desiccation
default_prefix: desiccation
default_range: string
imports:
  - linkml:types
  - core
classes:
  EntityContainingDocument:
    tree_root: true
    is_a: NamedEntity
    attributes:
      genes:
        range: Gene
        multivalued: true
        description: A semicolon-separated list of genes.
      proteins:
        range: Protein
        multivalued: true
        description: A semicolon-separated list of proteins.
      molecules:
        range: Molecule
        multivalued: true
        description: A semicolon-separated list of molecules.
      organisms:
        range: Organism
        multivalued: true
        description: A semicolon-separated list of taxonomic terms of living things.
      gene_gene_interactions:
        range: GeneGeneInteraction
        multivalued: true
        description: A semicolon-separated list of gene-gene interactions.
      gene_protein_interactions:
        range: GeneProteinInteraction
        multivalued: true
        description: A semicolon-separated list of gene-protein interactions.
      gene_organism_relationships:
        range: GeneOrganismRelationship
        multivalued: true
        description: A semicolon-separated list of gene-organism relationships.
      protein_protein_interactions:
        range: ProteinProteinInteraction
        multivalued: true
        description: A semicolon-separated list of protein-protein interactions.
      protein_organism_relationships:
        range: ProteinOrganismRelationship
        multivalued: true
        description: A semicolon-separated list of protein-organism relationships.
      gene_molecule_interactions:
        range: GeneMoleculeInteraction
        multivalued: true
        description: A semicolon-separated list of gene-molecule interactions.
      protein_molecule_interactions:
        range: ProteinMoleculeInteraction
        multivalued: true
        description: A semicolon-separated list of protein-molecule interactions.
  Gene:
    is_a: NamedEntity
    id_prefixes:
      - GO
    annotations:
      annotators: sqlite:obo:go
      prompt: |-
        the name of a gene.
        Examples are Oropetium_20150105_12014, AT2G21490.
  Protein:
    is_a: NamedEntity
    id_prefixes:
      - PR
    annotations:
      annotators: sqlite:obo:pr
  Molecule:
    is_a: NamedEntity
    id_prefixes:
      - CHEBI
    annotations:
      annotators: gilda:, sqlite:obo:chebi
  Organism:
    is_a: NamedEntity
    id_prefixes:
      - NCBITaxon
    annotations:
      annotators: sqlite:obo:ncbitaxon
      prompt: |-
        the name of a taxonomic name or species.
        Examples are Bacillus subtilus, Bos taurus, blue whale.
  GeneGeneInteraction:
    is_a: CompoundExpression
    attributes:
      gene1:
        range: Gene
        annotations:
          prompt: the name of a gene.
      gene2:
        range: Gene
        annotations:
          prompt: the name of a gene that interacts with gene1. Interactions
            can include genetic interactions, or genes whose expression has an
            effect on another gene.
  GeneProteinInteraction:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
        annotations:
          prompt: the name of a gene.
      protein:
        range: Protein
        annotations:
          prompt: the name of a protein that interacts with the gene.
            Interactions can include physical interactions, like a transcription
            factor that binds to a gene promoter to affect expression, or indirect
            interactions, like when the action of a protein somewhere else in the
            cell impacts the expression of a gene.
  GeneOrganismRelationship:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
        annotations:
          prompt: the name of a gene.
      organism:
        range: Organism
        annotations:
          prompt: the name of an organism to which the gene belongs.
  ProteinProteinInteraction:
    is_a: CompoundExpression
    attributes:
      protein1:
        range: Protein
        annotations:
          prompt: the name of a protein.
      protein2:
        range: Protein
        annotations:
          prompt: the name of a protein that interacts with protein1. An example
            of an interaction is one protein binding another, or a protein
            phosphorylating another.
  ProteinOrganismRelationship:
    is_a: CompoundExpression
    attributes:
      protein:
        range: Protein
        annotations:
          prompt: the name of a protein.
      organism:
        range: Organism
        annotations:
          prompt: the name of an organism to which the protein belongs.
  GeneMoleculeInteraction:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
        annotations:
          prompt: the name of a gene.
      molecule:
        range: Molecule
        annotations:
          prompt: the name of a molecule that interacts with a gene. Examples
            include a methyl group being added to a segment of DNA as part of
            methylation.
  ProteinMoleculeInteraction:
    is_a: CompoundExpression
    attributes:
      protein:
        range: Protein
        annotations:
          prompt: the name of a protein.
      molecule:
        range: Molecule
        annotations:
          prompt: the name of a molecule that interacts with the protein. An
            example of a protein-molecule interaction is when an allosteric
            inhibitor binds an enzyme's allosteric site.
I'm not actually sure that I need gilda in the Molecule class, but I was following an example from another template, so I left it in there.
Any other suggestions appreciated!
I tried just running the verbatim version of taxslim.obo that you provided, and damn, what a speedup!!
real 1m7.614s
user 0m57.557s
sys 0m27.794s
real 1m6.706s
user 0m57.438s
sys 0m27.624s
real 1m6.617s
user 0m57.798s
sys 0m28.672s
real 1m6.654s
user 0m57.600s
sys 0m23.323s
Anecdotally too it looks like the performance is basically the same for extracting the species in the example abstract I've been using.
While that is great news, I do still have an issue related to timing: even running at 1min per abstract, it would take 55 days to run this code. Having looked at the grounding code, I feel like there is definitely a way to speed it up internally, in terms of parallelization. Have you all thought about parallelizing this section of code and decided against it because of some kind of barrier, or is it an open problem that I could try my hand at a PR for?
EDIT: I realize that I haven't tried paring down the taxslim.obo file even further; however, I'm not sure how much performance gain to expect there, since it seems like it's already pretty slim?
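In the meantime, one possible stopgap for throughput (not an OntoGPT feature; the extract flags and directory layout here are assumptions based on the basic docs, so adjust to your setup) is to fan the abstracts out across worker processes, one CLI call per document. Note that each process then repeats its own ontology setup, which is exactly the per-invocation overhead discussed above:

```python
# Workaround sketch: run `ontogpt extract` once per abstract across a pool of
# worker processes. Flags, template name, and paths are assumptions.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

TEMPLATE = "desiccation"          # the custom template discussed above
INPUT_DIR = Path("abstracts")     # assumed layout: one .txt file per abstract
OUTPUT_DIR = Path("extractions")

def extract_one(doc: Path) -> int:
    out = OUTPUT_DIR / (doc.stem + ".yaml")
    cmd = ["ontogpt", "extract", "-t", TEMPLATE, "-i", str(doc), "-o", str(out)]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    OUTPUT_DIR.mkdir(exist_ok=True)
    docs = sorted(INPUT_DIR.glob("*.txt"))
    with ProcessPoolExecutor(max_workers=4) as pool:
        exit_codes = list(pool.map(extract_one, docs))
    print(f"{exit_codes.count(0)}/{len(docs)} documents extracted")
```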
Great! Thanks for providing the schema.
I'm 110% sure there are ways to speed up the grounding, so if you feel inspired, a PR is welcome!
Using a slim version of NCBITaxon in its OBO form may not be the fastest option - to speed that up, you would have to get it in OWL format and then convert it to a semantic-sql database (like here: https://github.com/INCATools/semantic-sql?tab=readme-ov-file#creating-a-sqlite-database-from-an-owl-file).
PR and CHEBI are also larger ontologies, so there may be some speedup to be had in using other or smaller versions of those annotators.
It definitely got faster after turning the slim Taxonomy into an sqlite database! Cut off about 10 seconds.
ChEBI has a Lite version of the ontology, which I made into an sqlite database and used. However, there was no speedup -- I think that's because "Lite" in the ChEBI case refers to the amount of data associated with each term in the ontology, while the number of terms is the same.
I think I'm going to turn my attention towards optimizing the code itself as opposed to the databases I'm using for normalization -- it seems generally advantageous to do that in any case.
Thanks again for all your help, I'll open a PR when I have something to show!
Hi all,
Just wanted to update on this and ask some questions.
Before spending time optimizing, I decided to make sure that switching to the slim ontology didn't affect performance too badly. Unfortunately, on a sample of 1,000 docs from my dataset, switching to slim results in a loss of about 50% of groundings, as well as dropping 20% of entities entirely. So for my use case, optimizing performance with slim ontologies doesn't seem to be sufficient. I noticed that you opened #363, which might help, but since optimizing the schema with the slim taxonomy helped so drastically in terms of time, I'm not sure that I'd ever be able to use the full taxonomy, which may be a dealbreaker for being able to use the tool. So for the moment, I'm going to hold off on doing any optimization of the grounding code itself.
I also noticed while quantifying the outputs of the two graphs that the relation extraction performance, regardless of which taxonomy DB I used, is absolutely abysmal. I don't have a gold standard for this dataset, but just anecdotally speaking, for a dataset of 1,000 docs, only ~700 relations were extracted. I added specific prompts to each relation in the schema before running this analysis, so I'm not sure what else I can do to get better relation extraction performance. Wondering if you have any thoughts -- I looked for similar issues but didn't find any that specifically talked about engineering the relation prompts within the schema, so let me know if I should be opening a separate issue for this.
Hi @serenalotreck - thanks for your patience, and thanks for looking into some areas for performance improvements! NCBITaxon is just so huge that you may see some benefit from removing the parts you definitely aren't interested in - or merging some into the Plant Ontology (it already has a small chunk of NCBITaxon but I'm not sure why: https://bioportal.bioontology.org/ontologies/PO/?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCBITaxon_131567)
As for the relation extraction results, a couple of things you could try:
- Provide more explicit examples for both the entities and the relations.
- Include a high-level generic relation class and add it as a slot on the EntityContainingDocument class; this will essentially extract every association (within reason, since LLMs are rarely as comprehensive as may be preferable), so you can see whether relations are being detected but not sorted properly. You can even use the core schema's Triple class, which has these slots on it already:
Triple:
  abstract: true
  description: Abstract parent for Relation Extraction tasks
  is_a: CompoundExpression
  attributes:
    subject:
      range: NamedEntity
    predicate:
      range: RelationshipType
    object:
      range: NamedEntity
    qualifier:
      range: string
      description: >-
        A qualifier for the statements, e.g. "NOT" for negation
    subject_qualifier:
      range: NamedEntity
      description: >-
        An optional qualifier or modifier for the subject of the statement, e.g. "high dose" or "intravenously administered"
    object_qualifier:
      range: NamedEntity
      description: >-
        An optional qualifier or modifier for the object of the statement, e.g. "severe" or "with additional complications"
I've anecdotally found it useful to include all the necessary details for a complex expression in the first slot where it's referenced; e.g., in the dietitian_notes schema, this slot on the root class collects details about medications:
medications:
  description: >-
    A semicolon-separated list of the patient's medications.
    This should include the medication name, dosage, frequency,
    and route of administration. Relevant acronyms: PO: per os/by mouth,
    PRN: pro re nata/as needed. 'Not provided' if not provided.
  range: DrugTherapy
  multivalued: true
Then this is the entity definition:
DrugTherapy:
  is_a: CompoundExpression
  annotations:
    owl: IntersectionOf
  attributes:
    drug:
      description: >-
        The name of a specific drug for a patient's preventative
        or therapeutic treatment.
      range: Drug
    amount:
      description: >-
        The quantity or dosage of the drug, if provided.
        May include a frequency.
        N/A if not provided.
      range: QuantitativeValueWithFrequency
    dosage_by_unit:
      description: >-
        The unit of a patient's properties used to determine drug
        dosage. Often "kilogram". N/A if not provided.
      range: Unit
    duration:
      description: >-
        The duration of the drug therapy, if provided.
        N/A if not provided.
      range: QuantitativeValue
    route_of_administration:
      description: >-
        The route of administration for the drug therapy, if provided.
        N/A if not provided.
      range: string
To be fair, these details are usually provided adjacent to each other, unlike many relations like protein-protein interactions (except in those ideal cases like "protein A interacts with protein B"). But this kind of prompt engineering appears to help with relation extraction.
I have a corpus of ~75,000 abstracts that I want to make a KG out of using OntoGPT. After 4 hours, it only got through 50 documents -- not super promising!
I took a look through the docs to see if there was a parallelization option, but didn't find anything -- is there a better way to run OntoGPT over tons of documents besides making a bunch of separate small directories and submitting a bunch of different jobs?
If you have a thought about where in the code it would make sense to add parallelization capabilities I'm happy to give a shot at opening a PR!