Inference on a current version of EFO

yonromai commented 11 months ago

Hi @dhimmel @eric-czech

I'm going to work on this ^ today as it's one of the last main tasks needed before the MONDO workshop.

I realized I have a few questions regarding the details of the task:

- We need to decide which specific model we'd like to use for the inference. I see 2 main contenders:
  - "Topological features + Text embeddings PCA64": The best-performing model, most Occam's razor compliant. Would probably be my top choice.
  - "All features (Topological + PCA 64 + LDA + KNN + GPT + Subsets)": Would have data on all features, which we can talk about during the workshop (show feature importance, ...).
- IIRC @dhimmel: You mentioned you'd like a dump of the feature values for all the nodes. Would you like to have these formatted as:
  - 1 file (e.g. CSV with ID, label and one column per feature)
  - 2 files (one CSV with ID and label, and another CSV with ID and feature columns)
  - (The 2-file approach might be a little cleaner since there are 80+ features. Also, the resulting feature file will likely be fairly large, hard to compress (lots of random-looking floats) and hard to open in a web browser.)
- In terms of the labels predicted:
  - Do we want the 3 classes or the 4 classes variant (including the non-disease label)?
  - Which labels do we want to output: ["01-disease-subtype", "02-disease-root", "03-disease-area"] or ["low", "medium", "high"]
  - Also the label output: Do you want only 1 column with the top label or 1 column per label proba + 1 column for the top label
- Just to confirm: This version of the ontology is the one we want to run inference on: "https://github.com/related-sciences/nxontology-data/raw/2ce01d8495024d46cbc54fb0c26a92500ad717e0/efo_otar_slim.json"

Thanks for the help!

eric-czech commented 11 months ago

> Would probably be my top choice.

Me too, let's use that one ("Topological features + Text embeddings PCA64").

> Do we want the 3 classes or the 4 classes variant

The 3-class version is ideal. That 4th class (non-disease) can always be a pre- or post-hoc processing detail; it was essentially a sentinel value used in place of a NULL. I'd prefer to just use NULL/None wherever necessary here instead.

> Which labels do we want to output: ["01-disease-subtype", "02-disease-root", "03-disease-area"] or ["low", "medium", "high"]

The ["low", "medium", "high"] labels sound good to me.

> Do you want only 1 column with the top label or 1 column per label proba + 1 column for the top label

Definitely 1 column per label proba + 1 column for the top label.
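
For illustration, a minimal sketch of that row layout, assuming the classifier returns one probability per class in the order low, medium, high. The column names, the `format_prediction` helper, the placeholder term ID and the probability values are all made up for this example and are not taken from the project:

```python
LABELS = ["low", "medium", "high"]  # assumed class order


def format_prediction(term_id: str, probas: list[float]) -> dict[str, object]:
    """Build one output row: one probability column per label plus the top label."""
    if len(probas) != len(LABELS):
        raise ValueError("expected one probability per label")
    row: dict[str, object] = {"id": term_id}
    for label, proba in zip(LABELS, probas):
        row[f"proba_{label}"] = proba
    # The top label is the label with the highest probability
    # (on a tie, the first label in LABELS wins).
    row["top_label"] = LABELS[probas.index(max(probas))]
    return row


# Hypothetical term ID and probabilities, for illustration only.
row = format_prediction("EFO:XXXXXXX", [0.1, 0.7, 0.2])
print(row)
```

Keeping the per-label probabilities next to the top label lets downstream users apply their own thresholds instead of relying only on the argmax.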

eric-czech commented 11 months ago

> This version of the ontology is the one we want to run inference on: "https://github.com/related-sciences/nxontology-data/raw/2ce01d8495024d46cbc54fb0c26a92500ad717e0/efo_otar_slim.json"

That's correct. That link is for EFO v3.57.0, which we will be using as well.

dhimmel commented 11 months ago

> We need to decide which specific model we'd like to use for the inference

So the GPT and subset features weren't improving the model!! 🤯

> Do we want the 3 classes or the 4 classes variant (including the non-disease label)?

I think we should be calculating non-disease on the version of EFO loaded and include that as a column. Those terms don't receive a prediction.

> This version of the ontology is the one we want to run inference on

In general, we want to be using the latest version, but fine to hard code the commit hash for now until we set up a continuous build job on CI to reprocess latest every month.
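
A small sketch of what hard-coding the commit could look like, using the URL posted earlier in this thread. The constant and function names are hypothetical; a future CI job could pass a different ref instead of the default:

```python
# Commit of related-sciences/nxontology-data quoted in this thread.
NXONTOLOGY_DATA_COMMIT = "2ce01d8495024d46cbc54fb0c26a92500ad717e0"


def efo_otar_slim_url(commit: str = NXONTOLOGY_DATA_COMMIT) -> str:
    """Return the raw GitHub URL of efo_otar_slim.json at the given commit."""
    return (
        "https://github.com/related-sciences/nxontology-data/raw/"
        f"{commit}/efo_otar_slim.json"
    )


print(efo_otar_slim_url())
```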

yonromai commented 11 months ago

> So the GPT and subset features weren't improving the model!! 🤯

Indeed, it appears that the GPT and subset features don't provide much additional knowledge compared to the text embedding vectors. If you want more insights into how much info is actually contained within the GPT and subset features, I can train and export feature importance for models without the text embedding features.

> I think we should be calculating non-disease on the version of EFO loaded and include that as a column. Those terms don't receive a prediction.

Sounds good. Could you provide the node attributes & values that I can use to filter out the non-disease nodes? 🙏

> In general, we want to be using the latest version, but fine to hard code the commit hash for now until we set up a continuous build job on CI to reprocess latest every month.

I see. Running the model in an automated (& remote) fashion will require a little more setup than running it on my machine.

dhimmel commented 11 months ago

> Could you provide the node attributes & values that I can use to filter out the non-disease node?

Based on Eric's private Slack message: for each node, call the roots method to find all top-level ancestors of that term. If all roots are in the following set of non-disease therapeutic areas (aka roots in EFO OTAR Slim), then it's non-disease:

term_label, term_id
animal_disease, EFO:0005932
measurement, EFO:0001444
phenotype, EFO:0000651
biological_process, GO:0008150
medical_procedure, EFO:0002571

yonromai commented 11 months ago

> Based on Eric's private Slack message, for each node call the roots method to find all top-level ancestors of that term. If all roots are in the following set of non-disease therapeutic areas (aka roots in EFO OTAR Slim), then it's non-disease:

Thanks for the help!

In order to sanity check my logic:

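from collections.abc import Iterable

from nxontology import NXOntology
from nxontology.node import NodeInfo

# get_efo_otar_slim is the project's loader for EFO OTAR Slim (import not shown here).

NON_DISEASE_THERAPEUTIC_AREAS: set[str] = {
    "EFO:0005932",  # animal_disease
    "EFO:0001444",  # measurement
    "EFO:0000651",  # phenotype
    "GO:0008150",  # biological_process
    "EFO:0002571",  # medical_procedure
}

def get_disease_nodes(
    nxo: NXOntology[str] | None = None,
) -> Iterable[NodeInfo[str]]:
    """
    Yield the disease nodes of the ontology, in sorted order.

    A node is non-disease only if all of its roots are in the set of
    non-disease therapeutic areas (aka roots in EFO OTAR Slim), so a node
    with at least one root outside that set is kept as a disease node.
    """
    if nxo is None:
        nxo = get_efo_otar_slim()
    for n in sorted(nxo.graph):
        node = nxo.node_info(n)
        if any(root not in NON_DISEASE_THERAPEUTIC_AREAS for root in node.roots):
            yield node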
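
As a dependency-free check of the same rule, here is a sketch on a toy graph. The `PARENTS` map, the `TOY:` term IDs and both helper functions are made up for illustration; only the two EFO root IDs come from the list above. It assumes a parentless node counts as its own root:

```python
# Two of the non-disease roots listed above: measurement and phenotype.
NON_DISEASE_ROOTS = {"EFO:0001444", "EFO:0000651"}

# Toy child -> parents map; all "TOY:" identifiers are invented for this example.
PARENTS: dict[str, set[str]] = {
    "EFO:0001444": set(),
    "EFO:0000651": set(),
    "TOY:disease_root": set(),
    "TOY:lab_value": {"EFO:0001444"},
    "TOY:mixed": {"EFO:0000651", "TOY:disease_root"},
}


def roots(node: str) -> set[str]:
    """Return the top-level ancestors of a node (the node itself if it has no parents)."""
    parents = PARENTS[node]
    if not parents:
        return {node}
    return set().union(*(roots(p) for p in parents))


def is_non_disease(node: str) -> bool:
    """A node is non-disease only if every one of its roots is a non-disease area."""
    return roots(node) <= NON_DISEASE_ROOTS


for n in sorted(PARENTS):
    print(n, "non-disease" if is_non_disease(n) else "disease")
```

The `TOY:mixed` node is the interesting case: it has one non-disease root and one other root, so under the "all roots" rule it is still treated as a disease node.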

I compared the count of disease nodes and total nodes in the latest EFO graph vs. in the training set:

Nodes in nxo: 25209
Disease nodes in nxo: 14727
Nodes in training set: 20380
Disease nodes in training set: 14353

So FWIW, the newer version of the ontology has a net 4,829 additional nodes (25209 vs. 20380), of which only 374 are disease nodes (14727 vs. 14353), so most of the newly added nodes appear to be non-disease nodes.

dhimmel commented 11 months ago

Noting the initial classifications in efo_otar_slim_v3.57.0_precisions.tsv.

related-sciences / nxontology-ml

Inference on a current version of EFO #30