monarch-initiative / embiggen

🍇 Embiggen is the Python Graph Representation learning, Prediction and Evaluation submodule of the GRAPE library.

Resnik computations getting stuck #296

caufieldjh opened this issue 2 years ago · Status: Open

caufieldjh commented 2 years ago

In the continued adventures of Resnik: in semsim, I've found that running the Resnik computation on KGPhenio seems to get stuck. The DAG is 49,291 nodes, including the Upheno nodes, since we can't get paths between phenotype ontology nodes without them.

With the following code:

```python
from embiggen.similarities import DAGResnik  # assumed import path for embiggen 0.11.x

prefixes = ["HP", "MP"]
cutoff = 2.5

# `dag` is the KGPhenio DAG and `counts` the node counts, both built earlier.
resnik_model = DAGResnik()
resnik_model.fit(dag, node_counts=counts)
rs_df = resnik_model.get_similarities_from_bipartite_graph_from_edge_node_prefixes(
    source_node_prefixes=prefixes,
    destination_node_prefixes=prefixes,
    minimum_similarity=cutoff,
    return_similarities_dataframe=True,
).astype("category", copy=True)
```

this will consume as much memory as is available without ever completing. I tried it on a cloud instance with 128 GB of memory today, and the process was killed after running out of memory: roughly 4 hours at 100% of 16 vCPUs, with memory peaking around 55 GB, then growing to more than 120 GB within about ten more minutes.
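For scale, a rough back-of-the-envelope estimate, assuming (and this is only an assumption about the internals) that every matching pair is materialized before the minimum_similarity filter is applied, and taking the full 49,291-node DAG as an upper bound on the HP/MP terms:

```python
# Upper-bound estimate of the all-by-all pair volume before filtering.
hp_mp_nodes = 49_291          # every DAG node; an upper bound on HP/MP terms
pairs = hp_mp_nodes ** 2      # all-by-all bipartite pairs: ~2.43e9

bytes_per_pair = 8 + 2 * 8    # one float64 similarity + two int64 node ids
total_gb = pairs * bytes_per_pair / 1e9
print(f"{pairs:.2e} pairs, ~{total_gb:.0f} GB before filtering")
# 2.43e+09 pairs, ~58 GB
```

That lands in the same ballpark as the observed 55 GB plateau, before any per-object overhead from assembling the dataframe, so the blow-up past 120 GB may simply be the cost of holding all pairs plus the output structure at once.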

Is the Resnik calculation getting stuck in the DAG somewhere?

I've previously been able to get some output from this function, but only with an earlier version that didn't let me specify a minimum_similarity.

Embiggen is 0.11.38, ensmallen is 0.8.24.

@hrshdhgd @justaddcoffee

LucaCappelletti94 commented 2 years ago

Roughly how many edges are you expecting to receive?

pnrobinson commented 2 years ago

One optimization I have made in our Java code reflects the fact that if we start with an ontology whose subontologies do not intermingle, you do not need to explicitly calculate the IC for pairs of terms whose MICA you already know to be the root (e.g., liver and ear). This results in a large saving. Luca, can we do a code review and figure out whether this might make sense here?
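A minimal sketch of that shortcut, using hypothetical helpers rather than embiggen's actual API: since Resnik similarity is IC(MICA(a, b)) and the IC of the root is 0, any pair drawn from subontologies that only meet at the root can be answered immediately, without traversing the DAG.

```python
def resnik_with_root_shortcut(a, b, subontology_of, intermingles, ic, mica):
    """Resnik similarity IC(MICA(a, b)) with the disjoint-subontology shortcut.

    All four helpers are hypothetical: `subontology_of` maps a term to its
    top-level subontology, `intermingles` says whether two subontologies
    share any non-root ancestor, `ic` gives a term's information content,
    and `mica` finds a pair's most informative common ancestor.
    """
    if not intermingles(subontology_of(a), subontology_of(b)):
        # Subontologies with no shared non-root ancestor meet only at the
        # root, so the MICA is the root and IC(root) == 0.
        return 0.0
    return ic(mica(a, b))
```

With a positive minimum_similarity cutoff in place, those cross-subontology pairs would be filtered out anyway, so skipping them saves both the traversal and the memory to hold them.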

justaddcoffee commented 2 years ago

> Roughly how many edges are you expecting to receive?

For this experiment (HP versus MP phenotypes), I think there are roughly 49k nodes and 93k edges, so the graph isn't particularly large.

So a memory peak of >120 GB when computing the all-by-all Resnik similarity, while only storing values above a fairly high cutoff (>2.5 IC, I think), is kind of surprising to me...

> This results in a large saving. Luca, can we do a code review and figure out whether this might make sense here?

Ping me too, please! I'd like to sit in.