Semsimian calculation taking a huge amount of time and resources

monarch-initiative / semsimian

Simple rust implementation of semantic similarity

BSD 3-Clause "New" or "Revised" License


Semsimian calculation taking a huge amount of time and resources #115

Open souzadevinicius opened 9 months ago

souzadevinicius commented 9 months ago

I'm trying to calculate semantic similarity profiles with the Phenio ontology, comparing the following pairs of term sets:

HPxHP
HPxMP
HPxZP

HP term set: 17097 entries
MP term set: 13809 entries
ZP term set: 39373 entries
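For a sense of scale, here is a rough sketch of how many term pairs each comparison implies. It assumes an all-by-all comparison in which every term in set 1 is paired with every term in set 2 before any threshold is applied; that is an assumption about the workload, not a statement about semsimian's internals. The names `SET_SIZES` and `pair_count` are illustrative only.

```python
# Term-set sizes as reported in this issue.
SET_SIZES = {"HP": 17097, "MP": 13809, "ZP": 39373}


def pair_count(set1: str, set2: str) -> int:
    """Number of ordered term pairs in an all-by-all comparison (assumed model)."""
    return SET_SIZES[set1] * SET_SIZES[set2]


for a, b in [("HP", "HP"), ("HP", "MP"), ("HP", "ZP")]:
    print(f"{a}x{b}: {pair_count(a, b):,} candidate term pairs")
```

Under that assumption each run covers hundreds of millions of candidate pairs, so any per-row cost (such as label lookup) and any permissive threshold can have a large effect on runtime and output size.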

Ontology used: Phenio

Library versions:

semsimian                  0.2.11
oaklib                     0.5.24

Command line execution example:

runoak --stacktrace -vvv  -i semsimian:sqlite:phenio-monarch.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file hp_terms.txt \
--min-ancestor-information-content 4.0 \
--min-jaccard-similarity 0 \
--autolabel \
-O csv \
-o phenio-monarch-hp-hp.0.semsimian.tsv

I tried to run these experiments locally (on machines with 32 and 64 GB RAM) and on an HPC cluster, where the process was still writing output after more than one week and was then killed.

caufieldjh commented 9 months ago

One week, oof! I have previously been able to complete a very similar PHENIO HP vs MP on a GCloud instance with < 64 GB memory, though it did consume a lot of that resource. What kind of resource usage do you see with higher thresholds, like a min of 10 for AIC and 0.4 for Jaccard? The labeling can also consume a surprisingly large amount of resources and is very redundant for this sort of comparison, so I'd suggest dropping that parameter and mapping CURIEs to labels after the comparison is complete.
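A minimal sketch of the "map CURIEs to labels afterwards" idea, using only the Python standard library. It assumes the similarity output has ID columns named `subject_id`, `object_id` and `ancestor_id`, that it is tab-delimited, and that a CURIE-to-label lookup has already been built from some other source; all three are assumptions to check against the actual file. The function name `add_labels` and the toy row below are hypothetical.

```python
import csv
import io

# Assumed ID column names; adjust to match the real output header.
ID_COLUMNS = ("subject_id", "object_id", "ancestor_id")


def add_labels(src, dst, labels, delimiter="\t"):
    """Stream rows from src to dst, filling a *_label column for each ID column."""
    reader = csv.DictReader(src, delimiter=delimiter)
    fieldnames = list(reader.fieldnames or [])
    # e.g. subject_id -> subject_label, only for ID columns actually present
    label_cols = {
        c: c[: -len("_id")] + "_label" for c in ID_COLUMNS if c in fieldnames
    }
    for label_col in label_cols.values():
        if label_col not in fieldnames:
            fieldnames.append(label_col)
    writer = csv.DictWriter(
        dst, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        for id_col, label_col in label_cols.items():
            # Keep any existing label if the lookup has no entry for this CURIE.
            row[label_col] = labels.get(row[id_col], row.get(label_col) or "")
        writer.writerow(row)


# Toy input with a made-up score, only to show the shape of the transformation.
demo_in = io.StringIO(
    "subject_id\tobject_id\tjaccard_similarity\n"
    "HP:0000001\tMP:0000001\t0.5\n"
)
demo_out = io.StringIO()
demo_labels = {"HP:0000001": "All", "MP:0000001": "mammalian phenotype"}
add_labels(demo_in, demo_out, demo_labels)
print(demo_out.getvalue())
```

Because each distinct CURIE is looked up in an in-memory dictionary, labels are resolved once per term rather than once per output row by the similarity run itself.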

matentzn commented 9 months ago

@souzadevinicius let's try a 0.4 Jaccard threshold and remove the labelling option, and see if that makes it at least possible to run HP-ZP.

souzadevinicius commented 8 months ago

> @souzadevinicius let's try a 0.4 Jaccard threshold and remove the labelling option, and see if that makes it at least possible to run HP-ZP.

cmungall commented 7 months ago

what's the status of this?

justaddcoffee commented 7 months ago

Discussing in the MWF hackathon now

We were thinking we would deploy semsimian/oak on our build server and run on a regular cadence. This way we have an objective measure of how much memory/time we are talking about here, and we can also emit a new artifact with a PURL so people can use this downstream.
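One way to get such an objective measure is to wrap the run in a small script that records wall-clock time and peak memory of the child process. The sketch below uses only the Python standard library and is Unix-only (the `resource` module is not available on Windows). The helper name `measure` is hypothetical, and the demo command is a trivial placeholder rather than a real similarity run.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd, returning its exit code, wall time, and peak RSS of child processes."""
    start = time.perf_counter()
    proc = subprocess.run(cmd, check=False)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the largest resident set size among waited-for children so far.
    # Units differ by platform: kilobytes on Linux, bytes on macOS.
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": proc.returncode,
        "wall_seconds": elapsed,
        "max_rss": usage.ru_maxrss,
    }


# Placeholder command; substitute the actual runoak invocation as a list of arguments.
stats = measure([sys.executable, "-c", "pass"])
print(stats)
```

Logging these three numbers per build would make runs with and without labelling directly comparable.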

@caufieldjh perhaps we already have a repo to do this?

justaddcoffee commented 7 months ago

Ah okay, Harry has already made a repo for this here

hrshdhgd commented 7 months ago

Sorry I'm a little late to the party, but @souzadevinicius, did you run this without --autolabel, or specify --no-autolabel? Just to get an idea of how fast it'll be.

justaddcoffee commented 7 months ago

The last build in Aug '23 took 1h and 18m.

> Sorry I'm a little late to the party, but @souzadevinicius, did you run this without --autolabel, or specify --no-autolabel? Just to get an idea of how fast it'll be.

Yep, good question @hrshdhgd

Harry says a previous build with autolabel turned on took 15h, so this might be at least one thing that is slowing down Vinicius's run.

caufieldjh commented 7 months ago

Note that the Jenkins build performed by that repo takes a bit over 1 hr without autolabel and ~15 hrs w/ autolabel.

caufieldjh commented 7 months ago

For reasons not entirely clear to me, this build took 3 hours. Here's the command:

runoak -i semsimian:sqlite:obo:phenio similarity --no-autolabel -p i --set1-file HPO_terms.txt --set2-file MP_terms.txt -O csv -o HP_vs_MP_semsimian.tsv --min-ancestor-information-content 4.0

That's with:

semsimian-0.2.11
oaklib-0.5.25

The product: http://kg-hub-public-data.s3.amazonaws.com/monarch/HP_vs_MP_semsimian.tsv.tar.gz