CiaranWelsh commented 3 years ago

As pointed out by @matthiaskoenig here and @luciansmith here we should have a benchmarking test suite.

This point has been raised in other issues, but I'm repeating it here to bring it to the foreground.

CiaranWelsh commented 3 years ago

@luciansmith @hsauro @matthiaskoenig opinions on how best to do this?

I'm thinking of creating two separate performance test suites, one that measures performance of roadrunner when running many individual simulations with a single model and another that measures performance of building many models. For the "many simulations" suite, I'm thinking a panel of 10 or so models (but not sure which?) and do 100K deterministic simulations. For the "many models" suite, I'm thinking of measuring the time it takes to build the curated section of biomodels.

I've got a couple of issues to fix with the save/load state in the new Jit compiler but will move on to this after I've fixed them - so early next week sometime.

matthiaskoenig commented 3 years ago

In addition I would add models of increasing size based on certain patterns to ensure scaling of times with model size. E.g.

linear chain of length n (varying n)
n reactions (uncoupled)
network of n nodes with increasing number of reactions between nodes Most of these should scale linearly in runtimes/load times or (at worst case quadratic). This would allow benchmarking runtimes but also the scaling for larger models.

I could contribute some single large models to the test suite which I would be interested to run as fast as possible (zonated liver metabolism with many reactions and species).

hsauro commented 3 years ago

This is a good idea. Some big models would be very helpful.

H

On Fri, Nov 5, 2021 at 7:34 AM Matthias König @.***> wrote:

In addition I would add models of increasing size based on certain patterns to ensure scaling of times with model size. E.g.

linear chain of length n (varying n)

n reactions (uncoupled)

network of n nodes with increasing number of reactions between nodes Most of these should scale linearly in runtimes/load times or (at worst case quadratic). This would allow benchmarking runtimes but also the scaling for larger models.

I could contribute some single large models to the test suite which I would be interested to run as fast as possible (zonated liver metabolism with many reactions and species).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sys-bio/roadrunner/issues/899#issuecomment-961944522, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIBSDXYIY6YZ2KKVP5RTOLUKPTLNANCNFSM5GVG4U4A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Herbert Sauro, Professor University of Washington, Bioengineering 206-685-2119, www.sys-bio.org @.*** Books: http://books.analogmachine.org/

CiaranWelsh commented 3 years ago

Okay, I'm close to the point where I can push forwards with this now -- @matthiaskoenig if you have a model or two that you can contribute towards a performance test suite I'll be happy to include them, cheers.

Note, linear chain graph currently looks like this if you're interested.

matthiaskoenig commented 3 years ago

@CiaranWelsh I will send the models beginning of next week.

Some comments:

it is sufficient to use some model sizes, e.g. use n = [2, 4, 8, 16, 32, ...] which then makes it easy to see the change in runtime when doubling the system size
it is better to run multiple repeats and use the mean of the runtimes (otherwise you have a lot of noise, e.g. due to cache evictions and whatever)
plot on log-log axes
what is compile time? Is this model load time? 3 seconds seem very long for 300 species

CiaranWelsh commented 3 years ago

what is compile time?

I mean Jit compiling the model to machine code. This involves reading the sbml, generating llvm IR code and then compiling that code to assembly/binary. The result is a set of functions that we have a handle to in LLVMExecutableModel. I agree, 3 seconds is very long, and prohibitive when you have many such large models. I'll try to reduce this if I can.

it is better to run multiple repeats

Yes, Hebert and I discussed this yesterday and came to the same conclusion. However, we thought it best to take the "bottom", not the average.

Good idea with using log space.

Thanks for your input on this Mathias, very helpful.

hsauro commented 3 years ago

It would be nice if we could separate out the SBML loading time from the simulation time. I noticed that loading a model using the direct API was 3 times faster than loading the same model using SBML. SBML is a real bottleneck.

I assume loading a model saved in binary format is much faster?

CiaranWelsh commented 3 years ago

I agree, load time tends to completely dwarf simulation time so separating them out is important. Here's some result for simulation time. Code:

    rr = RoadRunner(sbml)

    times = np.zeros((numRepeats,))

    for i in range(numRepeats):
        # print(i)
        rr.resetAll()
        startTime = time.time_ns()
        data = rr.simulate(0, simEndTime, numSimSteps)
        times[i] = time.time_ns() - startTime

Results: Top row, Hanekom; bottom row, mass action; random generated network using Michaels RNG package; small = 10, medium = 50, large = 100 species. Number of simulation points = 5000 (the small mass action network needed a lot of simulation points in python or a lot of simulation times were just 0!). Each of these models has a steady state. The end simulation time for each model is the "eyeballed" time where model dynamics stop changing. Simulations were repeated 50 times, results shown are averages with error. Shown is 1 model out of 3 for each category.

SBML is a real bottleneck.

I've started using a profiler called "Intel VTune" - seems to be a good choice on windows. With larger models roadrunner spends a lot of time calling sbml's get getElementBySId. The algorithm used is a linear search (in sbml's ListOf), I'm currently experimenting with changing this to binary search to do this in O(log N) time.

CiaranWelsh commented 3 years ago

The load time problem hits hardest when we try to simulate a batch of different models, because they all need to be compiled. One option that mitigates the expertise needed to optimize the build process would be to build a roadrunner container of some kind. The container would allow us to take care of parallel builds and simulations under the hood, that is - building many models using a queue of size $n \in {1, number of cores}$.

A rough API (off the top of my head) would be something like

#include "RoadRunnerVector.h" // any reason to use a list or map instead?
using namespace rr;
RoadRunnerVector rrv("path/to/sbml/directory"); 
// OR
std::vector<std::string> sbmlFiles({sbmlFile1, ...});
RoadRunnerVector rrv(sbmlFiles);

//// access 
// If we use a vector we can use model index for constant access time
rrv[0]->simulate(...);

// If we use a map we get constant access time using model name
rrv["MyModel"]->simulate(...). 

// (might be a C++ version of OrderedDict where we can do both?)

//// simulation 
// We can take care of parallel simulations under the hood
rrv->simulate(...); // simulates all models using `n` cores. 
// similarly with steadystate and/or other important methods.

matthiaskoenig commented 3 years ago

Hi all,

yes load times have to be separated and are the dominating factor for single simulations. For a recent manuscript we did a comparison of SBSCL vs. roadrunner vs COPASI. SBSCL based on JAVA is much faster in loading so that a single simulation is much faster with Java ;). See Section 3.2 for details in this supplement of this paper https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab669/6374495
yes, the state loading is much faster. I am basically only loading an SBML model once and saving the state. As long as the SBML file hash has not changed I use the corresponding state file. This speeds up things dramatically. See example below
selections start to play a crucial role in multi-processing in python, because the data has to be serialized and copied between the cores. I.e. the less selections and time points and the smaller the results matrix the faster the multi-core simulations. For single core simulations this is not so important.

Here some example benchmarks for the loading with a large model

import roadrunner
import time

ts = time.time()
r = roadrunner.RoadRunner("body36_rbc_green2_flat.xml")
te = time.time()
print(f"roadrunner load time: {(te-ts):8.4f} [s]")

ts = time.time()
r.saveState("body36_rbc_green2_flat.xml.state")
te = time.time()
print(f"save state time: {(te-ts):8.4f} [s]")

ts = time.time()
r2 = roadrunner.RoadRunner()
r2.loadState("body36_rbc_green2_flat.xml.state")
te = time.time()
print(f"load state time: {(te-ts):8.4f} [s]")

ts = time.time()
s = r.simulate(start=0, end=100, steps=100)
te = time.time()
print(f"simulation time: {(te-ts):8.4f} [s]")

Results in

roadrunner load time:   5.2288 [s]
save state time:   0.0459 [s]
load state time:   0.1952 [s]
simulation time:   0.0807 [s]

i.e. it takes 5 seconds to load the SBML model which is around ~100x the time for a simulation. Loading from the state is still slower then a simulation but takes around the same time. Model is attached in the zip file body36_rbc_green2_flat.zip]

CiaranWelsh commented 3 years ago

I'm pretty new to reading and interpreting these "flame graphs" -- does this look like we are reading from sbml three times in roadrunner?

VTune profiler on a loading large model, no simulation, 10 times.

summary:

Elapsed Time:   79.663s
    CPU Time:   78.713s
    Total Thread Count: 122
    Paused Time:    0s

edit

Looking into this further, it does indeed look like we are loading the sbml from string into a SBMLDocument 3 times.

hsauro commented 3 years ago

I assume Java is using its own sbml parser. This would mean that libsbml is the problem.

hsauro commented 3 years ago

That’s a nice study. We should ask Lucian how to speed up libsbml.

luciansmith commented 3 years ago

I can confirm that we are indeed reading every model from a string three times in the code. This is a clear place where we'd get a huge speedup by only calling it once!

hsauro commented 3 years ago

Yes that sounds like a good target to look at.

matthiaskoenig commented 3 years ago

SBSCL uses JSBML to parse the model. From the performance I would expect libsbml to be on the same level/speed as JSBML. Especially because many implementations and algorithms are very similar between libsbml and JSBML.

Thing which really could take time in roadrunner:

multiple parsing of the XML (you already are looking into this)
performing validation of the SBML (as far as I remember roadrunner is doing this for models. This is an expensive operation and there should be/is a flag for deactivating this)
the actual LLVM model compilation step (it would be important how much of the total time/cost is spend in SBML parsing and generation of the RoadRunner instance)

luciansmith commented 3 years ago

I can say, having looked into this today, that while multiple parsing was an issue (and still is, though not as badly, for SBML-comp files), I think I now have a fix for that. Validation is not performed on successfully-loaded SBML files; we only validate models that don't successfully load in case that reveals the problem.

The model compilation step may well take an appreciable amount of time, but at least at the macro level, I don't think it's doing anything multiple times. I hope ;-)

matthiaskoenig commented 2 years ago

Thanks. Removing the repeated model loading definitely sped things up. I.e. on the large model this went from 4 seconds to around 3.3 - 3.5 seconds. I.e. more then 10% speedup in model loading. So it seems the complete model parsing in SBML is probably around 300ms, leaving around 3 seconds for the model compilation. So the actual libsbml parsing seems only to be a minor contribution of around 10% to the complete model loading.

sys-bio / roadrunner

Build a new performance benchmarking suite #899

edit