opencobra / cobrapy

COBRApy is a package for constraint-based modeling of metabolic networks.
http://opencobra.github.io/cobrapy/
GNU General Public License v2.0

Parallelize warmup point generation in sampling #647

Open · MuhammedHasan opened this issue 6 years ago

MuhammedHasan commented 6 years ago

I am currently working on sampling with Recon2. Since sampling is an expensive operation and Recon2 is a pretty big model, I am benchmarking and looking for possible optimizations of the sampling code using smaller networks.

At the moment, I am a bit confused because generating the warmup points takes more time than the FVA analysis itself, as shown in the benchmark below.

[screenshot: benchmark timings, 2017-12-28]

The paper explains warm-up point generation in the following quote: "We propose to combine a part of the warm-up phase of gpSampler and ACHR. The resulting algorithm is called optGpSampler (see Table 1. Algorithm 1). First, 2n warm-up points are generated using part (1) of gpSampler’s warm-up procedure by successively minimizing and maximizing the flux through each reaction."

What I understand from this is FVA, but when I read the cobrapy implementation, generate_fva_warmup only maximizes each reaction's forward and reverse variables rather than just calling the FVA method (I think this is because it needs the full solution in addition to the min-max values, but the developers could generalize FVA to cover this). The problem is that this implementation takes 7-8x more time than FVA analysis and makes initialization very slow.
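
To illustrate what I mean, here is a rough sketch of the serial warmup loop as I understand it (simplified, not cobrapy's actual code; the objective handling and primal retrieval are my assumptions about the optlang interface):

```python
import numpy as np


def generate_warmup_sketch(model):
    """Simplified sketch of FVA-style warmup generation.

    Unlike plain FVA, which only records the min/max objective values,
    each optimization here keeps the *entire* primal vector as one
    warmup point.
    """
    warmup = []
    for variable in model.solver.variables:
        # Maximize one forward/reverse variable at a time, as in the
        # "2n warm-up points" procedure from the optGpSampler paper.
        model.objective = model.problem.Objective(variable, direction="max")
        model.slim_optimize()
        # Keep the full flux vector, not just the optimum.
        warmup.append(list(model.solver.primal_values.values()))
    return np.array(warmup)
```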

Moreover, parallel sampling does not help much on big models because all cores sit idle until the end of the initialization step, and initialization may dominate sampling on big models. The developers may want to parallelize the initialization step too.

In addition to those problems, the developers may want to use a popular multiprocessing library like joblib rather than the raw Python multiprocessing API. This would not only make the developers' lives easier but could also improve performance.
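
For example, with joblib the same loop could be parallelized roughly like this (a sketch under my assumptions, not working cobrapy code; the helper function is hypothetical and a real implementation would need to handle the per-worker model copies more carefully):

```python
from joblib import Parallel, delayed


def warmup_point(model, variable_name, direction):
    # Hypothetical helper: optimize one variable and return all primals.
    variable = model.solver.variables[variable_name]
    model.objective = model.problem.Objective(variable, direction=direction)
    model.slim_optimize()
    return list(model.solver.primal_values.values())


def generate_warmup_parallel(model, n_jobs=4):
    names = [v.name for v in model.solver.variables]
    # joblib pickles the arguments to each worker and joins the results
    # in order; no manual Pool or Queue management is needed.
    # (In practice you would send the model to each worker only once.)
    return Parallel(n_jobs=n_jobs)(
        delayed(warmup_point)(model, name, direction)
        for name in names
        for direction in ("max", "min")
    )
```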

cdiener commented 6 years ago

The time difference is due to the fact that warmup point generation needs to get all primal values from the solver in each iteration, not just one particular flux. Since solving the model is really fast, transporting arrays between the C interface and Python becomes the bottleneck here. That is the difference you see. The array with the primal values is already assembled in C, but Python object creation still has an overhead.
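
A quick timing comparison along these lines illustrates the point (a sketch using the small textbook test model that ships with cobrapy's test module; exact numbers will depend on your solver):

```python
import time

import cobra.test

model = cobra.test.create_test_model("textbook")

# Objective value only: a single float crosses the solver/Python boundary.
start = time.time()
for _ in range(1000):
    model.slim_optimize()
print("solve only:      %.2fs" % (time.time() - start))

# Full primal vector: every variable value becomes a Python object.
start = time.time()
for _ in range(1000):
    model.slim_optimize()
    primals = model.solver.primal_values
print("solve + primals: %.2fs" % (time.time() - start))
```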

However, after creating the sampler instance you can pickle the instance and load it later to skip the initialization step. Moreover, for Recon2 you would need to sample a lot of points (1e6-1e9) to get a representative sample, and that would probably take longer than the initialization step.
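
For example (a sketch; the sampler module path differs between cobrapy versions, and the SBML file path is just a placeholder):

```python
import pickle

from cobra.io import read_sbml_model
from cobra.sampling import OptGPSampler  # cobra.flux_analysis.sampling in older releases

model = read_sbml_model("recon2.xml")  # placeholder path to your Recon2 file
sampler = OptGPSampler(model, processes=4)  # the slow warmup happens here

with open("recon2_sampler.pkl", "wb") as fh:
    pickle.dump(sampler, fh)

# Later, in another session: load and sample without re-running the warmup.
with open("recon2_sampler.pkl", "rb") as fh:
    sampler = pickle.load(fh)
samples = sampler.sample(10000)
```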

Parallelization of the initialization step is on our roadmap, but it is not the most urgent item, I am afraid.

Regarding joblib, I think the question is how specifically it would make our code easier. We do have some problems with Python's multiprocessing API, but those are not solved by joblib AFAIK. So adding another heavy dependency might not be the best solution. PRs or just some code examples on gitter are very welcome though :)

MuhammedHasan commented 6 years ago

Pickling does not help because I need to run the model under different constraints; thus, I need to reinitialize the model multiple times.

Moreover, for Recon2, when I sample 1e5 points, initialization takes ~1k seconds and sampling ~11k seconds on a single core. But I have 24 cores in this machine; thus, with good parallelization, sampling should take less time than initialization. I cannot run sampling with bigger sample sizes due to memory and storage limitations, and the sampling code throws a recursion depth limit error for bigger sample sizes.

Joblib automatically manages most of the parallelization work. For example, it optimizes communication by creating shared memory between processes, creates pools, and joins results with very nice syntax.
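
For instance, large numpy arrays above a size threshold are automatically dumped to a memmap and shared between workers instead of being pickled for every task. A minimal illustration:

```python
import numpy as np
from joblib import Parallel, delayed

data = np.random.rand(5000, 5000)


def column_sum(array, j):
    return array[:, j].sum()


# Arguments larger than max_nbytes are memmapped once and then shared
# read-only by all worker processes instead of being serialized per task.
sums = Parallel(n_jobs=4, max_nbytes="1M")(
    delayed(column_sum)(data, j) for j in range(10)
)
```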

If I can find time alongside my research, I hope to send PRs.

Thanks for your help,

cdiener commented 6 years ago

Hi, could you share which exact constraints give you the recursion error? Also, which version of Recon 2 (2.00-2.04, 2.1, 2.2)? Does joblib automatically convert objects into shared memory, like any Python object?

MuhammedHasan commented 6 years ago

I did not log the constraint; thus, I cannot share it. My Recon version is 2.2.

Yes, joblib can convert Python objects to shared memory. Saying it works for all objects would be an overconfident statement, but its management of large numpy arrays is nice.

MuhammedHasan commented 6 years ago

Hey, I started working on my fork of cobrapy. You may review it and give me advice, but keep in mind that it is not the final version.

cdiener commented 6 years ago

Hi, thanks. I took a look and agree that the shared memory handling is simpler that way. However, reading up on it in the joblib docs, it seems that joblib manages shared memory by using numpy memmaps, writing to an in-memory file under Linux but a real file on Windows. The old version always managed the shared memory directly in RAM, so this might have performance consequences on Windows. Do you have some benchmarks of your fork compared to the old version? The rest does not really seem to take less code with joblib than with multiprocessing, so it would depend on the benchmarks, I think.

Why do you need to specify the chunks for the FVA part? Since it is a Pool, shouldn't joblib be able to manage the distribution of individual tasks to the processes by itself? At least multiprocessing.Pool does that by default...
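
As far as I can tell, joblib's default batching should handle this on its own, something like:

```python
from joblib import Parallel, delayed

# batch_size="auto" (the default) lets joblib group small tasks into
# batches dynamically, much like multiprocessing.Pool's chunksize.
results = Parallel(n_jobs=4, batch_size="auto")(
    delayed(pow)(i, 2) for i in range(1000)
)
```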

cdiener commented 6 years ago

Adjusting the title and marking this as a feature we should add: sampling itself is already parallel, so it makes sense for warmup point generation to be parallel too.

Midnighter commented 6 years ago

@cdiener is this something you'd still like to do? Otherwise I'd add a "help-wanted" label and advertise it.

cdiener commented 6 years ago

Don't really have bandwidth right now, so advertising it would be good.