opencobra / optlang

optlang - sympy based mathematical programming language
http://optlang.readthedocs.org/
Apache License 2.0
253 stars 52 forks source link

New class LinearConstraintsMatrix for better model building performance #239

Open axelvonkamp opened 3 years ago

axelvonkamp commented 3 years ago

When using larger models it is becoming increasingly apparent that rerpresenting each constraint as a SymPy expression incurs a significant performance penalty when setting up such models. Although SymPy expressions are very flexible and can also be used to represent non-linear constraints the vast majority of constraints are typically linear and can often be pooled as matrices. In order to pass such matrices efficiently to the solvers I suggest to introduce a new class LinearConstraintsMatrix which can be added to models in a similar fashion as Constraint objects but is really just a wrapper for a collection of constraints in the underlying solver object. In particular, a LinearConstraintsMatrix object would implement methods to directly get/set coefficients in the underlying solver object. A simpler, but less flexible solution, would be to add an abstract method add_lp_matrix to the Model class which can be implemented by the solver-specific interfaces. For CPLEX, this could look like this:

    import numpy as np
    def add_lp_matrix(self, lil_matrix, variables, lhs, rhs):
        # lil_matrix is a sparse matrix in LIL format
        # variables is either a list of names (str) or a list of indices (int)
        # adds constraints lhs <= lil_matrix <= rhs to the model
        if isinstance(variables[0], int):
            var_idx = np.array(variables)
        else:
            var_idx = np.array(self.problem.variables.get_indices(variables))
        senses = ["R"] * len(lhs)
        range_values = [0.] * len(lhs)
        for i in range(len(lhs)):
            if lhs[i] <= -cplex.infinity and rhs[i] >= cplex.infinity: # unbounded constraint
                rhs[i] = cplex.infinity
                senses[i] = "L"
            elif lhs[i] <= -cplex.infinity: # <= rhs[i] constraint
                senses[i] = "L"
            elif rhs[i] >= cplex.infinity: # >= lhs[i] constraint
                senses[i] = "G"
                rhs[i] = lhs[i]
            elif lhs[i] <= rhs[i]: # lhs[i] <= ... <= rhs[i]
                range_values[i] = lhs[i] - rhs[i]
            else: # rhs[i] > lhs[i]
                raise ValueError
        lin_expr = [[var_idx[r].tolist(), d] for r,d in zip(lil_matrix.rows, lil_matrix.data)]
        self.problem.linear_constraints.add(lin_expr=lin_expr,
               rhs=rhs, senses=senses, range_values=range_values)

However, it would then not be possible to transparently display, modify or remove these constraints. Nonetheless, it would be a simple method for efficienlty adding a large number of constraints in matrix format to a model.

cdiener commented 3 years ago

I think this is already handled by Constraint.set_linear_coefficients which will not use sympy at all. Also constraints don't get really added individually but collected internally and added in batches for speed up using the same strategy as in your example. When we benchmarked it using those strategies, it gives close to native backend performance. When I profiled it in MICOM (~100K - 1M) constraints it was rather limited by I/O or the backend but not really by optlang operations. But maybe that's different for your use case?

I think the issue is more that those strategies are not really documented anywhere.

Midnighter commented 3 years ago

In addition to what Christian said, please note that it can be very slow to iteratively build sympy expressions by adding terms, for example. That's why we use the aforementioned set_linear_coefficients as shown here for pFBA.

axelvonkamp commented 3 years ago

Thanks for pointing this out, I wasn't aware of this feature. I now did some benchmarking with a random sparse matrix

import cplex
import numpy as np
import scipy.sparse
import time

m = 10000
n = 10000
mat = np.random.rand(m, n)
mat[mat > 0.01] = 0 # make sparse
print(np.count_nonzero(mat))
mat = scipy.sparse.lil_matrix(mat)
col_names = ["V"+str(i) for i in range(n)]
lb = [-1000]*n
ub = [1000]*n

Integrating the whole matrix at once

o = optlang.cplex_interface.Model()
fv = [optlang.cplex_interface.Variable(nm, lb=l, ub=u) for nm,l,u in zip(col_names, lb, ub)]
o.add(fv)
o.update()
start_time = time.monotonic()
o.add_lp_matrix(mat, [f.name for f in fv], [0]*mat.shape[0], [0]*mat.shape[0])
print(time.monotonic() - start_time)

takes about 0.4 seconds while doing it constraint-wise

o = optlang.cplex_interface.Model()
fv = [optlang.cplex_interface.Variable(nm, lb=l, ub=u) for nm,l,u in zip(col_names, lb, ub)]
o.add(fv)
co = [optlang.cplex_interface.Constraint(0, lb=0, ub=0) for i in range(mat.shape[0])]
o.add(co)
o.update()
start_time = time.monotonic()
for c,r,d in zip(co, mat.rows, mat.data):
    c.set_linear_coefficients({fv[i]: j for i,j in zip(r, d)})
print(time.monotonic() - start_time)

takes about 16 seconds. Of course, this is for CPLEX and the difference may be smaller for other solvers. I guess the main reason for the difference is that the variable name to variable index lookup only needs to be done once when passing the whole matrix while it needs to be done for the variables in each constraint in the second case. I don't know if the time difference warrants introducing a new class LinearConstraintsMatrix as proposed though.

cdiener commented 2 years ago

Hi, if you look at the profile of the second case:

         18924435 function calls in 6.914 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.914    6.914 {built-in method builtins.exec}
        1    0.000    0.000    6.914    6.914 <string>:1(<module>)
        1    0.023    0.023    6.914    6.914 <ipython-input-45-567fbfb41482>:1(g)
    10000    0.038    0.000    6.892    0.001 cplex_interface.py:220(set_linear_coefficients)
    10000    0.036    0.000    6.133    0.001 _subinterfaces.py:1528(set_coefficients)
    20000    0.014    0.000    5.204    0.000 _baseinterface.py:33(_conv)
    20000    0.015    0.000    5.190    0.000 _aux_functions.py:353(convert)
    20000    0.728    0.000    5.130    0.000 _aux_functions.py:334(convert_sequence)
  1998270    0.567    0.000    4.164    0.000 _aux_functions.py:292(_cachelookup)
  1009135    0.514    0.000    3.597    0.000 _baseinterface.py:50(_get_index)
   999135    0.877    0.000    3.025    0.000 _procedural.py:1008(getcolindex)
...

you see that what makes the difference is not calling self.problem.linear_constraints.add with a matrix or row but, as you suspected, the name lookup in cplex. This is rather related to #116 since the cplex variable name lookup is insanely slow. You could try to port the strategy from #126 to Constraint.set_linear_coefficients, however, there would still be some overhead since you need to create the index mapping for each constraint anew. An alternative approach would be to accept variable indices in set_linear_coefficient (but this would have to be ported to all solvers) or to maintain an index cache (but updating this becomes slow when deleting variables). Maybe there is a faster version, I am not sure if the name validation in Model._get_variable_indices is really necessary if you would call an update before that.