opencobra / cobrapy

COBRApy is a package for constraint-based modeling of metabolic networks.
http://opencobra.github.io/cobrapy/
GNU General Public License v2.0

[Feature] Nullspace calculation similar to COBRA (MATLAB) #1393

Open dagl1 opened 1 month ago

dagl1 commented 1 month ago


Problem

Using the same model, the loopless implementation (add_loopless in cobrapy, addLoopLawConstraints in the COBRA Toolbox) takes significantly longer in cobrapy than its MATLAB counterpart (30-300 seconds versus <1 second). Timing each step, I find that the nullspace function, and specifically the line u, s, vh = np.linalg.svd(A) inside it, accounts for the vast majority (98%+) of the computation time.
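For context, the dense recipe behind that line is essentially the textbook SVD null space construction; a minimal sketch (the same idea as cobrapy's current function, though not a verbatim copy of it) looks like this, and the full decomposition of a large stoichiometric matrix is what dominates the runtime:

```python
import numpy as np

def svd_nullspace(A, atol=1e-13, rtol=0.0):
    """Null space basis of A via a full (dense) SVD; slow for large matrices."""
    A = np.atleast_2d(A)
    u, s, vh = np.linalg.svd(A)       # the expensive step for big matrices
    tol = max(atol, rtol * s[0])      # threshold for "numerically zero"
    rank = int((s >= tol).sum())
    return vh[rank:].conj().T         # columns span the null space of A
```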

In the MATLAB implementation a custom (it seems?) function is used for the nullspace, and I wondered whether it would be possible to implement and add this to how cobrapy handles the nullspace approximation: https://github.com/opencobra/cobratoolbox/blob/572cad82f50b93d7ddeab5d2b5e5e9fca9da4250/external/base/utilities/sparseNull.m#L4

Solution

Implement MATLAB's approach to approximating the nullspace so that cobrapy achieves performance similar to the COBRA Toolbox.

Alternatives

I attempted using scipy's sparse svds function, but it was incompatible (it produced no null space at all), as well as scipy's regular svd function (which, in the few tests I did, is about 2x as slow).

Anything else?

No response

cdiener commented 1 month ago

Yes absolutely, that would be nice to have, especially since null spaces are also used in other places (flux sampling for instance). Do you want to take a stab at this? Happy to review a PR...

dagl1 commented 1 month ago

I suppose, haha. I must say I will first have to figure out most of what goes on in there; I was porting a MATLAB pipeline to cobrapy and stumbled upon this. If there are people more familiar with the optimizations already implemented in scipy, and with why the MATLAB sparseNull is actually faster, they might be better suited. I will eventually figure something out that makes this run in reasonable time, and when I do, you will know about it.

cdiener commented 1 month ago

Okay, sounds good. Otherwise, somebody else may get to it at some point. Another option would be experimenting with a sparse QR decomposition; that may be a bit simpler to implement.
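For illustration, a rank-revealing QR route could look roughly like the sketch below. It uses the dense scipy.linalg.qr (scipy has no built-in sparse QR, so a truly sparse version would need an external package such as sparseqr); the tolerance and names are placeholders:

```python
import numpy as np
from scipy.linalg import qr

def qr_nullspace(A, tol=None):
    """Null space basis of A from a pivoted (rank-revealing) QR of A.T."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    q, r, _ = qr(A.T, mode="full", pivoting=True)
    diag = np.abs(np.diag(r))
    if tol is None:
        # relative threshold scaled by the largest diagonal entry of R
        tol = max(A.shape) * np.finfo(float).eps * (diag[0] if diag.size else 0.0)
    rank = int((diag > tol).sum())
    return q[:, rank:]  # columns of Q beyond the numerical rank span null(A)
```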

dagl1 commented 1 month ago

@cdiener So I wrote a functioning (I think, I still need to write some tests) sparseNull, including partial code for the LUQ decomposition (sparseNull doesn't use the full LUQ function; the "do_pivot" path is not used). I have never worked on an open source project before, so are there things I should add in?

When I have a little more time this week, I will create a test (input matrix) and show that the Python version produces the same nullspace as the MATLAB function. I will add full annotation and comments. I currently use both numpy and scipy; is that an issue (I can try rewriting it to use only one package)? The current implementation is probably not optimal, as I convert between sparse and dense matrices a few times because I couldn't get it to work otherwise. I hope that is okay (and otherwise you or others might have some ideas).

Once I have verified it really is working in every instance, I will upload for review!

As a side note, I notice we do not use Chan et al. 2017's LLC method for the loop law constraints. Interestingly, my MATLAB code for addLoopLawConstraints differs from the one on the COBRA GitHub. I am getting fairly bad performance solving the MILP with the full set of constraints (for the same model that I can solve in MATLAB with the reduced constraints from the LLC method). Are there any plans to add the LLC method to cobrapy?

cdiener commented 1 month ago

Hi @dagl1, that sounds great and I can help you with setting up the PR when you are done. The basic workflow would be:

  1. Fork the cobrapy repo
  2. Create a new branch from devel with a descriptive name, for instance feature/faster_nullspace
  3. Check out the branch, add your new code to src/cobra/util/array.py, and substitute the existing nullspace function with yours
  4. Add your new tests to tests/test_util/test_array.py (see the sketch after this list)
  5. Check that everything works on your branch with pytest
  6. Format your code with black and isort
  7. Push the branch to your fork on GitHub and open a PR into devel on the cobrapy repo
  8. After that I can jump in and will do the review and leave comments
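As a rough illustration of step 4, a minimal test could check that whatever basis the new function returns actually annihilates the stoichiometric matrix (the model choice and tolerances here are just placeholders):

```python
import numpy as np
from cobra.io import load_model
from cobra.util.array import create_stoichiometric_matrix, nullspace

def test_nullspace_spans_null_space_of_s():
    model = load_model("textbook")  # small bundled E. coli core model
    s_matrix = create_stoichiometric_matrix(model)
    basis = nullspace(s_matrix)
    # every column of the basis must satisfy S @ n = 0 up to numerical noise
    assert basis.shape[0] == s_matrix.shape[1]
    assert np.allclose(s_matrix @ basis, 0.0, atol=1e-9)
```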

No, we have no plans for LLC, though it would be cool to have. It would probably be only the C_NS version, because EFMs are currently not in cobrapy either. We mostly provide the CycleFree method for fast loopless solutions, but that has its own issues, as pointed out in the Chan paper as well (though it is likely faster than LLC because it only uses LPs).

dagl1 commented 1 month ago

Perfect, thanks a lot for that workflow, will implement it at a later date!

Regarding CycleFree: if I understand correctly, this method does not actually make solutions containing a loop infeasible; it is just a kind of post-processing. Or is there a smart way of using it to also remove loops during a run? I am attempting to calculate the highest average expression of a metabolic task (set input/output exchange reactions, all other boundary reactions closed), and without constraints making loops infeasible, the solution will just be loops of highly expressed reactions plus a small set needed for the actual completion of the metabolic task. In such cases, trying to find a loopless flux distribution afterwards is counterintuitive, since such a distribution might take a completely different path in the first place (with a different average score). CycleFree cannot be used in any such manner, right?

cdiener commented 1 month ago

Yes you are right, that wouldn't work because the loops themselves go into your objective. How would you go about maximizing that anyway? Because the absolute value trick (min |v_f - v_r| = min v_f + v_r) only works for minimization but not for maximization...

dagl1 commented 1 month ago

Hmm, so there are a few different things. I create an irreversible model so that all fluxes are >= 0 and use a binary variable Yi denoting activity:

Vi - Yi * eps >= 0 (with eps = 0.001 or whatever is computationally doable), so that if Vi is smaller than epsilon then Yi must be 0
Vi - Yi * M <= 0 (forces Yi to be active if Vi is active)
Yforward + Ybackward <= 1 to link each reaction pair (so if rxn1 was reversible, rxn1_f and rxn1_b are limited to at most one active direction)
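In cobrapy terms, a sketch of those indicator constraints might look like the following; eps, big_m, and all variable/constraint names are illustrative choices rather than anything from cobrapy itself, and it relies on the forward/reverse solver variables that cobrapy already exposes per reaction:

```python
def add_activity_indicators(model, eps=1e-3, big_m=1000.0):
    """Attach binary activity indicators y_f/y_b to every reaction (sketch)."""
    new = []
    for rxn in model.reactions:
        y_f = model.problem.Variable(f"y_f_{rxn.id}", type="binary")
        y_b = model.problem.Variable(f"y_b_{rxn.id}", type="binary")
        v_f, v_b = rxn.forward_variable, rxn.reverse_variable
        new += [
            y_f, y_b,
            # v - y*eps >= 0: a flux below eps forces its indicator to 0
            model.problem.Constraint(v_f - eps * y_f, lb=0, name=f"act_lb_f_{rxn.id}"),
            model.problem.Constraint(v_b - eps * y_b, lb=0, name=f"act_lb_b_{rxn.id}"),
            # v - y*M <= 0: any positive flux forces its indicator to 1
            model.problem.Constraint(v_f - big_m * y_f, ub=0, name=f"act_ub_f_{rxn.id}"),
            model.problem.Constraint(v_b - big_m * y_b, ub=0, name=f"act_ub_b_{rxn.id}"),
            # at most one direction of the pair may be active
            model.problem.Constraint(y_f + y_b, ub=1, name=f"pair_{rxn.id}"),
        ]
    model.add_cons_vars(new)
```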

I was also trying to figure out a way to get a new continuous variable Z such that it is at least the absolute value of V, i.e. Z >= -V and Z >= V. While this works, it unfortunately also allows Z to be above 0 when V = 0, which then kind of destroys its usefulness for actually linking it to Yi (any ideas on how I can force it to be 0 if V = 0?).

For the actual implementation of the average, the linearization is a bit... difficult, but someone had described a transformation (Charnes-Cooper) and I have implemented the set of constraints for it from: https://math.stackexchange.com/questions/3500493/doing-a-charnes-cooper-transformation-with-matrices-and-an-zero-one-constraint
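For readers of the thread, the core Charnes-Cooper idea, sketched here in its general form rather than the exact constraints from that post: to maximize a ratio c^T v / d^T v over a feasible set A v <= b, v >= 0 with d^T v > 0, introduce t = 1 / (d^T v) and w = t * v, then solve the linear program

max c^T w subject to A w - b t <= 0, d^T w = 1, w >= 0, t >= 0

and recover the original solution as v = w / t. The binary indicator variables are what make this more delicate, which is what the linked Stack Exchange answer addresses.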

cdiener commented 1 month ago

Okay, that makes sense and is the common strategy I have seen for absolute value maximization. For Z you could simply use Z = v_f + v_b, because at least one of them is zero due to your indicator constraint, no? That would work with zero fluxes too, and cobrapy problems are always in standard form anyway (irreversible).

Also, for large problems I would try the new hybrid solver, which has a faster MIP solver, unless you are already using CPLEX or Gurobi.
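If the installed cobrapy/optlang versions ship that interface, switching a model over is a single assignment (the string below is just the interface name referenced above; it will raise if the interface is not installed):

```python
model.solver = "hybrid"  # switch the model's optlang interface, if available
```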

dagl1 commented 1 month ago

I already run everything in Gurobi (computationally I get pretty bad performance in general when a task definition is complex). Regarding the Z variable, I don't think I follow: the first constraint doesn't actually work with v < 0 (as it would never be feasible), so if v can be anywhere between -1000 and 1000, then the first constraint cannot be used.

Vi - Yi * eps >= 0
Vi - Yi * M <= 0

So when introducing Z as Z >= V and Z >= -V, I don't actually have any insurance that V == 0 means Z = 0. If I were to add something like Z = V + Y, then Z would not be the absolute value (as V could be -1000). I might be misunderstanding you?

Either way, irreversible models seem the way to go in general, just because they are easier to work with, and the additional reactions don't seem to pose too many problems.

cdiener commented 1 month ago

So I would put everything in standard form: there are no negative v_i, just v_f,i and v_b,i, which are always >= 0. Then impose the big-M constraints on each individual forward and reverse flux, and keep the pair constraint as before, y_f,i + y_b,i <= 1. In that case |v_i| = |v_f,i - v_b,i| = v_f,i + v_b,i, because one of the two is strictly zero due to the previous constraint.
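To make that concrete, here is a hedged sketch of the resulting objective in cobrapy terms; expression_weight is a hypothetical per-reaction weight (e.g. an expression score), and the y_f + y_b <= 1 pair constraint from earlier must already be in place for v_f + v_b to equal |v|:

```python
def set_weighted_abs_flux_objective(model, expression_weight):
    """Maximize sum_i w_i * |v_i| with |v_i| = v_f,i + v_b,i (sketch)."""
    terms = [
        expression_weight[rxn.id] * (rxn.forward_variable + rxn.reverse_variable)
        for rxn in model.reactions
        if rxn.id in expression_weight
    ]
    model.objective = model.problem.Objective(sum(terms), direction="max")
```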