iqbal-lab-org / pling

Plasmid analysis using rearrangement distances
MIT License

Using alternative solvers #14

Closed leoisl closed 11 months ago

leoisl commented 11 months ago

Hey there,

I could start another branch of development where I could evaluate the use of other solvers besides gurobi. It seems we have several other open source alternatives, much easier to install and with more permissive licenses, e.g.:

COIN-OR CBC: CBC is a part of the COIN-OR project, and it is one of the most reputable open-source solvers available. It supports reading input files in MPS and LP formats, which Gurobi also uses.

GLPK: It is also capable of solving ILP problems. GLPK can read files in LP and MPS formats.

SCIP: It is more than an ILP solver; it can handle a variety of problem types, including constraint integer programming. It can read LP and MPS formats as well.

LPSolve: It is another open-source solver that can solve linear programming (LP), mixed-integer linear programming (MILP), and other related problems. It also supports LP and MPS formats.

However, I don't know how you are planning the development of pling. I could do it, but to do so I'd ask you if you could enable forking of this repo so I can work on my own fork and don't pollute the branches here:

[image: screenshot]

Another possibility is to leave this for Daria to evaluate. Whatever you choose is fine by me!

iqbal-lab commented 11 months ago

I'd be delighted if you did this. I'd also be amazed if daria wasn't also delighted

leoisl commented 11 months ago

Just experimented with CBC:

So it seems CBC is an actual open-source alternative for us. Installation is as simple as conda install coin-or-cbc, but in both roundhound and pling pipelines, the user won't even need to do this (snakemake takes care of installing the required packages for each rule).

As this is a module that will be used by both RoundHound and Pling, I am thinking of creating a new repo, something like easy-dingII, that will basically wrap the original dingII and have a main python script that takes integerised plasmids and produces the output files, running the two ding scripts and the solver, which can be either CBC or gurobi. easy-dingII will be both pip and conda installable, so it will be very easy for me and Daria to plug it into the RoundHound/Pling pipelines. This should also fix the issue of both these tools "doing" the same thing but using different versions of ding and different solvers, which is not great for consistency.
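A minimal sketch of how the solver choice inside such a wrapper could look. The function name `solver_command` and its parameters are hypothetical, not part of dingII or pling. The CBC flags `-threads`, `-sec` and `-solve` mirror the invocation logged later in this thread; the `-solution` flag for CBC and the `gurobi_cl` parameters (`Threads`, `TimeLimit`, `ResultFile`) are assumptions about those command-line tools that should be checked against the installed versions.

```python
from __future__ import annotations

from pathlib import Path


def solver_command(solver: str, lp_file: Path, sol_file: Path,
                   threads: int = 1, time_limit: int = 100) -> list[str]:
    """Build the argv for solving one LP file (hypothetical helper).

    Only the command line is built here; running it is left to the caller,
    e.g. with subprocess.run(cmd, check=True).
    """
    if solver == "cbc":
        # Options are placed before -solve on the assumption that CBC
        # handles its arguments in order.
        return ["cbc", str(lp_file),
                "-threads", str(threads),
                "-sec", str(time_limit),
                "-solve",
                "-solution", str(sol_file)]  # assumed flag for writing the solution
    if solver == "gurobi":
        # Assumed gurobi_cl usage: Param=value pairs, then the model file.
        return ["gurobi_cl",
                f"Threads={threads}",
                f"TimeLimit={time_limit}",
                f"ResultFile={sol_file}",
                str(lp_file)]
    raise ValueError(f"unsupported solver: {solver!r}")
```

Keeping the solver behind a single function like this would let both pipelines switch solvers with one setting instead of each carrying its own rule.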

@iqbal-lab and @babayagaofficial you ok with this?

iqbal-lab commented 11 months ago

Sounds awesome. Worth checking results are the same and seeing how slow it is?

leoisl commented 11 months ago

yeah, thinking of running pling on one of Daria's lineages with CBC and comparing. Running RH would be a bit more work as we would first need to upgrade ding I to ding II

iqbal-lab commented 11 months ago

Yes let's ask @babayagaofficial

leoisl commented 11 months ago

Ok! Hey @babayagaofficial could I get a path on the cluster where you ran pling in one of your inc lineages? I think your inc lineages are not too large so that pling runs fast, and yet large enough to infer if CBC can actually be a good alternative to gurobi/cplex.

babayagaofficial commented 11 months ago

Sorry about the delay on this, but here's the directory with everything I have on the Inc plasmids:

/nfs/research/zi/daria/Inc_plasmids

it's not super well organised, so let me know if there's anything specific you're looking for

leoisl commented 11 months ago

thanks! should be alright, will try on this subset of 30 incy plasmids: /nfs/research/zi/daria/Inc_plasmids/fastas/incy_30... unless you think I should try on another set/subset

babayagaofficial commented 11 months ago

sounds reasonable to me!

leoisl commented 11 months ago

CBC and gurobi DCJ dists are identical:

[image: comparison of CBC and gurobi DCJ distances]

only issue is that in 1 of the 258 instances it errors out:

Welcome to the CBC MILP Solver 
Version: 2.10.10 
Build Date: Apr 19 2023 

command line - cbc data/incy_30_out/tmp_files/ding/ilp/CP057753.1~CP057733.1_gurobi.lp -solve -threads 1 -sec 100 (default strategy 1)
### ERROR: 1 duplicates in objective and matrix

ERROR: CoinLpIO::readLp, ### ERROR: 1 duplicates in objective and matrix

There were -1 errors on input
** Current model not valid
threads was changed from 0 to 1
seconds was changed from 1e+100 to 100
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.00

iqbal-lab commented 11 months ago

Weird. Wonder if it is reproducible

leoisl commented 11 months ago

it is, it always fails in this case. It is a problem with the input that ding generates

leoisl commented 11 months ago

GLPK also fails on the same instance, but at least it gives better logging:

GLPSOL--GLPK LP/MIP Solver 5.0
Parameter(s) specified in the command line:
 --lp data/incy_30_out_glpk/tmp_files/ding/ilp/CP057753.1~CP057733.1_gurobi.lp
 -w data/incy_30_out_glpk/tmp_files/ding/solutions/CP057753.1~CP057733.1.sol.tmp
 --tmlim 1000
Reading problem data from 'data/incy_30_out_glpk/tmp_files/ding/ilp/CP057753.1~CP057733.1_gurobi.lp'...
data/incy_30_out_glpk/tmp_files/ding/ilp/CP057753.1~CP057733.1_gurobi.lp:43: multiple use of variable 'x_1_2_1' not allowed
CPLEX LP file processing error

I think dingII is then building invalid LP inputs, but gurobi handles it anyway, while the other tools error out...

leoisl commented 11 months ago

With glpk, I could narrow down the issue to this line in the LP input:

c09.0: x_1_2_1 + x_1_2_1 - t_1_2_0 >= 0

This is not a valid constraint as the variable x_1_2_1 is repeated, but this is a valid constraint:

c09.0: 2 x_1_2_1 - t_1_2_0 >= 0

so this seems to be a dingII bug...
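To illustrate the rewrite from the invalid form to the valid one, here is a minimal sketch that merges repeated variables on the left-hand side of a single labelled constraint. It is not the fix that went into dingII; the function name `merge_duplicate_terms` is hypothetical, and it assumes constraints of the simple shape shown above (a `label:` prefix, linear terms, one comparison operator, a constant on the right).

```python
import re

# One linear term: optional sign, optional numeric coefficient, variable name.
TERM = re.compile(r"([+-]?)\s*(\d+(?:\.\d+)?)?\s*([A-Za-z_]\w*)")


def merge_duplicate_terms(constraint: str) -> str:
    """Sum the coefficients of repeated variables in one LP constraint line."""
    label, _, body = constraint.partition(":")
    op_match = re.search(r"<=|>=|=", body)
    if op_match is None:
        raise ValueError("no comparison operator found")
    lhs, op, rhs = body[:op_match.start()], op_match.group(0), body[op_match.end():]

    # dicts keep insertion order, so variables stay in first-seen order.
    coeffs: dict[str, float] = {}
    for sign, number, var in TERM.findall(lhs):
        value = float(number) if number else 1.0
        if sign == "-":
            value = -value
        coeffs[var] = coeffs.get(var, 0.0) + value

    parts = []
    for var, coeff in coeffs.items():
        magnitude = abs(coeff)
        term = var if magnitude == 1 else f"{magnitude:g} {var}"
        if not parts:
            parts.append(term if coeff >= 0 else f"- {term}")
        else:
            parts.append(f"{'-' if coeff < 0 else '+'} {term}")
    return f"{label.strip()}: {' '.join(parts)} {op} {rhs.strip()}"


print(merge_duplicate_terms("c09.0: x_1_2_1 + x_1_2_1 - t_1_2_0 >= 0"))
```

A post-processing step like this would only be a workaround on our side; fixing the generator in ding itself is still the cleaner route.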

leoisl commented 11 months ago

Bug also present in dingI

iqbal-lab commented 11 months ago

Raise a bug?

leoisl commented 11 months ago

Alternative solvers state

So it seems to me that cplex and gurobi are able to deal with this bug, while the other tools are stricter about the LP input and error out. However, cplex and gurobi have installation and licensing issues, while the others are very easy to install and run. To use the other tools, which I think is still worth it, we would need to either submit a bug report to ding and wait for them to fix it, or fix the bug ourselves. Annoyingly, both dingII and dingI are hosted on the Bielefeld gitlab instance, where I can't create an account because arbitrary emails are not allowed for sign-up (they might only accept their uni emails...). So I can't submit issues, or fork the code and try to fix the bug... We would need to contact them by mail to see what we could do here...

Pling/Roundhound situation

For a new evaluation in roundhound I'd need to compare two sets of plasmids. Pling would be great for this, because it was conceived to do this, however I am stuck with this solver issue. This also concerns pling usage in general. Currently there are 3 options:

  1. Try to use pling as it is now. I have gurobi installed locally with the free trial license, the only one I could get as EBI is not a recognised academic institution on the gurobi website. This license won't work as it has limitations on the model size, and it refused to run on some instances of the incy dataset. I can't use the gurobi installed on the EBI cluster because I am not authorised: Error 10009: Request denied: user 'leandro' not in authorized user list. A solution might be creating a ticket and asking systems to add me as an authorized user;
  2. Downgrade ding II to ding I and use cplex. This would definitely work for me, and it might be easier for some users, but then we would need to downgrade ding, which I am not sure is an option;
  3. Report or solve the ding II bug and use the open-source solvers. I am leaning towards this option. It will be more work, but I think it is worth it. If I am struggling with the commercial solvers, I think pling/RH users will as well.

leoisl commented 11 months ago

> Raise a bug?

Are you ok with going with option 3 then? I will contact Leonard Bohnenkämper by mail to see how we can proceed, with either them or us fixing the bug

babayagaofficial commented 11 months ago

Downgrade to ding I isn't an option for Pling, because ding I doesn't have min/max indel counting implemented, which I'll need eventually.

I think option 3 is the best one in the long run, since we want people to be able to comfortably use pling on their own. Do you have Leonard's email or should I pass it on to you?

leoisl commented 11 months ago

I think I have it. I first need to understand a few things I wasn't aware of before...

I am not up to date on the new features that ding II has over ding I. Is the min/max indel counting embedded in the LP input? gurobi output looks like this:

# Solution for model obj
# Objective value = 23
t_1_64_0 0
t_1_2_0 0
t_2_3_0 1
t_3_4_0 0
t_3_69_0 0
x_3_69_0 1
t_4_5_0 1
t_4_70_0 0
x_4_70_0 1
t_5_6_0 0
<... all other variable assignments>

The other LP solvers output similar stuff, but I am getting a large difference in the number of variable assignments between gurobi, cbc and glpk:

wc -l glpk.sol cbc.sol.txt gurobi.sol
 2010 glpk.sol
  227 cbc.sol.txt
  700 gurobi.sol

I am wondering if you know what you need and will need in the future from gurobi output. Is it just the objective value? Or do the variable assignments matter for some downstream process?

babayagaofficial commented 11 months ago

yup, min/max indel counting is embedded in the LP input

I think the variable assignments are relevant to reconstructing the matchings of duplicates and also to counting indels, but I'm not 100% sure how the output parsing works in ding, would have to double check

leoisl commented 11 months ago

Ok! With Leonard's bugfix to dingII, CBC/GLPK worked on the small dataset of 30 incy plasmids. I am currently opting to go forward with GLPK because it is more stable and established, and I know how to transform its output into a gurobi one. I am not so sure how to do this for the CBC output...

I should have a PR soon for this