hackseq / hackseq_projects_2016


Project 2: Design a tool to optimize the parameters of any command line tool #9

Open ttimbers opened 8 years ago

ttimbers commented 8 years ago

Project: Given a command line tool with a number of parameters and a target metric to be optimized, I want a tool that will run the program and find the parameter values that maximize that metric. My particular use case is genome sequence assembly, which often has a variety of parameters related to the expected coverage of the reads and heuristics to remove read errors and collapse heterozygous variation. When I tackle that optimization, the process is manual and tedious: submitting jobs to a scheduler, rerunning failed jobs, inspecting outputs, tweaking parameters, and repeating. I want to design and implement a tool to automate that process and generate a report of the result.

Project Lead: Shaun Jackman / @sjackman / Graduate Student / BC Cancer Agency Genome Sciences Centre

hyounesy commented 8 years ago

Interesting. I have also done some work on general parameter analysis of R/Bioconductor packages as part of the VisRseq framework that might be applicable to this problem. I'd be happy to provide further info if this might be of immediate use to you.

sjackman commented 8 years ago

Hi, Hamid. Thanks for your interest. One approach I had considered was using R functions like optim and optimize. Which R optimization function did you use?

I had also considered implementing simple optimization methods like grid search, line search, twiddle, and amoeba (a.k.a. Nelder–Mead). However it's implemented, I want it to be usable as a command-line program.
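As a rough sketch of that shape of tool: wrap the command-line program as an objective function and hand it to an off-the-shelf optimizer. The `assemble` program, its `-k`/`-c` options, and the metric-on-stdout convention below are hypothetical stand-ins, not any particular assembler's interface.

```python
# Sketch: treat a command-line tool as the objective for scipy's Nelder-Mead.
# `assemble` is a hypothetical program that prints the metric to maximize
# (e.g. NG50) as a single number on standard output.
import subprocess
from scipy.optimize import minimize

def objective(params):
    k, c = params
    result = subprocess.run(
        ["assemble", "-k", str(int(round(k))), "-c", str(c)],
        capture_output=True, text=True, check=True)
    # Negate: minimize() minimizes, but we want to maximize the metric.
    return -float(result.stdout.strip())

best = minimize(objective, x0=[64, 2.0], method="Nelder-Mead")
print("best parameters:", best.x, "metric:", -best.fun)
```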

hyounesy commented 8 years ago

Makes sense. For VisRseq (which is not command-line) I am currently using a simple randomized search. I was interested in visual parameter exploration where a target metric does not necessarily exist. For example, for differential expression analysis tools (e.g. edgeR, DESeq), different parameters result in different genes being detected as differentially expressed, but it is not always clear which result is better. I am, however, considering experimenting with better sampling (e.g. Monte Carlo) to concentrate the parameter samples in the areas where there is more variation in the results.

sjackman commented 8 years ago

There are two existing tools for parameter optimization: Biopsy and ParOpt.

We definitely need to explore whether building on one of these existing tools fits our requirements.

There's also the possibility of building on other optimization systems such as R's optimize family of functions or Python's scipy.optimize. I don't know whether either of these handles optimization of discrete variables / metrics.

sseemayer commented 8 years ago

Sounds like a cool idea to have generic parameter optimization for cluster jobs :+1: !

ParOpt is a fairly simple script based on scipy.optimize that is limited to jobs running synchronously on one machine, where the objective function to be optimized can be parsed from the command's STDOUT. It does not support asynchronous execution or clusters at all (except for maybe jury-rigging some shell scripts to submit/evaluate jobs).

Thinking about it, generic optimization could actually benefit greatly from parallelization. Since these generic methods cannot rely on gradients and are typically very slow to run, parallelizing the problem will make the wait times much shorter. Parallelization is definitely possible for grid search, since that is embarrassingly parallel, but a quick Google search turned up this article on parallel Nelder–Mead too.
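To illustrate the embarrassingly parallel case, here's a minimal sketch of a parallel grid search over a single parameter, again using a hypothetical `assemble -k` wrapper rather than any real tool's interface:

```python
# Sketch: embarrassingly parallel grid search over k using a process pool.
# `assemble` is a hypothetical program that prints the metric on stdout.
import subprocess
from concurrent.futures import ProcessPoolExecutor

def evaluate(k):
    result = subprocess.run(["assemble", "-k", str(k)],
                            capture_output=True, text=True, check=True)
    return k, float(result.stdout.strip())

if __name__ == "__main__":
    grid = range(32, 97, 8)  # candidate values of k
    with ProcessPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(evaluate, grid))
    best_k, best_score = max(scores, key=lambda kv: kv[1])
    print("best k:", best_k, "metric:", best_score)
```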

It might be possible to run the function evaluations using DRMAA to abstract between different scheduling systems.
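For instance, a rough sketch with the drmaa Python bindings might look like the following; the `evaluate.sh` script (which would run one parameter set and report the metric) is hypothetical, and the job-template details depend on the scheduler:

```python
# Sketch: submit one function evaluation through DRMAA, which abstracts over
# schedulers such as SGE, Slurm, and LSF. `evaluate.sh` is a hypothetical
# wrapper that runs the tool with the given parameters.
import drmaa

with drmaa.Session() as session:
    template = session.createJobTemplate()
    template.remoteCommand = "./evaluate.sh"
    template.args = ["-k", "64"]
    job_id = session.runJob(template)
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("job", job_id, "exit status:", info.exitStatus)
    session.deleteJobTemplate(template)
```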

I'd be happy to answer any questions you might have about ParOpt :smile:

sjackman commented 8 years ago

Great! Thanks for your response, Stefan! I'll likely have more questions for you. Any chance you'd like to come to Vancouver, Canada for Hackseq Oct 15–17 and/or ASHG 2016 following it? 😄 You too, Richard @blahah! Come join us! Three ASHG sponsored travel awards ($1,500) are available!

sseemayer commented 8 years ago

Both your hackathon and conference sound amazing, and I could visit my sister who lives in Redmond, WA, so your offer is doubly tempting. On the other hand, I'll have just started my postdoc around then, so I'll pass this time. I might come back on that offer next year, though!

sjackman commented 8 years ago

😢 Best of luck with your postdoc!

sjackman commented 8 years ago

We're planning to have a Docker image with a bunch of bioinformatics software preinstalled running on machines at the BC Cancer Agency Genome Sciences Centre during the hackathon. Which bioinformatics software do you plan to use for your project? In particular, is there any software that you plan to use that is not already listed here? http://www.bcgsc.ca/services/orca

sjackman commented 8 years ago

I'll be using the two tools listed above, Biopsy and ParOpt, as well as Python scipy and R.

Yes, I'm talking to myself. I'm leading a project and also helping to organize the hackathon. =)

whitecat commented 8 years ago

This project seems like a cool area of work; running and testing is always a big problem in programming.

What features would you want beyond what Biopsy and ParOpt already offer? @sseemayer seems to suggest adding parallelization for cluster jobs. @sjackman refers to adding a different algorithm using R.

sjackman commented 8 years ago

I need to set aside some time to test out Biopsy and ParOpt, and to write up the requirements for this project. Off the top of my head:

Other suggestions welcome!

GlastonburyC commented 8 years ago

I imagine if there are multiple parameters there will be trade-offs? Perhaps a feature of the live updating report would allow you to display this information. Potentially a choice of what to optimize may be beneficial too, as well as joint optimization of all parameters.

What is the compute time of one run? That's important to know when deciding which optimization strategy to use.

GlastonburyC commented 8 years ago

I would suggest using mystic, which includes a few of the scipy.optimize functions in a parallelized implementation.

sjackman commented 8 years ago

If I expand the scope to optimizing for multiple metrics, which is certainly relevant to me and genome assembly, then yes, I expect a trade-off, and a frontier of best possible solutions. Thanks for the link to mystic. I'll look into it. The compute time is roughly ten minutes to one day, depending on the data set (bacteria to human).

GlastonburyC commented 8 years ago

Excellent. I have applied for this project so I'll keep thinking about it.

sjackman commented 8 years ago

Great! Glad to hear it. My plan for the hackathon is to optimize for only one metric, but display the Pareto frontier of multiple (unoptimized) metrics. Is mystic intended for optimizing multiple variables?

GlastonburyC commented 8 years ago

Not that I know of. I was focusing on already implemented parallelized optimization functions.

blahah commented 8 years ago

@sjackman I worked a lot on multiple metrics when doing biopsy and assemblotron. Happy to discuss in live chat or elaborate on here if it helps.

sjackman commented 8 years ago

How did you optimize parameters? So far I've only ever used line and grid search, executed manually.

sjackman commented 8 years ago

Can anyone recommend an R package to identify the Pareto frontier of a set of 2D points? (And ggplot it, but that should be easy.)

> very easy to do. Just get euclidean distance of all points and those equidistant with the furthest make up the pareto frontier. you can adjust stringency for frontier membership according to your problem.

Euclidean distance from the origin? In this example of three points, (0,1), (2,0) and (3,0), both (0,1) and (3,0), with distances 1 and 3, are on the frontier, but (2,0), with distance 2, is not, because it's worse than (3,0) on both metrics. How is the Euclidean distance helpful?

blahah commented 8 years ago

@sjackman for optimisation I used tabu search and a genetic algorithm - chosen because they perform well with expensive cost functions.

Euclidean distance from the optimum. I should have said 'equidistant with the closest', but actually that's wrong or at least an oversimplification. Here's an old figure I never used, showing the basic Euclidean distance dimension reduction for two transcriptome assembly metrics and also showing that you can assign weights to the dimensions to get a scaled or weighted score:

[Figure: assembly_metrics_euclidean_distance]

Algorithm to get the Pareto frontier set for any number of dimensions is (I think):

  1. add the point with the best value for each dimension to the frontier set
  2. choose any dimension and order the points by this dimension descending (let's say X)
  3. let dominator be the point with the best value for X. Iterate over the points by descending X, calculating the score excluding the X dimension.
  4. each time you find a point with a better score than dominator, add it to the frontier set and set dominator to be that point

If I'm thinking straight, you should end up with the Pareto frontier :)
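For the common case of two metrics that are both being maximized, a minimal sketch of the standard sweep (sort by one metric, keep each point that improves the other) is below; this is the textbook 2D version rather than a direct transcription of the n-dimensional procedure above:

```python
# Sketch: Pareto frontier of 2D points, where larger is better for both metrics.
# Sort by the first metric descending (breaking ties by the second), then keep
# each point that improves on the best second metric seen so far.
def pareto_frontier(points):
    frontier = []
    best_y = float("-inf")
    for x, y in sorted(points, key=lambda p: (p[0], p[1]), reverse=True):
        if y > best_y:
            frontier.append((x, y))
            best_y = y
    return frontier

# The example from earlier in the thread: (2,0) is dominated by (3,0).
print(pareto_frontier([(0, 1), (2, 0), (3, 0)]))  # [(3, 0), (0, 1)]
```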

blahah commented 8 years ago

oh, and I should explain:

> you can adjust stringency for frontier membership according to your problem

Often you don't necessarily care about small differences in score, so in step 4 of the algorithm in addition to looking for points with a better score you can also record points that have a score within some margin of dominator, but importantly don't set any such point to be the dominator. Those points can be said to have proximity to the frontier.

sjackman commented 8 years ago

Great! Thanks for the detailed explanation, Richard!

larskotthoff commented 8 years ago

There are some tools that do this already:

mlrMBO isn't released as a package yet, but it should be able to interface easily with other packages that allow submitting and managing cluster jobs.

sjackman commented 8 years ago

@dpo Hi, Dominique. I'm helping to organize a bioinformatics hackathon, Hackseq, in October in Vancouver in the three days before the ASHG 2016 conference. I want to develop a command line tool that optimizes the parameters of command line bioinformatics software (particularly genome assembly) to maximize some objective metric (often more than one, but we'll go with one for now) output by the tool. There have been quite a few suggestions of tools in the conversation above. Can you recommend any others?

dpo commented 8 years ago

@sjackman That's been one of my research projects in the past few years. A student of mine developed a tool named OPAL, which is basically a Python-based modeling language to describe algorithmic parameter optimization problems. The problem is then solved by means of a nonsmooth optimization solver named NOMAD, which offers convergence guarantees. It's able to treat continuous, integer and categorical parameters (categorical means discrete and without a natural ordering, e.g., "should I use strategy A, B or C?").

Technically, you can just let the nonsmooth optimization solver do its job but it's likely to be quite slow unless you provide it with some problem-specific knowledge that can be used to speed up the search (in optimization lingo, that's called a "surrogate model").

NOMAD evolves somewhat independently of OPAL, and I admit I haven't been updating OPAL recently. However, it's probably a small job to fix it up. I'm happy to help. Here are a few useful links, but feel free to ask any questions:

sjackman commented 8 years ago

Great! Thanks, Dominique! I've got a couple of deadlines coming up, and I plan to work on this more actively in September. In the meantime, a couple of quick questions:

  1. Can the function to be optimized be a command line program that takes for example arguments -a, -b and -c and produces a number on standard output?
  2. The program can take anywhere from minutes to many hours to run. Is that okay?
  3. Can many evaluations of the function be run in parallel?

dpo commented 8 years ago

Yes on all counts. The idea is that we're optimizing an expensive objective (in terms of computational time or otherwise). The code can evaluate the objective at several locations in parallel. The number of such evaluations is determined by 1) the number of variables n in the problem, and 2) the strategy used to select search directions (fixed once and for all at the beginning); this is either n+1 or 2n.

sjackman commented 8 years ago

Great! Sounds like a great fit. Any chance it integrates with an HPC scheduler to submit the jobs?

dpo commented 8 years ago

At the time we wrote the papers, OPAL worked with Sun Grid Engine, LSF and MPI. Parallelism can be exploited at different levels. That's explained in one of the papers I linked to above. If you don't have access to those journals, I can send them to you privately.

sjackman commented 8 years ago

I've printed the "Optimization of Algorithms with OPAL" paper, and I'll read it this weekend! Thanks, Dominique.

sjackman commented 8 years ago

I read the OPAL paper this weekend (while sitting on a lake) and it looks exactly like what I need. Thanks, Dominique.

sjackman commented 8 years ago

Another optimization package that I stumbled on: https://github.com/HIPS/spearmint

Spearmint is a software package to perform Bayesian optimization. The Software is designed to automatically run experiments (thus the code name spearmint) in a manner that iteratively adjusts a number of parameters so as to minimize some objective in as few runs as possible.
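For context on what using it involves: a Spearmint experiment pairs a JSON config (declaring the parameters and their ranges) with a Python file that exposes a `main(job_id, params)` function returning the value to minimize. A rough sketch of the objective side, modeled on the examples in that repository (the `assemble -k` wrapper is a hypothetical stand-in, and details may differ between branches):

```python
# Sketch of a Spearmint-style objective file. Spearmint calls main() with a
# job id and a dict of parameter values chosen by the Bayesian optimizer;
# main() returns the number to minimize. `assemble` is a hypothetical
# assembler wrapper that prints the metric to maximize on stdout.
import subprocess

def main(job_id, params):
    k = int(params["k"][0])
    result = subprocess.run(["assemble", "-k", str(k)],
                            capture_output=True, text=True, check=True)
    # Spearmint minimizes, so negate the metric we want to maximize.
    return -float(result.stdout.strip())
```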

dpo commented 8 years ago

Glad you like OPAL. Let me know of any difficulties when you try it. I'm very happy to help.

mgelbart commented 8 years ago

Hi all, @sjackman alerted me to this thread. I co-developed Spearmint, which @sjackman posted above. This thread already contains a lot of good stuff, but here's my two cents.

This problem is tricky for many reasons. First, it goes by many names: black-box optimization, algorithm configuration, derivative-free global optimization of expensive functions, etc etc etc. This happened because it lands at the intersection of several fields: statistics (experimental design), optimization, machine learning...

I don't think there's a perfect tool out there... each one has its strengths and weaknesses. Maybe it's useful to make a list of questions about your particular problem. Here is what I always ask:

  - How many parameters are you optimizing?
  - What is the nature of these parameters... are they continuous or discrete or mixed? If discrete, are they integers or categorical?
  - How slow is each run of the objective?
  - Do you want to run jobs in parallel?
  - Are you comfortable writing a bit of glue code, firing up a database, etc?

I think the choice of tool depends on the answers. For example, Spearmint works best for <100 parameters, works better on continuous parameters, is good for very slow functions, can do noisy, can do parallel, and requires a bit of effort on the user's part. It also handles some fancier cases like black-box constraints, which are probably not needed here. But one thing mentioned above which it does handle is integration with remote compute clusters like SGE, DRMAA, etc.

BTW, if you use Spearmint, I suggest grabbing the code from the PESC branch rather than master, because the latest stuff is in there and I've been too lazy to merge it into master. Actually, I'll also be pushing some more improvements soon.

Here are some other tools I know about

The last one is a startup in the Valley doing just this. They have a very nice clean API and might be free for educational use but they charge a lot of money in general.

It would be nice to construct a sort of "decision tree" for choosing a tool based on the specifics of your problem, but nobody has done this yet AFAIK. Here are a few thoughts in that direction though:

mgelbart commented 8 years ago

(@dpo, I'm not sure if we've met, but I've certainly heard a lot about you from our mutual friend Sasha Aravkin. I had no idea you were working on this stuff!)

sjackman commented 8 years ago

Thanks for the detailed response, Mike. Here are the answers to your questions:

> How many parameters are you optimizing?

A maximum of about 16 parameters. I can order them by priority to fit in whatever limit. The parameters that I'm most interested in number about 5. Even optimizing a single variable k would be useful. Doing it by hand is easy but tedious. Here's a typical list from ABySS: https://github.com/bcgsc/abyss#assembly-parameters

> What is the nature of these parameters... are they continuous or discrete or mixed? If discrete, are they integers or categorical?

Nearly all integer. Two continuous variables, though the response to these variables will not be smooth but steppy.

> How slow is each run of the objective?

Fifteen minutes for toy problems like bacteria, a day or two for typical problems like human, and a week for large problems like spruce.

> Do you want to run jobs in parallel?

Yes. Integration with a scheduler would be helpful.

> Are you comfortable writing a bit of glue code, firing up a database, etc?

Yes.

Thanks, Mike!

mgelbart commented 8 years ago

@sjackman Given your answers, I think Spearmint (or similar approaches) would be a reasonable fit. I could help out a bit. Spearmint deals with integers by pretending they are continuous variables, and then just rounding them at the end. This means it can do dumb things sometimes, but it's not horrible. There will probably be several good tools and it will come down to which one you find easier/convenient to use. Perhaps the cluster integration is a selling point of Spearmint.

Also, I forgot another important question (I edited my post above to add it to the list, for posterity): is your function noisy? In other words, if you run your code twice with the same parameters, will you get the same results? This is something that Spearmint handles naturally but will break some more classical optimization methods.

sjackman commented 8 years ago

No. The function is deterministic. At least ABySS is. Other genome assembly programs may not be. I'm a developer of ABySS.

sjackman commented 8 years ago

How about optimizing multiple metrics and finding the Pareto frontier? With genome assembly you want to optimize both the contiguity and the correctness of the assembly. There are other performance indicators, but those two are the most important. They can be blended into a single metric, called NGA50.

mgelbart commented 8 years ago

Ha! Funny that you asked that. This touches on my specific area of research, and is a rare case where Spearmint is way ahead of most alternatives. There are 3 approaches:

  1. blend everything into a single objective
  2. frame it as a constrained problem, e.g., maximize contiguity such that correctness is at least X
  3. do a true Pareto front optimization

Approach (1) means you can just use one of the techniques we've been talking about, but it is the least satisfying because you have to arbitrarily pick a weight for blending. For approach (2), well sometimes your problem really is of this nature and other times you are framing it this way when it's truly a multi-objective problem. In any case, Spearmint can deal with this if you're using that PESC branch of the repo I mentioned above. My thesis might be a good reference. For (3), my collaborators recently came out with code for this, which lives on yet another branch of the Spearmint repo. I haven't used this myself but I think it should be reasonably straightforward. Here's the paper.

Finally, once you get into this "multi-objective" territory, there is an additional question to ask yourself: do the different objectives come from the same code run, or do they come separately?

For most cases, the answer is the former -- yours too, probably? But if it's the latter then you can play some tricks to increase efficiency a lot. For example, imagine contiguity takes 30 seconds to measure and correctness takes 3 hours. Then you're much better off testing contiguity for many parameter values and zeroing in on promising regions of the space before investing the big resources in measuring correctness.

sjackman commented 8 years ago

Great! For this hackathon project, I plan on doing a mixture of 1 and 2 for the actual optimization: for example, optimize contiguity, and then optimize correctness subject to the constraint that the contiguity can't decrease by more than (say) 10%. After the optimization I'll plot the Pareto front of all the runs that were run, without actually trying to optimize the front. It would be nice to eventually get to 3, though.

> Do I get the different objectives from the same code run, or do they come separately?

All the metrics come from the same run, pretty much. Getting the genome assembly takes 95% or more of the time, and measuring its contiguity and correctness (with independent programs) takes just a bit longer; correctness does take longer to measure than contiguity.

dpo commented 8 years ago

Nice to meet you @mgelbart ! I've certainly heard about you from Sasha as well. Thanks for listing so many alternatives for parameter optimization. It's a hot topic.

NOMAD, the black-box optimizer hiding underneath OPAL, can also solve multi-objective problems, but that feature hasn't been exposed through OPAL. It wouldn't be too difficult to do, I believe.

The main trick in NOMAD is to define a surrogate model that will speed up the optimization and somewhat expand the search to wider regions. The beauty of it is: anything can be a surrogate model. The surrogate need not be an "approximation" of the actual objective in the sense of numerical analysis. It should merely be a guide and indicate promising regions. There are examples in the papers I linked to above. In particular, any other optimizer listed above could be used as a surrogate. In that case, NOMAD will act as a safety net in case that optimizer fails to identify an improved guess, or fails for any other reason. The convergence guarantees will continue to apply.

dpo commented 8 years ago

I'd like to introduce two colleagues of mine who specialize in derivative-free and parameter optimization: Margherita and Philippe. Just so everybody can find out who everybody is, I'm going to list their home pages (I don't believe they are currently GitHub users):

Margherita: http://web.math.unifi.it/users/porcelli
Philippe: http://perso.fundp.ac.be/~phtoint/toint.html

Margherita and Philippe are the authors of a nonsmooth optimization package called BFO (for Brute-Force Optimization), a new kid on the block that we believe would also fit well into your testing schedule. BFO is written in Matlab and is built on principles that share similarities with the optimizer underlying OPAL:

https://sites.google.com/site/bfocode

BFO is being actively worked on and I am told it should support surrogate models and categorical variables in the near future (if it doesn't already). One of its distinctive features is that it will self-tune on the test problems given.

If you don't have access to Matlab, one of us will be happy to run the tests; we just have to engineer communication between Matlab and your tools.

We're hoping you'll include BFO in your tests. We're available to answer any questions and to contribute as we're always looking for interesting new applications of parameter optimization.

sjackman commented 8 years ago

This issue has permanently moved to https://github.com/hackseq/2016_project_2/issues/1