Interesting. I have also done some work on general parameter analysis of R/Bioconductor packages as part of the VisRseq framework that might be applicable to this problem. I would be happy to provide further info if this might be of immediate use to you.
Hi, Hamid. Thanks for your interest. One approach I had considered was using R functions like `optim` and `optimize`. Which R optimization function did you use?
I had also considered implementing simple optimization methods like grid search, line search, twiddle, and amoeba (a.k.a. Nelder–Mead). However it's implemented, I want it to be usable as a command-line program.
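For illustration, here's a minimal grid-search sketch in Python; the `assemble` command, its flags, and the `N50:` line parsed from stdout are all hypothetical stand-ins for whatever tool and metric we end up targeting:

```python
import re
import subprocess
from itertools import product

def run_assembly(k, c):
    """Run the (hypothetical) assembler and parse the target metric from stdout."""
    out = subprocess.run(
        ["assemble", "-k", str(k), "-c", str(c), "reads.fq"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Assume the tool prints a line such as "N50: 12345".
    return int(re.search(r"N50:\s*(\d+)", out).group(1))

# Exhaustive grid search over two integer parameters.
grid = product(range(32, 97, 8), range(2, 5))
best = max(grid, key=lambda p: run_assembly(*p))
print("best (k, c):", best)
```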
Makes sense. For VisRseq (which is not command-line) I am currently using a simple randomized search. I was interested in visual parameter exploration, where a target metric does not necessarily exist. For example, for differential expression analysis tools (e.g. edgeR, DESeq, etc.), different parameters result in different genes being detected as differentially expressed, but it is not always clear which result is better. I am, however, considering experimenting with better sampling (e.g. Monte Carlo) to distribute the parameter samples in the areas where there is more variation in the results.
There are two existing tools for parameter optimization: Biopsy and ParOpt.
We definitely need to explore whether building on one of these existing tools fits our requirements.
There's also the possibility of building on other optimization systems, such as R's `optimize` family of functions or Python's `scipy.optimize`. I don't know whether either of these handles optimization of discrete variables / metrics.
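Neither is built for integer parameters out of the box; a common workaround is to optimize a continuous relaxation and round inside the objective. A sketch only, where `assembly_metric` is a toy placeholder for the real, expensive run:

```python
from scipy.optimize import minimize

def assembly_metric(k):
    """Placeholder for an expensive run returning the metric we want to maximize."""
    return -((k - 64) ** 2)  # toy stand-in with its optimum near k = 64

def objective(x):
    k = int(round(x[0]))        # round the relaxed, continuous variable to an integer
    return -assembly_metric(k)  # minimize the negative to maximize the metric

res = minimize(objective, x0=[40.0], method="Nelder-Mead")
print("best k found:", int(round(res.x[0])))
```

Rounding makes the surface piecewise constant, so Nelder–Mead can stall on the flat steps; that is essentially the discrete-variable concern.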
Sounds like a cool idea to have generic parameter optimization for cluster jobs :+1: !
ParOpt is a fairly simple script based on `scipy.optimize` that is limited to jobs running synchronously on one machine, where the objective function to be optimized can be parsed from the command line's STDOUT. It does not support asynchronous execution or clusters at all (except for maybe jury-rigging some shell scripts to submit/evaluate jobs).
Thinking about it, generic optimization could actually benefit greatly from parallelization. Since these generic methods cannot rely on gradients and are typically very slow to run, parallelizing the problem would make the wait times much shorter. Parallelization is definitely possible for grid search, since that is embarrassingly parallel, but a quick Google search gave me this article on parallel Nelder–Mead too.
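For the embarrassingly parallel case, something like this toy sketch (the `evaluate` placeholder stands in for launching one assembly job) already covers grid search on a single multi-core machine:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def evaluate(params):
    """Placeholder: would launch one assembly job and return its metric."""
    k, c = params
    return params, -((k - 64) ** 2) - c  # toy stand-in for the real metric

if __name__ == "__main__":
    grid = list(product(range(32, 97, 8), range(2, 5)))
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = dict(pool.map(evaluate, grid))
    best = max(results, key=results.get)
    print("best (k, c):", best, "metric:", results[best])
```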
It might be possible to run the function evaluations using DRMAA to abstract between different scheduling systems.
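A rough sketch of what that might look like with the `drmaa` Python bindings, assuming a DRMAA-capable scheduler (e.g. SGE) is configured; the wrapper script name and its arguments are made up:

```python
import drmaa

s = drmaa.Session()
s.initialize()

jt = s.createJobTemplate()
jt.remoteCommand = "./assemble.sh"   # hypothetical wrapper script for one evaluation
jt.args = ["-k", "64"]

job_id = s.runJob(jt)
info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
print("job", job_id, "finished with exit status", info.exitStatus)

s.deleteJobTemplate(jt)
s.exit()
```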
I'd be happy to answer any questions you might have about ParOpt :smile:
Great! Thanks for your response, Stefan! I'll likely have more questions for you. Any chance you'd like to come to Vancouver, Canada for Hackseq Oct 15–17 and/or ASHG 2016 following it? 😄 You too, Richard @blahah! Come join us! Three ASHG sponsored travel awards ($1,500) are available!
Both your hackathon and conference sound amazing and I could visit my sister that lives in Redmond, WA so your offer is doubly tempting. On the other hand, I'll have just started my postdoc around then so I'll pass this time. I might come back on that offer next year, though!
😢 Best of luck with your postdoc!
We're planning to have a Docker image with a bunch of bioinformatics software preinstalled running on machines at the BC Cancer Agency Genome Sciences Centre during the Hackathon. Which bioinformatics software do you plan to use for your project? In particular, is there any software that you plan to use that is not already listed here? http://www.bcgsc.ca/services/orca
I'll be using the two tools listed above, Biopsy and ParOpt, as well as Python scipy and R.
Yes, I'm talking to myself. I'm leading a project and also helping to organize the hackathon. =)
This project seems like a cool area of work; running and testing is always a big problem in programming.
What features would you want beyond what Biopsy and ParOpt already offer? @sseemayer seems to suggest adding parallelization for cluster jobs. @sjackman refers to adding a different algorithm using R.
I need to set aside some time to test out Biopsy and ParOpt, and to write up the requirements for this project. Off the top of my head:
Other suggestions welcome!
I imagine if there are multiple parameters there will be trade-offs? Perhaps a feature of the live updating report would allow you to display this information. Potentially a choice of what to optimize may be beneficial too, as well as joint optimization of all parameters.
What is the compute time of one run? Important to know to help determine which optimization strategy to use.
I would suggest using mystic, which includes a few of the scipy.optimize functions in a parallelized implementation.
If I expand the scope to optimizing for multiple metrics, which is certainly relevant to me and to genome assembly, then yes, I expect a trade-off and a frontier of best possible solutions. Thanks for the link to mystic. I'll look into it. The compute time is roughly ten minutes to one day, depending on the data set (bacteria to human).
Excellent. I have applied for this project so I'll keep thinking about it.
Great! Glad to hear it. My plan for the hackathon is to optimize for only one metric, but display the Pareto frontier of multiple (unoptimized) metrics. Is `mystic` intended for optimizing multiple metrics?
Not that I know of. I was focusing on already implemented parallelized optimization functions.
@sjackman I worked a lot on multiple metrics when doing biopsy and assemblotron. Happy to discuss in live chat or elaborate on here if it helps.
How did you optimize parameters? So far I've only ever used line and grid search, executed manually.
Can anyone recommend an R package to identify the Pareto frontier of a set of 2D points? And to ggplot it, though that should be easy.
Very easy to do. Just get the Euclidean distance of all points, and those equidistant with the furthest make up the Pareto frontier. You can adjust stringency for frontier membership according to your problem.
Euclidean distance from the origin? In this example of three points, (0,1), (2,0) and (3,0), both (0,1) and (3,0) with distances 1 and 3 are in the frontier, but (2,0) with distance 2 is not on the frontier, because it's worse than (3,0) on both metrics. How is the Euclidean distance helpful?
@sjackman for optimisation I used tabu search and a genetic algorithm - chosen because they perform well with expensive cost functions.
Euclidean distance from the optimum. I should have said 'equidistant with the closest', but actually that's wrong, or at least an oversimplification. Here's an old figure I never used, showing the basic Euclidean-distance dimension reduction for two transcriptome assembly metrics, and also showing that you can assign weights to the dimensions to get a scaled or weighted score.
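In code, that weighted-distance score is just the following (a sketch; the metric values, targets, and weights are invented):

```python
import math

def weighted_score(metrics, targets, weights):
    """Weighted Euclidean distance from the best-known (target) value in each dimension."""
    return math.sqrt(sum(w * (m - t) ** 2
                         for m, t, w in zip(metrics, targets, weights)))

# e.g. two assembly metrics, the second weighted twice as heavily; lower is better
print(weighted_score(metrics=(0.7, 0.9), targets=(1.0, 1.0), weights=(1.0, 2.0)))
```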
Algorithm to get the Pareto frontier set for any number of dimensions is (I think):

1. Pick one of the dimensions, X.
2. Let `dominator` be the point with the best value for X.
3. Iterate over the points by descending X, calculating the `score` excluding the X dimension.
4. If a point has a better `score` than `dominator`, add it to the frontier set and set `dominator` to be that point.

If I'm thinking straight, you should end up with the Pareto frontier :)
oh, and I should explain:

> you can adjust stringency for frontier membership according to your problem
Often you don't necessarily care about small differences in score, so in step 4 of the algorithm, in addition to looking for points with a better score, you can also record points that have a score within some margin of `dominator`, but importantly don't set any such point to be the `dominator`. Those points can be said to have proximity to the frontier.
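For reference, here's a brute-force version of the same idea in Python. It uses the standard dominance check rather than the sort-by-X formulation above, and assumes every metric is to be maximized:

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point (all metrics maximized)."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

print(pareto_frontier([(0, 1), (2, 0), (3, 0)]))  # -> [(0, 1), (3, 0)]
```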
Great! Thanks for the detailed explanation, Richard!
@dpo Hi, Dominique. I'm helping to organize a bioinformatics hackathon, Hackseq, this October in Vancouver in the three days before the ASHG 2016 conference. I want to develop a command-line tool that optimizes the parameters of command-line bioinformatics software (particularly genome assembly) to maximize some objective metric (or often more than one, but we'll go with one for now) output by the tool. There have been quite a few suggestions of tools in the conversation above. Can you recommend any others?
@sjackman That's been one of my research projects in the past few years. A student of mine developed a tool named OPAL, which is basically a Python-based modeling language to describe algorithmic parameter optimization problems. The problem is then solved by means of a nonsmooth optimization solver named NOMAD, which offers convergence guarantees. It's able to treat continuous, integer and categorical parameters (categorical means discrete and without a natural ordering, e.g., "should I use strategy A, B or C?").
Technically, you can just let the nonsmooth optimization solver do its job but it's likely to be quite slow unless you provide it with some problem-specific knowledge that can be used to speed up the search (in optimization lingo, that's called a "surrogate model").
NOMAD evolves somewhat independently of OPAL, and I admit I haven't been updating OPAL recently. However, it's probably a small job to fix it up. I'm happy to help. Here are a few useful links, but feel free to ask any questions:
Great! Thanks, Dominique! I've got a couple of deadlines coming up, and I plan to work on this more actively in September. In the meantime, a couple of quick questions:
- Can it optimize a black-box program that takes parameters `-a`, `-b` and `-c` and produces a number on standard output?
- Can it evaluate those runs in parallel?

Yes on all counts. The idea is that we're optimizing an expensive objective (in terms of computational time or otherwise). The code can evaluate the objective at several locations in parallel. The number of such evaluations is determined by 1) the number of variables n in the problem, and 2) the strategy used to select search directions (fixed once and for all at the beginning); this is either n+1 or 2n.
Great! Sounds like a great fit. Any chance it integrates with an HPC scheduler to submit the jobs?
At the time we wrote the papers, OPAL worked with Sun Grid Engine, LSF and MPI. Parallelism can be exploited at different levels. That's explained in one of the papers I linked to above. If you don't have access to those journals, I can send them to you privately.
I've printed the "Optimization of Algorithms with OPAL" paper, and I'll read it this weekend! Thanks, Dominique.
I read the OPAL paper this weekend (while sitting on a lake) and it looks exactly like what I need. Thanks, Dominique.
Another optimization package that I stumbled on: https://github.com/HIPS/spearmint
> Spearmint is a software package to perform Bayesian optimization. The Software is designed to automatically run experiments (thus the code name spearmint) in a manner that iteratively adjusts a number of parameters so as to minimize some objective in as few runs as possible.
Glad you like OPAL. Let me know of any difficulties when you try it. I'm very happy to help.
Hi all, @sjackman alerted me to this thread. I co-developed Spearmint, which @sjackman posted above. This thread already contains a lot of good stuff, but here's my two cents.
This problem is tricky for many reasons. First, it goes by many names: black-box optimization, algorithm configuration, derivative-free global optimization of expensive functions, etc etc etc. This happened because it lands at the intersection of several fields: statistics (experimental design), optimization, machine learning...
I don't think there's a perfect tool out there... each one has its strengths and weaknesses. Maybe it's useful to make a list of questions about your particular problem. Here is what I always ask:

- How many parameters are you optimizing?
- What is the nature of these parameters... are they continuous or discrete or mixed? If discrete, are they integers or categorical?
- How slow is each run of the objective?
- Do you want to run jobs in parallel?
- Are you comfortable writing a bit of glue code, firing up a database, etc?
- Is your function noisy?
I think the choice of tool depends on the answers. For example, Spearmint works best for <100 parameters, works better on continuous parameters, is good for very slow functions, can do noisy, can do parallel, and requires a bit of effort on the user's part. It also handles some fancier cases like black-box constraints, which are probably not needed here. But one thing mentioned above which it does handle is integration with remote compute clusters like SGE, DRMAA, etc.
BTW, if you use Spearmint, I suggest grabbing the code on the `PESC` branch rather than `master`, because the latest stuff is in there and I've been too lazy to merge it into master. Actually, I'll also be pushing some more improvements soon.
Here are some other tools I know about
The last one is a startup in the Valley doing just this. They have a very nice clean API and might be free for educational use but they charge a lot of money in general.
It would be nice to construct a sort of "decision tree" for choosing a tool based on the specifics of your problem, but nobody has done this yet AFAIK. Here are a few thoughts in that direction though:
If your function evaluations are cheap, you can get away with simpler methods like `optimize`, Nelder-Mead, or genetic algorithms. These will not be as sparing in the use of function evaluations as, say, Bayesian optimization methods. OTOH, if your function is not that slow, heavyweight software like Spearmint may be overkill, leading to a big inefficiency if the optimizer itself takes much more time than the function evaluations.

(@dpo, I'm not sure if we've met, but I've certainly heard a lot about you from our mutual friend Sasha Aravkin. I had no idea you were working on this stuff!)
Thanks for the detailed response, Mike. Here are the answers to your questions:
> How many parameters are you optimizing?

A maximum of about 16 parameters. I can order them by priority to fit in whatever limit. The parameters that I'm most interested in number about 5. Even optimizing a single variable k would be useful. Doing it by hand is easy but tedious. Here's a typical list from ABySS: https://github.com/bcgsc/abyss#assembly-parameters

> What is the nature of these parameters... are they continuous or discrete or mixed? If discrete, are they integers or categorical?

Nearly all integer. Two continuous variables, though the response to these variables will not be smooth but steppy.

> How slow is each run of the objective?

Fifteen minutes for toy problems like bacteria, a day or two for typical problems like human, and a week for large problems like spruce.

> Do you want to run jobs in parallel?

Yes. Integration with a scheduler would be helpful.

> Are you comfortable writing a bit of glue code, firing up a database, etc?

Yes.
Thanks, Mike!
@sjackman Given your answers, I think Spearmint (or similar approaches) would be a reasonable fit. I could help out a bit. Spearmint deals with integers by pretending they are continuous variables, and then just rounding them at the end. This means it can do dumb things sometimes, but it's not horrible. There will probably be several good tools and it will come down to which one you find easier/convenient to use. Perhaps the cluster integration is a selling point of Spearmint.
Also, I forgot another important question (I edited my post above to add it to the list, for posterity): is your function noisy? In other words, if you run your code twice with the same parameters, will you get the same results? This is something that Spearmint handles naturally but will break some more classical optimization methods.
No. The function is deterministic. At least ABySS is. Other genome assembly programs may not be. I'm a developer of ABySS.
How about optimizing multiple metrics and finding the Pareto frontier? With genome assembly you want to optimize both the contiguity and the correctness of the assembly. There are other performance indicators, but those two are the most important. They can be blended into a single metric, called NGA50.
Ha! Funny that you asked that. This touches on my specific area of research, and is a rare case where Spearmint is way ahead of most alternatives. There are 3 approaches:

1. Blend the metrics into a single objective (for example, a weighted combination) and optimize that.
2. Treat it as a constrained problem: optimize one metric subject to constraints on the other(s).
3. Do true multi-objective optimization and look for the Pareto frontier directly.
Approach (1) means you can just use one of the techniques we've been talking about, but it is the least satisfying because you have to arbitrarily pick a weight for blending. For approach (2), well, sometimes your problem really is of this nature, and other times you are framing it this way when it's truly a multi-objective problem. In any case, Spearmint can deal with this if you're using that `PESC` branch of the repo I mentioned above. My thesis might be a good reference. For (3), my collaborators recently came out with code for this, which lives on yet another branch of the Spearmint repo. I haven't used this myself but I think it should be reasonably straightforward. Here's the paper.
Finally, once you get into this "multi-objective" territory, there is an additional question to ask yourself: do the different objectives come from the same code run, or do they come separately?

For most cases, the answer is the former -- yours too, probably? But if it's the latter then you can play some tricks to increase efficiency a lot. For example, imagine contiguity takes 30 seconds to measure and correctness takes 3 hours. Then you're much better off testing contiguity for many parameter values and zeroing in on promising regions of the space before investing the big resources in measuring correctness.
Great! For this hackathon project, I plan on doing a mixture of 1 and 2 for the actual optimization. For example, optimize contiguity, and then optimize correctness subject to the constraint that the contiguity can't decrease by more than 10% (for example). After the optimization I'll plot the Pareto front of all the runs that were performed, without actually trying to optimize the front. It would be nice to eventually get to 3 though.
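A crude sketch of that mixture of 1 and 2; the `assemble_and_evaluate` placeholder and its toy numbers are invented, and the real objective would of course be a full assembly run:

```python
def assemble_and_evaluate(params):
    """Placeholder: run the assembly and return (contiguity, correctness)."""
    k, = params
    return 1000 - abs(k - 64) * 5, 90 - abs(k - 48) * 0.5  # toy values

def constrained_objective(params, best_contiguity, tolerance=0.10):
    contiguity, correctness = assemble_and_evaluate(params)
    if contiguity < (1 - tolerance) * best_contiguity:
        return float("-inf")   # reject runs that give up too much contiguity
    return correctness         # otherwise maximize correctness

# First pass: find the best contiguity; second pass: maximize correctness under the constraint.
grid = range(32, 97, 8)
best_contig = max(assemble_and_evaluate((k,))[0] for k in grid)
best_k = max(grid, key=lambda k: constrained_objective((k,), best_contig))
print("best k under the contiguity constraint:", best_k)
```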
Do I get the different objectives from the same code run, or do they come separately?
All the metrics come from the same run, pretty much. Getting the genome assembly takes 95% or more of the time; measuring its contiguity and correctness (with independent programs) takes just a bit longer, and correctness does take longer to measure than contiguity.
Nice to meet you @mgelbart ! I've certainly heard about you from Sasha as well. Thanks for listing so many alternatives for parameter optimization. It's a hot topic.
NOMAD, the black-box optimizer hiding underneath OPAL, can also solve multi-objective problems, but that feature hasn't been interfaced in OPAL. It wouldn't be too difficult to do, I believe.
The main trick in NOMAD is to define a surrogate model that will speed up the optimization and somewhat expand the search to wider regions. The beauty of it is: anything can be a surrogate model. The surrogate need not be an "approximation" of the actual objective in the sense of numerical analysis. It should merely be a guide and indicate promising regions. There are examples in the papers I linked to above. In particular, any other optimizer listed above could be used as a surrogate. In that case, NOMAD will act as a safety net in case that optimizer fails to identify an improved guess, or fails for any other reason. The convergence guarantees will continue to apply.
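To make the surrogate idea concrete (this is only the general pattern, not NOMAD's or OPAL's actual interface): rank many candidate points with a cheap guide, then spend the expensive evaluations only on the most promising few. The functions below are toy placeholders.

```python
def cheap_surrogate(params):
    """Anything goes here: a fitted model, a downsampled run, a heuristic..."""
    k, = params
    return -abs(k - 60)  # toy guess at where the optimum might be

def expensive_objective(params):
    """Placeholder for the real, slow evaluation (e.g. a full assembly)."""
    k, = params
    return -((k - 64) ** 2)

candidates = [(k,) for k in range(16, 129, 4)]
# Rank all candidates with the surrogate, then evaluate only the top few for real.
promising = sorted(candidates, key=cheap_surrogate, reverse=True)[:5]
best = max(promising, key=expensive_objective)
print("best k found:", best[0])
```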
I'd like to introduce two colleagues of mine who specialize in optimization, with a special interest in derivative-free and parameter optimization: Margherita and Philippe. Just so everybody can find out who everybody is, I'm going to list their home pages (I don't believe they are currently GitHub users):
Margherita: http://web.math.unifi.it/users/porcelli
Philippe: http://perso.fundp.ac.be/~phtoint/toint.html
Margherita and Philippe both author a nonsmooth optimization package called BFO (for Brute-Force Optimization), a new kid on the block that we believe would also fit well in your testing schedule. BFO is written in Matlab and is built on principles that share similarities with the optimizer underlying OPAL:
https://sites.google.com/site/bfocode
BFO is being actively worked on and I am told it should support surrogate models and categorical variables in the near future (if it doesn't already). One of its distinctive features is that it will self-tune on the test problems given.
If you don't have access to Matlab, one of us will be happy to run the tests; we just have to engineer communication between Matlab and your tools.
We're hoping you'll include BFO in your tests. We're available to answer any questions and to contribute as we're always looking for interesting new applications of parameter optimization.
This issue has permanently moved to https://github.com/hackseq/2016_project_2/issues/1
Project: Given a command-line tool with a number of parameters and a target metric to be optimized, I want a tool that will run the program and find the values of those parameters that maximize the target metric. My particular use case is genome sequence assembly, which often has a variety of parameters related to the expected coverage of the reads and heuristics to remove read errors and collapse heterozygous variation. When I tackle that optimization, the process is manual and tedious: submitting jobs to a scheduler, rerunning failed jobs, inspecting outputs, tweaking parameters, and repeating. I want to design and implement a tool to automate that process and generate a report of the result.
Project Lead: Shaun Jackman / @sjackman / Graduate Student / BC Cancer Agency Genome Sciences Centre