hackseq / 2016_project_2

Design a tool to optimize the parameters of any command line tool
MIT License

Write abstract #6

sjackman opened this issue 7 years ago

sjackman commented 7 years ago

Hi all. We have to write our scientific and lay abstracts. See https://github.com/hackseq/October_2016/issues/76

Please write a few sentences about your work here. I'll put them all together. Thanks!

GlastonburyC commented 7 years ago

I tested grid-search-based parameter optimization with the Python tool ParOpt. I wrote a wrapper function in Python that runs ABySS and returns the single value we were interested in optimizing, N50. First I optimized k, the k-mer length, and then extended the method to simultaneous optimization of four parameters: k, s, n, and l. Finally, to assess the quality of the assemblies, I generated QUAST reports for all optimization runs, using the reference genome to assess the completeness of each genome assembly.
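
A minimal sketch of such a wrapper in Python (not the actual ParOpt code; the abyss-pe invocation, the interleaved-reads input file, and the `<name>-stats.tab` column layout are assumptions):

```python
import csv
import subprocess

def run_abyss(k, s, reads="500k.fq"):
    """Run an ABySS assembly for one (k, s) pair and return its N50.

    Sketch only: the abyss-pe call and the <name>-stats.tab layout are
    assumptions; adjust them to your ABySS version. Each run gets its own
    name= prefix so files left over from earlier runs are not reused.
    """
    name = f"k{int(k)}s{int(s)}"
    subprocess.run(
        ["abyss-pe", f"name={name}", f"k={int(k)}", f"s={int(s)}", f"in={reads}"],
        check=True,
    )
    with open(f"{name}-stats.tab") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    return int(rows[-1]["N50"])  # last row: the final assembly stage

# Grid search over k with s held fixed, as in the first run described above.
best_k, best_n50 = max(((k, run_abyss(k, s=200)) for k in range(20, 64, 4)),
                       key=lambda t: t[1])
print("best k:", best_k, "N50:", best_n50)
```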

drveera commented 7 years ago

I wrote an R script to plot a heat map of the output of the Spearmint optimization.
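
The R script itself isn't shown in the thread; an analogous sketch in Python with matplotlib, using placeholder values in place of the real Spearmint output, would look like:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder (k, s, N50) triples; replace with the actual Spearmint output.
results = np.array([(24, 200, 14000), (28, 200, 19356),
                    (32, 400, 17500), (35, 650, 18200)])

ks, ss = np.unique(results[:, 0]), np.unique(results[:, 1])
grid = np.full((len(ss), len(ks)), np.nan)
for k, s, n50 in results:
    grid[np.searchsorted(ss, s), np.searchsorted(ks, k)] = n50

plt.imshow(grid, origin="lower", aspect="auto")
plt.xticks(range(len(ks)), ks)
plt.yticks(range(len(ss)), ss)
plt.xlabel("k")
plt.ylabel("s")
plt.colorbar(label="N50")
plt.title("N50 over the (k, s) points evaluated by Spearmint")
plt.show()
```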

hyounesy commented 7 years ago

I used optimx, an R library that wraps several (14) optimizers. To drive ABySS from R, I wrote an R wrapper function that calls abyss. Of all the methods, nmkb seemed to have the best performance. That was not surprising, as nmkb is a box-constrained variation of Nelder-Mead suited to this kind of derivative-free parameter optimization.
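
For comparison, roughly the same idea expressed in Python rather than optimx (an analogous sketch only, reusing the assumed run_abyss wrapper from above and scipy's Nelder-Mead):

```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    """Negative N50 for a candidate (k, s), to be minimised.

    k and s are clamped and rounded inside the objective because ABySS only
    accepts integer values while Nelder-Mead explores a continuous domain.
    """
    k = int(round(np.clip(x[0], 16, 64)))
    s = int(round(np.clip(x[1], 100, 1000)))
    return -run_abyss(k, s)  # run_abyss: the assumed wrapper sketched earlier

res = minimize(objective, x0=[32, 300], method="Nelder-Mead",
               options={"maxfev": 50, "xatol": 1, "fatol": 1})
print("best (k, s):", res.x, "N50:", -res.fun)
```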

lisabang commented 7 years ago

I used Jupyter notebook widgets with matplotlib to visualize how each iteration of the optimization changes the parameters to maximize the target metric. An example .ipynb file is available.
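
A minimal sketch of that widget idea (the notebook itself isn't reproduced here; the trace below is a placeholder):

```python
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider

# Placeholder optimization trace: one (k, s, N50) tuple per iteration.
history = [(24, 200, 14000), (30, 250, 16500), (28, 200, 19356)]

def show(iteration=0):
    """Plot N50 up to the chosen iteration and report its parameters."""
    n50s = [h[2] for h in history[: iteration + 1]]
    plt.plot(range(len(n50s)), n50s, marker="o")
    plt.xlabel("iteration")
    plt.ylabel("N50")
    k, s, n50 = history[iteration]
    plt.title(f"iteration {iteration}: k={k}, s={s}, N50={n50}")
    plt.show()

interact(show, iteration=IntSlider(min=0, max=len(history) - 1, step=1, value=0))
```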

jasgrewal commented 7 years ago

Genetic algorithms (GAs) are a family of adaptive algorithms that have been applied to many optimization problems for the identification of Pareto frontiers (the multi-metric optimal solution space of an evaluated function). GAs are also attractive because they remain tractable for global optimization of multiple parameters simultaneously, even in the presence of many local optima. The algorithm chosen from this family for our project was the Non-dominated Sorting Genetic Algorithm II (NSGA-II), which can optimize multiple parameters while considering several output metrics simultaneously. Initially proposed in 2002 by K. Deb et al., it converges rapidly. I used the implementation in 'mco', an R package that applies genetic algorithms to multi-objective, multi-parameter optimization (https://cran.r-project.org/web/packages/mco/index.html).

I searched for an optimal solution space within pre-determined ranges of the discrete-valued ABySS parameters k and s, with N50 and L50 as the output metrics. NSGA-II did converge quite rapidly in both single- and multi-parameter (and -metric) optimizations. Interestingly, the best N50/L50 pair of (23699, 4) was found when k and s were optimized simultaneously and both N50 and L50 were being optimized for. The solution set in this case, expressed as (k, s), was {(35, 650), (35, 690)}; it was notable that the GA proposed multiple solutions converging to the same output metric here, whereas in every other parameter-versus-metric scenario it converged to a single solution. Furthermore, when the search was still distributed over k and s but only the N50 metric was being optimized, the best N50 found was 16461, about 30% lower than the best overall.

An additional outcome of this exercise was a demonstration of the utility of GA optimization for discrete-valued parameters. Many of the optimization algorithms and tools evaluated in this project (including mco) assume a continuous range, so we added a rounding step when building the ABySS query to mimic a discretized solution space. It was interesting to see that NSGA-II still converged to a fixed-valued optimal solution space, showing potential for its application to discrete optimization problems.
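
The part of this that translates most directly to code is the objective NSGA-II evaluates: it receives a continuous (k, s) proposal, rounds it to the discrete values ABySS accepts, and returns the two metrics to be minimised. A minimal Python sketch (the run above used R's mco::nsga2; run_abyss_stats is a hypothetical wrapper returning both N50 and L50):

```python
def evaluate(x):
    """Two-objective evaluation for an NSGA-II-style optimizer.

    x is a continuous (k, s) proposal; rounding mimics the discretization
    step described above. Both objectives are returned as values to be
    minimised, so N50 is negated.
    """
    k, s = int(round(x[0])), int(round(x[1]))
    n50, l50 = run_abyss_stats(k, s)  # hypothetical wrapper returning (N50, L50)
    return [-n50, l50]

# An NSGA-II implementation (mco::nsga2 in R here, or a Python GA library)
# would call evaluate() on each member of its population, with bounds such
# as k in [20, 40] and s in [200, 1000], and return the Pareto front.
```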

jasgrewal commented 7 years ago

Apologies for making it so long; I clubbed together my introduction and some of the interesting observations too. Cheers, guys, it was great working with all of you!

lfunderburk commented 7 years ago

I tested a given data set (500k.fq ?) via Bayesian optimization using Spearmint and MongoDB. I collaborated with Shaun Jackman on this subpart and extended abyss.py to optimize two parameters, k and s. After 20 iterations, a maximum N50 of 19356 was found at k=28, s=200.
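
A sketch of how the objective plugs into Spearmint, assuming the HIPS/Spearmint convention of a main(job_id, params) entry point declared in a config.json that lists k and s as integer variables with their ranges (this is not the project's abyss.py, only an illustration):

```python
# Illustration only, not the project's abyss.py. Assumes the HIPS/Spearmint
# convention: Spearmint calls main(job_id, params) on each iteration, with the
# k and s variables and their ranges declared in an accompanying config.json.

def main(job_id, params):
    k = int(params["k"][0])
    s = int(params["s"][0])
    n50 = run_abyss(k, s)  # assumed ABySS wrapper sketched earlier in the thread
    # Spearmint minimises the returned value, so negate N50 to maximise it.
    return -n50
```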

yxblee commented 7 years ago

I've implemented functions in Python to call ABySS and initiate assembly runs for the OPAL (OPtimization ALgorithm) program, a Python/Cython framework built around a black-box solver written in C/C++ and based on the NOMAD (Nonlinear Optimization by Mesh Adaptive Direct search) algorithm. This allowed me to test optimization of the k value, with N50 (to be maximized) as the measure and 200k as the dataset.

While the NOMAD algorithm behind OPAL does allow multiple continuous and discrete parameters, I have yet to try them. For reporting the results, I still need to find a way to record and display all of the program's inputs versus outputs as it runs, since it currently writes out only the final optimized value of the given measure to a single file, which is overwritten after each run. I also need to add some conditional checks so that the program neither halts indefinitely when an ABySS make fails due to invalid inputs, nor skips over files already created by previous runs.
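
A generic way to cover both reporting points (this is not the OPAL API, just a sketch of the wrapper it would call): append every evaluated input/output pair to a CSV log, and return a penalty value instead of propagating a failed or hung ABySS make.

```python
import csv
import subprocess

LOG = "opal_runs.csv"  # hypothetical log file: one row per evaluated k
PENALTY = 0            # N50 reported when an assembly run fails

def evaluate_k(k, s=200, reads="200k.fq"):
    """Run ABySS for one k, append (k, N50) to the log, and guard failures.

    Generic sketch: run_abyss is the assumed wrapper sketched earlier; adding
    a timeout= to its subprocess call would stop a failed make from hanging
    indefinitely.
    """
    try:
        n50 = run_abyss(k, s, reads=reads)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        n50 = PENALTY  # invalid inputs or a hung/failed make
    with open(LOG, "a", newline="") as fh:
        csv.writer(fh).writerow([k, n50])
    return n50
```

Giving each run its own name= prefix in the ABySS call, as in the wrapper sketch near the top of the thread, would also keep it from picking up files created by previous runs.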