hackseq / 2016_project_2

Design a tool to optimize the parameters of any command line tool
MIT License

Updates #1

Open sjackman opened 8 years ago

sjackman commented 8 years ago

Hi, all! My name is Shaun, and I'm leading project 2 (optimization). I am a bioinformatics PhD student, a developer of Linuxbrew and ABySS, an open-source programmer, an avid traveller, a singer and an experimental amateur chef. Please introduce yourself here, and I look forward to meeting you in October!

sjackman commented 8 years ago

There's some existing conversation about this project over at this GitHub issue: https://github.com/hackseq/hackseq_projects_2016/issues/9

sjackman commented 8 years ago

If you're new to the command line and Git, check out these particularly good interactive training sites:

- http://rik.smith-unna.com/command_line_bootcamp/
- https://try.github.io
- https://github.github.com/on-demand/intro-to-github/

lisabang commented 7 years ago

Hey, I'm Lisa. I'm a bioinformatics analyst with Geisinger Health System in Pennsylvania, but I'm originally from California. See you in October.

P.S. Should we use Docker images? Is that mandated?

sjackman commented 7 years ago

Hi, Lisa! We can use Docker images, but no, they're not mandated. We'll be using Amazon for compute services, so we'll need an Amazon Machine Image, and we could use a Docker image. We also have access to the ORCA Docker HPC service at the BC Cancer Agency Genome Sciences Centre.

I'm a developer on Linuxbrew, and I like to use Linuxbrew to install bioinformatics formulae from Homebrew-Science, so one likely Docker image to use is linuxbrew/linuxbrew.

sjackman commented 7 years ago

Hi, all! @GlastonburyC @lisabang @hyounesy @yxblee

Here's the rough plan for Hackseq:

  1. Identify functions and data sets to optimize
    1. a toy function that can be optimized in minutes for development
    2. a real genome assembly problem that can be optimized in a few hours
  2. Evaluate optimizers for usability and speed
    1. OPAL by @dpo — Optimization of algorithms with OPAL
    2. Spearmint by @mgelbart — Predictive Entropy Search for Multi-objective Bayesian Optimization
    3. ParOpt by @sseemayer, which uses scipy.optimize
    4. Possibly Python packages, like scikit-optimize
    5. Possibly R packages (of which there is a long list)
  3. Generate a report of the results of the optimization
    1. Generate plots of target metric vs parameters
    2. Draw the Pareto frontier of the target metric and a second metric of interest (contiguity and correctness), likely in R using ggplot
  4. Write a short report of our experience
    1. Post on GitHub pages
    2. Possibly submit to a preprint server (bioRxiv, PeerJ, Figshare)
    3. Possibly submit for peer review, such as F1000Research Hackathons

There's a whole bunch of other optimizers discussed over at https://github.com/hackseq/hackseq_projects_2016/issues/9
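
To make step 2 of the plan concrete, here is a minimal sketch of how any command-line tool could be wrapped as a Python objective function that these optimizers can call. The tool name, flag, and output parsing below are placeholders for illustration, not a real interface:

```python
import subprocess

def objective(k):
    """Run a hypothetical command-line tool with parameter k and return
    a value to minimize (the metric is negated, since optimizers minimize)."""
    # Placeholder command; a real wrapper would run the actual tool here.
    result = subprocess.run(
        ["mytool", "--k", str(int(round(k))), "--out", "run.txt"],
        capture_output=True, text=True, check=True)
    # Placeholder parsing: assume the tool prints "metric=<value>" on stdout.
    metric = float(result.stdout.split("metric=")[1].split()[0])
    return -metric  # maximize the metric by minimizing its negative
```

Each optimizer in step 2 would then be pointed at `objective` plus a parameter range, so the evaluations stay comparable across tools.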

GlastonburyC commented 7 years ago

Hi @sjackman, this seems like a very thorough and well-thought-out plan. As I will be working remotely, I think it would be helpful to delegate tasks somewhat in advance.

I would be very happy to evaluate the optimizers for usability and speed and write these results up on GitHub. This seems like a very useful piece of work to have published, so I do think we should aim for a preprint, and I would happily contribute to the writing of such.

Regarding optimizing genome assembly, perhaps you could suggest a tool you use for genome assembly that we can focus on, and the parameters of interest? Additionally, it would be good if you could supply a dummy dataset - perhaps a very small unassembled genome - that we can test optimization against. I work in RNA-seq, so I'm unfamiliar with 1) genome assembly tools and 2) parameters that indicate a 'good' assembly.

Cheers.

sjackman commented 7 years ago

Great to have you on board, Craig! I was thinking of assigning each participant one optimizer to evaluate. You can always come back for more if you finish that one. Are you familiar with Python or R, and would you like to pick one of the optimizers to evaluate?

> I would be very happy to evaluate the optimizers for usability and speed and write these results up on GitHub. This seems like a very useful piece of work to have published, so I do think we should aim for a preprint, and I would happily contribute to the writing of such.

Great! I'm hoping to assign one person to continuously develop the report throughout the weekend, and then we can all contribute to writing and editing on the last day.

> Regarding optimizing genome assembly, perhaps you could suggest a tool you use for genome assembly that we can focus on, and the parameters of interest?

I am a developer of the assembler ABySS, so I plan to use it for this hackathon because I'm most familiar with it, but I expect the knowledge gained to be broadly applicable to any assembler.

> Additionally, it would be good if you could supply a dummy dataset - perhaps a very small unassembled genome - that we can test optimization against.

Here's a hands-on tutorial/activity that I developed on genome assembly. Exercise 3 shows how to assemble a small data set, a human bacterial artificial chromosome (BAC), using ABySS. It will be one of the data sets that we use.

http://sjackman.ca/abyss-activity/#exercise-3-assemble-the-reads-into-contigs-using-abyss

> I work in RNA-seq, so I'm unfamiliar with 1) genome assembly tools and 2) parameters that indicate a 'good' assembly.

The key metrics are contiguity (1) and correctness (2 through 4).

  1. contiguity (NG50, N50) and aligned contiguity (NGA50, NA50)
  2. number of breakpoints when aligned to the reference as a proxy for misassemblies
  3. number of mismatched nucleotides when aligned to the reference, expressed as a quality Q = -10 * log10(mismatches / total_aligned)
  4. completeness, number of reference bases covered by aligned contigs divided by number of reference bases

We'll be optimizing the NG50 metric (or NGA50 with a reference genome) and reporting (but probably not optimizing) the correctness metrics. The primary parameter we'll be optimizing is k (a fundamental parameter of nearly all de Bruijn graph assemblers), and there are a bunch of other parameters that we can play with (typically thresholds related to expected coverage).
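
As a tiny worked example of metric 3 above (with made-up numbers, just to show the scale of the values):

```python
import math

# Illustrative alignment counts, not from a real assembly.
mismatches = 500
total_aligned = 1_000_000

# Phred-style quality of the consensus: Q = -10 * log10(error rate).
Q = -10 * math.log10(mismatches / total_aligned)
print(round(Q, 1))  # 33.0, i.e. about one mismatch per 2000 aligned bases
```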

GlastonburyC commented 7 years ago

Hi @sjackman, thanks! That's all very helpful.

I'm familiar with R and Python - I would like to explore ParOpt, as it seems quite generalisable.

sjackman commented 7 years ago

Excellent. I believe it uses scipy.optimize and Nelder-Mead, also known as the amoeba method. It's all yours!
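
For anyone who hasn't used it, here's a minimal Nelder-Mead run via scipy.optimize on a made-up one-dimensional function (standing in for the "toy function that can be optimized in minutes" from the plan):

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(x):
    """A made-up smooth function with its minimum at x = 32."""
    return (x[0] - 32.0) ** 2

result = minimize(toy_objective, x0=np.array([20.0]), method="Nelder-Mead")
print(result.x, result.fun)  # roughly [32.], ~0.0
```

Swapping in a wrapper that runs a real assembly in place of `toy_objective` is the idea.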

sjackman commented 7 years ago

Dominique Orban @dpo wrote…

I'd like to introduce two colleagues of mine who specialize in optimization, with a special interest in derivative-free and parameter optimization: Margherita and Philippe. So everybody can find out who everybody is, here are their home pages (I don't believe they are currently GitHub users):

Margherita: http://web.math.unifi.it/users/porcelli
Philippe: http://perso.fundp.ac.be/~phtoint/toint.html

Margherita and Philippe are the authors of a nonsmooth optimization package called BFO (for Brute-Force Optimization), a new kid on the block that we believe would also fit well into your testing schedule. BFO is written in Matlab and is built on principles that share similarities with the optimizer underlying OPAL:

https://sites.google.com/site/bfocode

BFO is being actively worked on and I am told it should support surrogate models and categorical variables in the near future (if it doesn't already). One of its distinctive features is that it will self-tune on the test problems given.

If you don't have access to Matlab, one of us will be happy to run the tests; we just have to engineer communication between Matlab and your tools.

We're hoping you'll include BFO in your tests. We're available to answer any questions and to contribute as we're always looking for interesting new applications of parameter optimization.

sjackman commented 7 years ago

Hi, Dominique @dpo, Margherita, Philippe. Yes, please do post this e-mail on the GitHub issue.

We won't have Matlab on the AWS instances that we'll be using, so we won't be able to test BFO ourselves. If you'd like to test it yourselves on the datasets that we'll be using, I'd be happy to share the datasets and answer any questions that you have. Most info will be available on the public GitHub repo. There will be some private Slack conversations as well, and I'd be happy to invite you to the Slack. I prefer GitHub for communication, though, so we'll be using that mostly.

Cheers, Shaun

lisabang commented 7 years ago

I'm more proficient in Python than R; I'm on Slack as lisabang. I'll spend tomorrow traveling to Vancouver - looking forward to it.

daisieh commented 7 years ago

Hello! I am interested in helping out with this project. I'm a general software engineer and phylogeneticist, so I work a lot with both ends of this sort of thing, as a user and a coder of tools.

lisabang commented 7 years ago

@GlastonburyC I'd be interested in helping out with ParOpt

GlastonburyC commented 7 years ago

@sjackman What's a sensible range to optimize k over with respect to N50?

sjackman commented 7 years ago

The read length for this data set is 50 nucleotides per read. A reasonable range for k is 16 to 50.

GlastonburyC commented 7 years ago

@sjackman Great. Could you make available the 10x smaller genome assembly problem you mentioned on Slack?

sjackman commented 7 years ago

The 10-fold smaller dataset is on ORCA at /home/sjackman/sjackman/HS0674/200k.fq.gz

GlastonburyC commented 7 years ago

Hi - I cannot push to this project as I don't have permission: `fatal: unable to access 'https://github.com/hackseq/2016_project_2.git/': The requested URL returned error: 403`

GlastonburyC commented 7 years ago

I've implemented grid-search optimisation for ABySS using ParOpt as a first step.
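
For anyone following along, here's a minimal sketch of what such a grid search might look like; the `abyss-pe` invocation, file names, and stats parsing are assumptions for illustration, not the actual ParOpt implementation:

```python
import subprocess

def n50_for_k(k):
    """Run ABySS with word size k and return the N50 of the assembly.
    The command line and the stats-file layout are placeholders; adjust
    them to the real read files and ABySS version."""
    subprocess.run(f"abyss-pe name=k{k} k={k} in='reads.fq.gz'",
                   shell=True, check=True)
    # Assume a tab-separated stats file with an 'N50' column.
    with open(f"k{k}-stats.tab") as f:
        header = f.readline().rstrip("\n").split("\t")
        last = f.readlines()[-1].rstrip("\n").split("\t")
    return int(last[header.index("N50")])

# Coarse grid over the k range suggested above (the reads are 50 bp).
best_k = max(range(16, 51, 2), key=n50_for_k)
print("best k:", best_k)
```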

GlastonburyC commented 7 years ago

As a next step, now that I have a value of k that is optimal, I need to compare N50 against the correctness metrics. I'm not sure which metrics are output directly by ABySS and which need to be generated by additional means (alignment). Ideas? 👍 @sjackman

sjackman commented 7 years ago

For smallish genomes, the go-to package for evaluating correctness is QUAST: http://quast.bioinf.spbau.ru

The toy data set is human, from chromosome 3. You can use this reference genome: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr3.fa.gz
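
Here's a rough sketch of how QUAST could be run against that reference and the correctness metrics pulled out afterwards; the file names and the `report.tsv` layout are assumptions to check against your QUAST version:

```python
import csv
import subprocess

# Run QUAST on one assembly with chr3 as the reference (paths are placeholders).
subprocess.run(
    ["quast.py", "k32-scaffolds.fa", "-R", "chr3.fa", "-o", "quast_k32"],
    check=True)

# QUAST writes a tab-separated report; pull out a few metrics by name.
with open("quast_k32/report.tsv") as f:
    report = dict(row[:2] for row in csv.reader(f, delimiter="\t"))
print(report.get("NGA50"), report.get("# misassemblies"))
```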

sjackman commented 7 years ago

Another option (rather than evaluating a second metric) would be to optimize multiple parameters. I'd suggest optimizing both k and s simultaneously. You could add n next if that goes well.

There's a description of the parameters here: https://github.com/bcgsc/abyss
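
If that goes well, a Bayesian optimizer such as scikit-optimize (mentioned in the plan above) could search both parameters at once. A self-contained sketch, with a made-up smooth function standing in for the assembler run so it can be executed on its own:

```python
from skopt import gp_minimize
from skopt.space import Integer

def assembly_score(k, s):
    """Stand-in for running abyss-pe with k=<k> s=<s> and parsing N50;
    a made-up function peaking at k=32, s=500 so the sketch is runnable."""
    return -((k - 32) ** 2 + ((s - 500) / 50.0) ** 2)

def objective(params):
    k, s = params
    return -assembly_score(k, s)  # gp_minimize minimizes, so negate the score

# Parameter ranges are illustrative; see the ABySS README for sensible values of s.
space = [Integer(16, 50, name="k"), Integer(200, 1000, name="s")]
result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best k, s:", result.x)
```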