scikit-bio / scikit-bio

scikit-bio: a community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources.
https://scikit.bio
BSD 3-Clause "New" or "Revised" License

Codon Optimization module #701

Open AndreaEdwards opened 9 years ago

AndreaEdwards commented 9 years ago

Hi all, I noticed that there is no module, class, or function for calculating a codon-optimized sequence.

I would like to help by contributing code for this calculator, but I would need some guidance. I have found some useful tools in BioPython and a very old SynBio Python library that someone posted on Bitbucket in 2009 (https://bitbucket.org/chapmanb/synbio/src/tip/SynBio/). The code for calculating codon-optimized sequences, which is not usable in its current state, is here (https://bitbucket.org/chapmanb/synbio/src/7b1b3a972b7ed9e6b5bfb081c1c19b4a6b4410c2/SynBio/Codons/Optimize.py?at=default). From looking through this library, there seem to be many scripts that would be really useful for the FORGE project, including barcoded plate tracking and database schemas, but the lack of documentation leaves the library unusable. Aside from emailing the author, which I will do, does anyone have any advice on how to go about using this code?

For codon optimization, we would need the following input:

  1. amino acid sequence of target protein
  2. chassis strain (such as E. coli) to host expression of the target gene

From here, we would choose a method for codon optimization. For example, we could use codon frequency matching ("codon harmonization") as the codon sampling strategy. Roughly, this means looking at how the native mRNA uses codons and mimicking that pattern in the target species: a codon that is rare in the native organism should be replaced with one that is rare in the target. The rationale is that some rare codons may help the protein fold properly.
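To make the harmonization idea concrete, here is a minimal sketch in plain Python. It is not taken from SynBio, BioPython, or scikit-bio. It assumes the native coding sequence is available (the amino acid sequence alone does not say which codons the native organism used) and that codon usage tables for both organisms are supplied as dicts of codon counts. The function names `relative_usage` and `harmonize` are hypothetical, and the usage numbers are made-up toy values rather than real organism data.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def relative_usage(usage):
    """Convert codon counts to fractions within each synonymous family."""
    totals = {}
    for codon, aa in CODON_TO_AA.items():
        totals[aa] = totals.get(aa, 0.0) + usage.get(codon, 0.0)
    return {codon: usage.get(codon, 0.0) / totals[aa]
            for codon, aa in CODON_TO_AA.items() if totals[aa] > 0}


def harmonize(native_cds, native_usage, host_usage):
    """Swap each codon for the synonymous host codon whose relative
    usage is closest to the original codon's usage in the native organism."""
    if len(native_cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    native_rel = relative_usage(native_usage)
    host_rel = relative_usage(host_usage)
    out = []
    for i in range(0, len(native_cds), 3):
        codon = native_cds[i:i + 3].upper()
        aa = CODON_TO_AA[codon]
        # Only consider synonymous codons the host is actually seen to use.
        candidates = [c for c, a in CODON_TO_AA.items()
                      if a == aa and host_rel.get(c, 0.0) > 0]
        if codon not in native_rel or not candidates:
            out.append(codon)  # no usage data for this family: keep the codon
            continue
        target = native_rel[codon]
        out.append(min(candidates, key=lambda c: abs(host_rel[c] - target)))
    return "".join(out)


def translate(cds):
    """Translate a coding sequence with the standard genetic code."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


# Toy usage tables (illustrative counts only, NOT real organism data).
native_usage = {"ATG": 10, "CTG": 70, "TTA": 20, "CTA": 10, "AAA": 80, "AAG": 20}
host_usage = {"ATG": 10, "CTG": 10, "TTA": 70, "CTA": 20, "AAA": 25, "AAG": 75}

native_cds = "ATGCTGCTAAAA"  # Met-Leu-Leu-Lys
harmonized = harmonize(native_cds, native_usage, host_usage)
print(harmonized)  # ATGTTACTGAAG
```

In this toy case the common native leucine codon (CTG) becomes the common host codon (TTA), the rare native one (CTA) becomes the rare host one (CTG), and the encoded protein is unchanged. A real implementation would also need complete usage tables for both organisms and probably checks for things like restriction sites, which this sketch ignores.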

Any feedback would be helpful.

Thanks, Andrea

gregcaporaso commented 9 years ago

Hi @AndreaEdwards. Thanks for the question and your interest in contributing to scikit-bio!

This is definitely useful functionality, but I'm not sure whether it makes more sense for it to be in scikit-bio or to be a stand-alone package that you develop and that depends on scikit-bio. It'd be a little easier for us to decide if we had some example code to look at. Would you be interested in starting to work on it, and then pointing us at the code? It'd be relatively easy to adapt it for scikit-bio or prepare it as a stand-alone package at that point, so I don't think that would add effort.