Add padding to short sequences

jakebeal commented 2 years ago

For the 2022 distribution, add padding to all short sequences.

Consider using @eyesmo's randomSeqWithConstraints function, which can generate random DNA without specific restriction sites, homopolymers, GC/AT stretches, etc.

eyesmo commented 2 years ago

I wrote these methods about a year ago in this Colab notebook, before Friendzymes started collaborating in earnest with Poly. The notebook doesn't explicitly have a genRandomSeqWithConstraints function (I wrote a different notebook with a less powerful genRandomSeqWithConstraints function a couple of years ago), but its functions make generating random sequences with constraints (and, for that matter, reverse translating protein sequences with constraints) very easy.

The functions take as input either a single sequence or a dataframe of sequences. Some of them return dicts giving the locations and sequences of problematic subsequences, including rare codons, homopolymers, forbidden restriction sites, and regions of excessively high/low GC content (a single dict for a single sequence, or a dataframe of dicts for a dataframe of sequences). Others parse through the sequence(s) and replace the problematic subsequences. These 'sequence cleaner' functions can operate recursively: they check whether any changes they've made have introduced new problematic subsequences and correct those as well, until no subsequences match the 'problematic' criteria. They're also written in Python, so they should be easy for Distro/Software team members proficient with Python to modify and adapt.
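To give a rough idea of what the 'finder' style of function looks like, here is a minimal sketch. It is not the notebook's code: the function names, parameters, and default thresholds (max_run, window, low, high) are all made up for illustration, and it only handles a single sequence string.

```python
import re

# Hypothetical helpers for illustration only; names, signatures, and
# thresholds are assumptions, not the notebook's actual API.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_homopolymers(seq, max_run=5):
    """Return {start: run} for every single-base run longer than max_run."""
    pattern = r"(.)\1{%d,}" % max_run  # one base followed by >= max_run copies
    return {m.start(): m.group() for m in re.finditer(pattern, seq.upper())}


def find_sites(seq, sites):
    """Return {start: matched_site} for each forbidden site on either strand."""
    seq = seq.upper()
    hits = {}
    for site in sites:
        for target in {site, revcomp(site)}:  # a set, so palindromes are searched once
            start = seq.find(target)
            while start != -1:
                hits[start] = target
                start = seq.find(target, start + 1)  # allow overlapping hits
    return hits


def find_gc_extremes(seq, window=20, low=0.25, high=0.75):
    """Return {start: gc_fraction} for each window with GC outside [low, high]."""
    seq = seq.upper()
    hits = {}
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        if gc < low or gc > high:
            hits[i] = gc
    return hits
```

Each function returns a dict keyed by start position, which is roughly the shape described above; applying one across a dataframe column would give a column of dicts.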

To generate a random sequence with constraints using these functions, you just generate any random sequence, then feed it into cleanGeneSeq, which performs all the problematic-subsequence removals listed above.
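The generate-then-clean idea can be sketched as follows. This is a stand-in, not cleanGeneSeq itself (I'm not reproducing its signature here): problem_spans and random_seq_with_constraints are hypothetical names, the forbidden-site list is just BsaI and BbsI on both strands as an example, and it only handles homopolymers and restriction sites, not GC content or codon usage. The repair step simply re-randomizes the offending bases and rechecks, which is the simplest way to show the "fix, then check whether the fix broke something else" loop.

```python
import random
import re

# Illustrative forbidden sites: BsaI (GGTCTC) and BbsI (GAAGAC), both strands.
FORBIDDEN_SITES = ("GGTCTC", "GAGACC", "GAAGAC", "GTCTTC")


def problem_spans(seq, max_run=5, sites=FORBIDDEN_SITES):
    """Return (start, end) spans covering homopolymer runs and forbidden sites."""
    spans = [m.span() for m in re.finditer(r"(.)\1{%d,}" % max_run, seq)]
    for site in sites:
        # Zero-width lookahead so overlapping occurrences are all reported.
        for m in re.finditer("(?=%s)" % site, seq):
            spans.append((m.start(), m.start() + len(site)))
    return spans


def random_seq_with_constraints(length, rng=None, max_rounds=1000):
    """Generate random DNA, then re-randomize problem spans until none remain."""
    rng = rng or random.Random()
    bases = [rng.choice("ACGT") for _ in range(length)]
    for _ in range(max_rounds):
        spans = problem_spans("".join(bases))
        if not spans:
            return "".join(bases)
        for start, end in spans:
            for i in range(start, end):
                bases[i] = rng.choice("ACGT")  # may create new problems; rechecked next round
    raise RuntimeError("could not satisfy constraints within max_rounds")


if __name__ == "__main__":
    padding = random_seq_with_constraints(200, random.Random(0))
    print(len(padding), problem_spans(padding))
```

Passing a seeded random.Random makes the output reproducible, which would matter if padding sequences need to be regenerated identically for a distribution build.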

FWIW I also think that if these functions could be refactored as a small package (something that I, admittedly, do not currently know how to do), they may be quite useful for the iGEM Engineering Committee's Distro efforts, especially for team members who aren't comfortable reading/writing in Go.

eyesmo commented 2 years ago

Two types of problematic sequence that these functions can't check for yet are regions of excessively stable predicted secondary structure, which could cause mRNA hairpins that block or slow down translation, and repeated non-homopolymer subsequences, which might interfere with polymerase-cycling-assembly-based gene synthesis and increase genetic instability via recombination. However, Poly has functions for predicting secondary structure, identifying excessively stable secondary structures, and mutating the sequence to remove such secondary structures. I believe Poly also has functions for identifying and removing subsequences that are repeats within the part, or even relative to a set of 'background' sequences (e.g. an organism's genome or the rest of a wetware library). Isaac G wrote a Colab notebook tutorial demonstrating how to use these functions, which you can see here.
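For the repeat check specifically, a very simple Python version is easy to sketch. This is not Poly's implementation or API; find_repeats and its k parameter are hypothetical, and it only looks for exact direct repeats within a single sequence (no reverse complements, no background set, no secondary structure).

```python
from collections import defaultdict


def find_repeats(seq, k=12):
    """Return {kmer: [start, ...]} for non-homopolymer k-mers seen more than once."""
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {
        kmer: starts
        for kmer, starts in positions.items()
        if len(starts) > 1 and len(set(kmer)) > 1  # skip pure homopolymer k-mers
    }
```

Checking against a set of background sequences would mean building the same k-mer index over those sequences first and then looking up each k-mer of the part in it.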

iGEM-Engineering / iGEM-distribution

Add padding to short sequences #213