Open jakebeal opened 2 years ago
A couple of ideas:
That doesn't solve the problem but decreases the probability of calling an incompatibility.
D1005 is a bit of an exception because it's for a CDS part and relies on the starting ATG for the fusion site, so we could search for ggtctcgaatg rather than ggtctcga. If we do that, we may run into another issue (#166, I think).
We could add a "barcode" on the 5' side that would also decrease the probability of finding it by chance. We're probably not looking for our own added flanking sequences.
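To put a rough number on "decreases the probability": both ideas above lengthen the pattern being searched for. Here is a minimal sketch, assuming bases are uniformly random and independent (real sequences are not, so treat the figures as order-of-magnitude only). The function names are made up for illustration.

```python
FLANK_5P = "ggtctcga"             # D1005 5' flanking site quoted in this thread
FLANK_5P_CDS = FLANK_5P + "atg"   # longer pattern suggested for CDS parts


def chance_prefix_probability(pattern: str) -> float:
    """Probability that a uniformly random DNA sequence starts with `pattern`."""
    return 0.25 ** len(pattern)


def pattern_for(part_type: str) -> str:
    """Pick the search pattern; only CDS parts get the ATG-extended form."""
    return FLANK_5P_CDS if part_type.lower() == "cds" else FLANK_5P


if __name__ == "__main__":
    for pattern in (FLANK_5P, FLANK_5P_CDS):
        print(pattern, chance_prefix_probability(pattern))
```

Under that assumption, going from 8 to 11 bases cuts the chance of a coincidental match by a factor of 64 (about 1 in 65,536 down to about 1 in 4.2 million). It shrinks the problem but does not remove it.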
I do think barcoding is worth considering for a variety of reasons (separate issue for this though), but as @GC-repeat pointed out, that won't help us when we're not looking for our own flanking sequences.
Should we assume at scan time that we don't know any properties of the part and its sequence itself, or the part in backbone and its sequence itself? E.g., do we know the bp size of the part, the AA sequence, the plasmid it is in, the part type, etc.? Asking because, if we do, could we run probabilistic checks to determine whether a flanking sequence would exist relative to the known properties? Not a silver bullet!
From what I'm hearing, it sounds like there is no way to determine for certain whether an anonymous sequence imported from elsewhere starts with a flanking sequence or not. I'd recommend against using probabilistic checks, since that's saying we're OK with undetectable failures.
Instead, my recommendation would be to assume there are no flanking regions and to consider it an error if we find a part with unmarked flanking regions. The fix for an unmarked flanking region is then to explicitly mark the flanking region. We can have the automation make a suggestion for doing so, but it will need to be a human who decides whether to accept or reject that suggestion.
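A minimal sketch of what that policy could look like in a validator. The `Part` class and its `flank_5p_len` field are hypothetical stand-ins for however flanking regions end up being marked in a package, and only the D1005 5' site from this thread is checked.

```python
from dataclasses import dataclass
from typing import Optional

FLANK_5P = "ggtctcga"  # D1005 5' flanking site quoted in this thread


@dataclass
class Part:
    name: str
    sequence: str
    # Hypothetical annotation: number of leading bases explicitly marked
    # as a flanking region (0 means nothing is marked).
    flank_5p_len: int = 0


def check_unmarked_flank(part: Part) -> Optional[str]:
    """Return an error message with a suggested fix, or None if the part is OK."""
    starts_with_flank = part.sequence.lower().startswith(FLANK_5P)
    is_marked = part.flank_5p_len >= len(FLANK_5P)
    if starts_with_flank and not is_marked:
        return (
            f"{part.name}: sequence starts with {FLANK_5P} but no flanking "
            f"region is marked; suggested fix: mark bases 1-{len(FLANK_5P)} "
            "as a flanking region (a human must accept or reject this)"
        )
    return None
```

The function only reports and suggests; it never edits the part, so the accept/reject decision stays with a person.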
When did you want to check for flanking regions? When adding ours, or when creating the packages so they can be curated right away?
@GC-repeat We are likely to want to be able to check at multiple times. Certainly we will need to do so when validating a package. We will also need to do so if we want to automate the addition of flanking regions.
Not sure if I'm on point, but GoldenBraid's domesticator does something similar to what @jakebeal proposes. You provide it with your FASTA sequence, it removes all unwanted sites within the seq and then assigns flanking seqs according to custom prefix/suffixes or according to Phytobrick standard.
The algorithm assumes that there is no flanking sequence already and adds it in the file you finally download. Combining this idea, we could scan the 5' and 3' ends of the seq for flanking seqs against a known list of sequences, or maybe somehow ask whether the sequence provided contains custom flanking sequences.
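A rough sketch of the end-scanning idea, to make it concrete. This is not how GoldenBraid's domesticator works internally; it is just a lookup against a known list. The only real entry is the D1005 5' site from this thread; the 3' entry is a made-up placeholder and would need to be replaced with the actual 3' flanking sequences.

```python
from typing import Dict, List

KNOWN_5P: Dict[str, str] = {"D1005": "ggtctcga"}       # from this thread
KNOWN_3P: Dict[str, str] = {"example_3p": "tgagacc"}   # hypothetical placeholder


def scan_ends(sequence: str) -> Dict[str, List[str]]:
    """Report which known flanking sequences appear at each end of `sequence`."""
    seq = sequence.lower()
    return {
        "5p": [name for name, flank in KNOWN_5P.items() if seq.startswith(flank)],
        "3p": [name for name, flank in KNOWN_3P.items() if seq.endswith(flank)],
    }


if __name__ == "__main__":
    print(scan_ends("ggtctcgataatcctgagacc"))
    print(scan_ends("agctaacctaat"))
```

A hit here only says the ends look like known flanking sequences; it cannot say whether they were put there on purpose.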
A question that will become important when we add flanking site automation: is there any way to distinguish between a part that already has a flanking site vs. a part that has an illegal cut site that happens to be identical to a flanking site?
Consider, for example, the 5' flanking site D1005 from #150: ggtctcga
If we scan a part sequence and find that it starts with agctaacctaat..., then we know that it doesn't have the flanking site and we can safely add it.
If we scan a part sequence and find that it starts with ggtctcgataat..., however, is there any way to tell if we are dealing with: 1) a part that already has a flanking sequence built in (which is OK, we just need to annotate it), or 2) a part that is incompatible with this assembly and will fail if we use its prefix like a flanking sequence?
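The two cases above can be sketched as a prefix check. Note that the check can only return "absent" or "ambiguous"; there is no outcome that separates case 1 from case 2, which is exactly the question. `PrefixStatus` and `classify_prefix` are illustrative names, not existing code.

```python
from enum import Enum


class PrefixStatus(Enum):
    ABSENT = "absent"        # flanking site not present; safe to add it
    AMBIGUOUS = "ambiguous"  # built-in flank OR coincidental cut site


def classify_prefix(sequence: str, flank: str = "ggtctcga") -> PrefixStatus:
    """Classify a part by whether it starts with the given 5' flanking site."""
    if sequence.lower().startswith(flank.lower()):
        return PrefixStatus.AMBIGUOUS
    return PrefixStatus.ABSENT


if __name__ == "__main__":
    print(classify_prefix("agctaacctaat"))  # PrefixStatus.ABSENT
    print(classify_prefix("ggtctcgataat"))  # PrefixStatus.AMBIGUOUS
```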