Minimal information to represent an assembly

manulera commented 8 months ago

Tagging @BjornFJohansson @jakebeal @dgruano for input.

As described in more detail in https://github.com/BjornFJohansson/pydna/issues/165, I have recently implemented an alternative Assembly class based on the original pydna class. The fully documented source code can be seen here, but in essence there are three types of inputs:

A list of sequences (as Dseqrecord objects)
A function to find common substrings among pairs of sequences (by default common substrings anywhere, but it can accept functions that only find common substrings at the edges of the sequences, or one could also think of finding common restriction sites)
Constraints (how long must the common substring be for it to be considered, should all fragments be used, etc.)
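
To make the second input more concrete, here is a minimal sketch of what such a function could look like. It is not pydna's implementation: the name common_substrings, the limit parameter and the (start_a, start_b, length) return format are assumptions made for illustration, and the comparison is assumed to be case-insensitive.

```python
def common_substrings(a: str, b: str, limit: int = 4) -> list[tuple[int, int, int]]:
    """Return (start_a, start_b, length) for every maximal exact match
    between a and b that is at least `limit` bases long.

    Naive O(len(a) * len(b)) dynamic programming, for illustration only.
    """
    # Assumption: sequence case carries no meaning for matching.
    a, b = a.lower(), b.lower()
    matches = []
    # prev[j + 1] holds the length of the common run ending at a[i - 1], b[j]
    prev = [0] * (len(b) + 1)
    for i in range(len(a)):
        curr = [0] * (len(b) + 1)
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            curr[j + 1] = prev[j] + 1
            # Report a run only once, when it cannot be extended to the right.
            at_end = i + 1 == len(a) or j + 1 == len(b)
            if (at_end or a[i + 1] != b[j + 1]) and curr[j + 1] >= limit:
                n = curr[j + 1]
                matches.append((i - n + 1, j - n + 1, n))
        prev = curr
    return matches


# The two fragments used in the join example of this post
print(common_substrings("AacgatCAtgctccaa", "TtgctccTAAattctgc", limit=6))
```

A function restricted to the edges of the sequences, or one based on restriction sites, could keep the same signature and simply return fewer matches.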

Representing the join of two fragments

An assembly is then represented as a list of "joins" between fragments, where each join is represented as (u, v, loc_u, loc_v).

u and v are integers, representing the index (1-based) of a joined fragment from the input list. The sign of the node key represents the orientation of the fragment, positive for forward orientation, negative for reverse orientation.
loc_u and loc_v are the locations of the common substring among u and v.

For example, the joining of the left part of fragment 1 and the right part of fragment 2 through their homology as shown below

1 AacgatCAtgctccaa                      ......
          ||||||            ==> AacgatCAtgctccTAAattctgc
2        TtgctccTAAattctgc

would be represented as (1, 2, [8:14](+), [1:7](+)). Here, locations are written the way Biopython writes them, but any representation would be fine.

If fragment 2 were reverse complemented in the input given to the assembly, the same join would be represented as (1, -2, [8:14](+), [1:7](+)). The strand in the location is not strictly necessary, so it could be omitted.
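
As a rough illustration of the (u, v, loc_u, loc_v) join, the sketch below applies one join to plain strings. The names Join, oriented and apply_join are hypothetical and not part of pydna. Locations are assumed to be 0-based, end-exclusive (start, end) pairs without strand, and they are assumed to refer to the fragment after it has been put in the orientation given by the sign of its index.

```python
from typing import NamedTuple


class Join(NamedTuple):
    u: int                   # 1-based fragment index; negative = reverse complement
    v: int
    loc_u: tuple[int, int]   # (start, end) of the common substring on u
    loc_v: tuple[int, int]   # (start, end) of the common substring on v


_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def oriented(fragments: list[str], key: int) -> str:
    """Return fragment |key| in the orientation given by the sign of key."""
    seq = fragments[abs(key) - 1]
    return seq if key > 0 else seq.translate(_COMPLEMENT)[::-1]


def apply_join(fragments: list[str], join: Join) -> str:
    """Join u and v through their common substring, keeping the left of u and the right of v."""
    left = oriented(fragments, join.u)
    right = oriented(fragments, join.v)
    overlap_u = left[join.loc_u[0]:join.loc_u[1]]
    overlap_v = right[join.loc_v[0]:join.loc_v[1]]
    if overlap_u.lower() != overlap_v.lower():
        raise ValueError("the two locations do not cover the same sequence")
    return left[:join.loc_u[1]] + right[join.loc_v[1]:]


fragments = ["AacgatCAtgctccaa", "TtgctccTAAattctgc"]
print(apply_join(fragments, Join(1, 2, (8, 14), (1, 7))))
# AacgatCAtgctccTAAattctgc
```

Under these assumptions, passing fragment 2 reverse complemented and using Join(1, -2, (8, 14), (1, 7)) gives the same product.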

Representing an assembly

An assembly can then be represented as a list of input fragments and a tuple of joins as described above, like this:

Linear: ((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'))
Circular: ((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'), (3, 1, '3[12:17](+):1[1:6](+)'))

Note that the first and last fragment are the same in a circular assembly.

De-duplication

The same sequence output of an assembly can be described in several ways:

Linear outputs can be described in forward and reverse orientation
Circular outputs can be described in forward and reverse orientation, and all their circular permutations

To prevent duplication, the following constraints are applied:

Linear assemblies: the first fragment is in the forward orientation.
Circular assemblies: the first fragment is in the forward orientation, and has the smallest index in the input fragment list.
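
The two constraints above could be checked with something like the sketch below. The function name is_canonical is hypothetical, and joins are assumed to be (u, v, loc_u, loc_v) tuples as described earlier. The function only tests the two stated rules, it does not try to rewrite an assembly into its canonical form.

```python
def is_canonical(joins, circular: bool) -> bool:
    """Check the de-duplication constraints on a sequence of (u, v, loc_u, loc_v) joins."""
    first = joins[0][0]
    # Both linear and circular assemblies must start with a forward fragment.
    if first < 0:
        return False
    if circular:
        # In a circular assembly every fragment is the u of exactly one join,
        # so the u values cover all fragments used.
        return first == min(abs(join[0]) for join in joins)
    return True


loc = (0, 0)  # placeholder location, irrelevant for this check
circle = [(1, 2, loc, loc), (2, 3, loc, loc), (3, 1, loc, loc)]
rotated = circle[1:] + circle[:1]
print(is_canonical(circle, circular=True), is_canonical(rotated, circular=True))
# True False
```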

Use cases

Based on pydna's current uses, and some more:

Gibson assembly
Homologous recombination
Representation of ligation of fragments with sticky overhangs (algorithm should return the location of compatible overhangs)
One step restriction-ligation (the algorithm could return common substrings based on the cutsite of restriction enzymes provided by the user)

Feedback

This is meant to be the minimal information that could then be translated into SBOL format. Any limitation or improvement to this? Feel free to leave your thoughts.

manulera commented 8 months ago

I already see a shortcoming for the homologous recombination case: you may want to represent a homologous recombination between template 1 and insert 2 as below:

2          ttttxxxxxxxxxcccc
1 ---------tttt---------cccc------

This could be represented as ((1,2,loc1A,loc2A), (2,1,loc2B,loc1B)). This looks like a circular assembly (the first and last integers are the same). However, it would currently be considered an invalid assembly because loc1B is after loc1A. This constraint is there to avoid impossible assemblies, for example:

Compatible (overlap of 1 and 2 occurs before overlap of 2 and 3):

(1,2,[2:9],[0:7]), (2,3,[12:19],[0:7])
    -- A --
1 gtatcgtgt     -- B --
2   atcgtgtactgtcatattc
3               catattcaa

Incompatible (overlap of 1 and 2 occurs after overlap of 2 and 3):

(1,2,[2:9],[13:20]), (2,3,[0:7],[0:7])
                -- A --
 1 -- B --    gtatcgtgt
 2 catattcccccccatcgtgtactgt
 3 catattcaa
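
The ordering constraint illustrated by these two examples can be sketched as follows. This is only an illustration for linear chains of joins: the name locations_are_ordered is hypothetical, and comparing the start positions of the two locations is an assumption about how "before" is decided.

```python
def locations_are_ordered(joins) -> bool:
    """For each fragment in the middle of a linear chain, check that the overlap
    through which it is entered starts before the overlap through which it is left.

    joins: sequence of (u, v, loc_u, loc_v), with locations as (start, end) pairs.
    """
    for (_, v, _, loc_in), (u, _, loc_out, _) in zip(joins, joins[1:]):
        if v != u:
            raise ValueError("consecutive joins must share a fragment")
        if loc_in[0] >= loc_out[0]:
            return False
    return True


compatible = [(1, 2, (2, 9), (0, 7)), (2, 3, (12, 19), (0, 7))]
incompatible = [(1, 2, (2, 9), (13, 20)), (2, 3, (0, 7), (0, 7))]
print(locations_are_ordered(compatible), locations_are_ordered(incompatible))
# True False
```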

It could be made a rule that assemblies in which the first and last fragment are the same are circular if loc1B < loc1A and an integration otherwise. Alternatively, circular could be a separate boolean property of the assembly.

EDIT: That does not work as a general rule, so I think that the circularity of the assembly should be represented in a separate field.
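
A minimal sketch of what a separate field could look like is shown below. AssemblyPlan and closes_on_first_fragment are hypothetical names, and the coordinates are simply read off the integration diagram above (tttt at [9:13] and cccc at [22:26] on fragment 1, tttt at [0:4] and cccc at [13:17] on fragment 2), so they are illustrative only.

```python
from dataclasses import dataclass

Location = tuple[int, int]                    # (start, end), 0-based, end-exclusive
Join = tuple[int, int, Location, Location]    # (u, v, loc_u, loc_v)


@dataclass(frozen=True)
class AssemblyPlan:
    joins: tuple[Join, ...]
    circular: bool  # stored explicitly instead of being inferred from the joins

    def closes_on_first_fragment(self) -> bool:
        """True if the last join ends on the fragment the first join starts from."""
        return abs(self.joins[0][0]) == abs(self.joins[-1][1])


# Integration of insert 2 into template 1: starts and ends on fragment 1, but is linear.
integration = AssemblyPlan(
    joins=((1, 2, (9, 13), (0, 4)), (2, 1, (13, 17), (22, 26))),
    circular=False,
)
print(integration.closes_on_first_fragment(), integration.circular)
# True False
```

With the flag stored explicitly, an integration and a circular assembly that use the same fragments are distinguished by the circular field rather than by comparing locations.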

BjornFJohansson commented 8 months ago

This is interesting and I have been thinking about a rewrite as well. I agree that the present one is very complex and perhaps does too many things at once.

I'll try to read and understand it and get back to you.
