manulera / ShareYourCloning_backend

The backend application for ShareYourCloning
MIT License

Minimal information to represent an assembly #51

Closed manulera closed 3 weeks ago

manulera commented 8 months ago

Tagging @BjornFJohansson @jakebeal @dgruano for input.

As described in more detail in https://github.com/BjornFJohansson/pydna/issues/165, I have recently implemented an alternative Assembly class based on the original pydna class. The fully documented source code can be seen here, but in essence there are three types of inputs:

Representing the join of two fragments

An assembly is then represented as a list of "joins" between fragments, where each join is represented as (u, v, loc_u, loc_v).

For example, joining the left part of fragment 1 to the right part of fragment 2 through their shared homology, as shown below,

1 AacgatCAtgctccaa                      ......
          ||||||            ==> AacgatCAtgctccTAAattctgc
2        TtgctccTAAattctgc

would be represented as (1, 2, [8:14](+), [1:7](+)). Here, locations use Biopython's notation, but any representation would be fine.

If fragment 2 were reverse-complemented in the input given to the assembly, the same join would be represented as (1, -2, [8:14](+), [1:7](+)). The strand in the location is not strictly necessary, so it could be omitted.
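A minimal Python sketch of the join tuple described above, using only the standard library. The names `Join`, `get_fragment` and `apply_join` are hypothetical and not part of pydna, locations are plain `(start, end)` pairs rather than Biopython objects, and the code assumes that a negative index means "use the reverse complement of that input fragment", with locations referring to the fragment in the orientation in which it is used.

```python
from typing import List, NamedTuple, Tuple

# Translation table for DNA complement that preserves letter case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string, keeping letter case."""
    return seq.translate(_COMPLEMENT)[::-1]


class Join(NamedTuple):
    """Hypothetical container for one join (u, v, loc_u, loc_v)."""
    u: int                   # 1-based fragment index, negative = reverse complement
    v: int
    loc_u: Tuple[int, int]   # (start, end) of the homology in fragment u
    loc_v: Tuple[int, int]   # (start, end) of the homology in fragment v


def get_fragment(fragments: List[str], index: int) -> str:
    """Fetch a fragment by signed 1-based index (assumed convention)."""
    seq = fragments[abs(index) - 1]
    return reverse_complement(seq) if index < 0 else seq


def apply_join(fragments: List[str], join: Join) -> str:
    """Join u and v through their homology: u up to the end of the
    overlap, followed by v after the overlap."""
    seq_u = get_fragment(fragments, join.u)
    seq_v = get_fragment(fragments, join.v)
    (u_start, u_end), (v_start, v_end) = join.loc_u, join.loc_v
    if seq_u[u_start:u_end].lower() != seq_v[v_start:v_end].lower():
        raise ValueError("the two locations do not contain the same sequence")
    return seq_u[:u_end] + seq_v[v_end:]


fragments = ["AacgatCAtgctccaa", "TtgctccTAAattctgc"]
product = apply_join(fragments, Join(1, 2, (8, 14), (1, 7)))

# Same join when fragment 2 is supplied reverse-complemented.
flipped = [fragments[0], reverse_complement(fragments[1])]
product_flipped = apply_join(flipped, Join(1, -2, (8, 14), (1, 7)))
```

Under these assumptions both calls yield the product shown in the diagram, since the sign on the fragment index undoes the reverse complementation of the input.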

Representing an assembly

An assembly can then be represented as a list of input fragments and a tuple of joins as described above, like this:

De-duplication

The same sequence output of an assembly can be described in several ways:

To prevent duplication, the following constraints are applied:

Use cases

Based on pydna's current uses, and some more

Feedback

This is meant to be the minimal information that could then be translated into SBOL format. Any limitation or improvement to this? Feel free to leave your thoughts.

manulera commented 8 months ago

I already see a shortcoming in the homologous recombination case: you may want to represent a homologous recombination between template 1 and insert 2 as below:

2          ttttxxxxxxxxxcccc
1 ---------tttt---------cccc------

This could be represented as ((1,2,loc1A,loc2A), (2,1,loc1B,loc2B)). This looks like a circular assembly (the first and last fragment indices are the same). However, it would currently be considered an invalid assembly because loc1B comes after loc1A. This constraint is there to avoid impossible assemblies, for example:

Compatible (overlap of 1 and 2 occurs before overlap of 2 and 3):

(1,2,[2:9],[0:7]), (2,3,[12:19],[0:7])
   -- A --
1 gtatcgtgt     -- B --
2   atcgtgtactgtcatattc
3               catattcaa

Incompatible (overlap of 1 and 2 occurs after overlap of 2 and 3):

(1,2,[2:9],[13:20]), (2,3,[0:7],[0:7])
                -- A --
 1 -- B --    gtatcgtgt
 2 catattcccccccatcgtgtactgt
 3 catattcaa
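The ordering check illustrated by these two examples could be sketched as follows. This is only an illustration, not pydna's implementation: the function name `joins_are_ordered` is hypothetical, joins are plain tuples with `(start, end)` locations, and "occurs before" is assumed to mean that the overlap entering a fragment starts before the overlap leaving it. Only consecutive joins of a linear assembly are compared.

```python
def joins_are_ordered(joins):
    """Return True if, on every shared fragment, the incoming overlap
    starts before the outgoing overlap (assumed definition of 'before')."""
    for (_, v_prev, _, loc_in), (u_next, _, loc_out, _) in zip(joins, joins[1:]):
        if v_prev != u_next:
            raise ValueError("consecutive joins must share a fragment")
        if loc_in[0] >= loc_out[0]:
            return False
    return True


# The two examples above, with locations written as (start, end) pairs.
compatible = ((1, 2, (2, 9), (0, 7)), (2, 3, (12, 19), (0, 7)))
incompatible = ((1, 2, (2, 9), (13, 20)), (2, 3, (0, 7), (0, 7)))

print(joins_are_ordered(compatible))
print(joins_are_ordered(incompatible))
```

With this definition the first example passes and the second fails, because on fragment 2 the overlap with fragment 1 ([13:20]) starts after the overlap with fragment 3 ([0:7]).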

One possible rule would be that assemblies in which the first and last fragment are the same are circular if loc1A < loc1B, and an integration otherwise. Alternatively, circular could be a separate boolean property of the assembly.

EDIT: That does not work as a general rule, so I think that the circularity of the assembly should be represented in a separate field.
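As a rough sketch of what a separate circularity field might look like, the record below stores the fragments, the joins, and an explicit `circular` flag. The class name `AssemblyRecord`, its method, and the filler bases in the example sequences are all hypothetical; the sequences only mimic the layout of the recombination diagram above, and join locations are listed in (u, v, loc_u, loc_v) order.

```python
from dataclasses import dataclass
from typing import Tuple

Location = Tuple[int, int]                     # (start, end)
Join = Tuple[int, int, Location, Location]     # (u, v, loc_u, loc_v)


@dataclass(frozen=True)
class AssemblyRecord:
    """Hypothetical minimal record: fragments, joins, explicit circularity."""
    fragments: Tuple[str, ...]
    joins: Tuple[Join, ...]
    circular: bool

    def starts_and_ends_on_same_fragment(self) -> bool:
        # True for circular assemblies and for integrations alike,
        # which is why the join list alone cannot tell them apart.
        return abs(self.joins[0][0]) == abs(self.joins[-1][1])


# Template 1 and insert 2, laid out like the diagram (filler bases are made up).
template = "a" * 9 + "tttt" + "a" * 9 + "cccc" + "a" * 6
insert = "tttt" + "g" * 9 + "cccc"

integration = AssemblyRecord(
    fragments=(template, insert),
    joins=(
        (1, 2, (9, 13), (0, 4)),      # template -> insert through "tttt"
        (2, 1, (13, 17), (22, 26)),   # insert -> template through "cccc"
    ),
    circular=False,
)
```

Here the first and last fragment are the same, yet the record is marked as linear, so the integration is not mistaken for a circular assembly.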

BjornFJohansson commented 8 months ago

This is interesting and I have been thinking about a rewrite as well. I agree that the present one is very complex and perhaps does too many things at once.

I'll try to read and understand, and get back to you.