CompositePart API integration

Koeng101 commented 5 years ago

Is your feature request related to a problem? Please describe.

I would like to have API endpoints that would let me start adding CompositeParts to the database.

Describe the solution you'd like

POST, PUT, and GET endpoints for CompositeParts, as well as a documentation update at https://freegenes.dev/api/docs/. After this is done, I can begin adding CompositeParts.

vsoch commented 5 years ago

I'd like to suggest we start with a form of some type to create this - because then you can easily select Parts from current existing. Can you give me some dummy data (with some random set of parts) for me to test?

Koeng101 commented 5 years ago

The main reason I ask for this is because we're going to need to update ALL samples, and each composite is probably going to have on the order of 10 parts. If I were just selecting from a list, that's going to be about 10,000 searches + clicks.

I don't really want to have to manually go through a form thousands of times. Plus, once I actually get to updating the data, we can have a better sense of what is necessary in the form.

vsoch commented 5 years ago

ah that's definitely the wrong use case for a form! (Although maybe in the future it would be wanted?)

So for the post request, I'm guessing you would want to provide name, composite type, composite id, and then a list of unique ids for the parts to include (order matters). And then description (and composite_id?) should be optional, and the sequence is just some sequence field from the part being put together from 1..N?

What I can do is implement the API views, test them with dummy data locally, and then add a function to freegenes-python so you can do something like:

response = client.create_composite_part(name="party",
                                        composite_type="...",
                                        composite_id="...",
                                        part_ids=[uuid1, uuid2, uuid3])

vsoch commented 5 years ago

okay I have a basic create view (this is just locally) - as soon as we review required / not required fields, and talk about an example POST, I'll test that out locally, then develop into freegenes-python, then update the server when all is finished!

Koeng101 commented 5 years ago

Those look good to me!

One thing to discuss that I think might be important that I left out - parts have directions (not just order) in the composite. You could derive that from the sequence itself, but I think it would be a lot better if there was some way to attribute direction in the many-to-many table. I'm not sure how hard that would be to implement - and we could always implement it later once composite views become important (sequence holds all the data you need), but if it is simple enough, we could do it now.

vsoch commented 5 years ago

Hmm, could you give an example? You mean like ABC vs. CBA? You are saying that a part might be input into FreeGenes as ABC, but then would be used as CBA in a composite part?

Koeng101 commented 5 years ago

Like you could have

(A, ->), (B, ->), (C, ->) or (A, <-), (B, ->), (C, <-)

Basically, DNA has direction, so A binds your RNA polymerase, which then runs in the direction that the second element in the tuple encodes. In the first example B will have an RNA polymerase run through it, while in the second example B will not have an RNA polymerase run through it (thus no gene expression).

vsoch commented 5 years ago

Hmm, so would it be too simple to have a list, the same length as the number of parts, with some direction identifier?

Koeng101 commented 5 years ago

That works. Could even just be a string with <><<<><><>
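As a rough sketch of how such a string could be paired with the part list, assuming `>` means forward and `<` means reverse (the convention used later in this thread). The name `pair_directions` is hypothetical and not part of freegenes-python:

```python
def pair_directions(part_ids, direction_string):
    """Pair each part id with its direction character.

    Hypothetical helper: assumes one '>' (forward) or '<' (reverse)
    character per part, in the same order as part_ids.
    """
    if len(part_ids) != len(direction_string):
        raise ValueError(
            "expected %d direction characters, got %d"
            % (len(part_ids), len(direction_string))
        )
    invalid = set(direction_string) - {">", "<"}
    if invalid:
        raise ValueError("invalid direction characters: %s" % sorted(invalid))
    return list(zip(part_ids, direction_string))


# Mirrors the (A, <-), (B, ->), (C, <-) example above
print(pair_directions(["A", "B", "C"], "<><"))
# [('A', '<'), ('B', '>'), ('C', '<')]
```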

vsoch commented 5 years ago

okay I'm going to see if I can read / think more about this tonight. If you have some time, could you confirm the fields that are required for the endpoint? That's all that I need to finalize the endpoint and then implement freegenes-python. I assume that one POST == one CompositePart creation, and that order (and direction, which I'll need to figure out) matters?

Koeng101 commented 5 years ago

@vsoch wrote:

okay I have a basic create view (this is just locally) - as soon as we review required / not required fields, and talk about an example POST, I'll test that out locally, then develop into freegenes-python, then update the server when all is finished!

In this section, yes that request body makes sense, with the addition of "direction_string" and a part list.

Yes, order and direction matters, and one POST == one CompositePart
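For illustration, one request body per CompositePart could be built roughly like this. The field names the server accepts are not confirmed in this thread: `parts` and `direction_string` are assumptions based on the discussion, which fields end up required is still open, and `build_composite_payload` is a hypothetical helper:

```python
import json


def build_composite_payload(name, part_ids, direction_string,
                            composite_type=None, composite_id=None,
                            description=None):
    """Build the body for one POST == one CompositePart (field names assumed)."""
    payload = {
        "name": name,
        "parts": list(part_ids),               # order matters
        "direction_string": direction_string,  # one character per part
    }
    # Fields treated as optional here are only sent when provided
    optional = {
        "composite_type": composite_type,
        "composite_id": composite_id,
        "description": description,
    }
    for key, value in optional.items():
        if value is not None:
            payload[key] = value
    return payload


body = build_composite_payload("party", ["uuid1", "uuid2", "uuid3"], "><>")
print(json.dumps(body, indent=2))
```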

vsoch commented 5 years ago

okay awesome! I'll try to get all that stuff finished up tomorrow - since I have all that I need I should be able to get started working first thing in the morning.

vsoch commented 5 years ago

hey @Koeng101 which of the sequences are parsed together to form the new sequence? E.g.

- original_sequence
- optimized_sequence 
- synthesized_sequence
- full_sequence

I assume at some future point we will want to search over the sequences, so we want to store it (as discussed previously!) What I'm doing is putting the new sequence together based on a combination of parts (and directions) - I just want to make sure I use the right one.

Koeng101 commented 5 years ago

optimized_sequence is the one that can be parsed together to form a new sequence, though normally there are quite a few unannotated regions. I could make new parts for each unannotated region, but it seems like it would be easier to just keep the sequences separate.

vsoch commented 5 years ago

So - do we want to store a sequence for a CompositePart or no? What I'm doing now is not requiring it, and piecing it together from the parts supplied, e.g.,:

        sequence = ""
        for i, part_id in enumerate(part_ids):
            part = Part.objects.get(uuid=part_id)
            direction = direction_string[i]

            # Forward >
            if direction == ">":
                sequence = sequence + part.optimized_sequence

            # Reverse <
            else:
                sequence = sequence + part.optimized_sequence[::-1]
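A self-contained version of the same idea, for trying it out without the database. This is only a sketch: a plain dict stands in for the `Part` lookup, `assemble_sequence` is a hypothetical name, and the reverse direction is a plain string reversal exactly as in the snippet above:

```python
def assemble_sequence(part_ids, direction_string, parts):
    """Concatenate optimized sequences, reversing parts marked '<'.

    parts: dict mapping part id -> optimized sequence (a stand-in for
    the database lookup in the snippet above).
    """
    pieces = []
    for part_id, direction in zip(part_ids, direction_string):
        seq = parts[part_id]
        # Forward parts are used as-is, reverse parts are string-reversed
        pieces.append(seq if direction == ">" else seq[::-1])
    return "".join(pieces)


toy_parts = {"a": "ATGC", "b": "GGTA"}
print(assemble_sequence(["a", "b"], "><", toy_parts))  # ATGCATGG
```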

Koeng101 commented 5 years ago

I think it should be required, but without a method that tries to default it to something. Data integrity is really important there, so we want to make sure that the user is being deliberate with the sequence

vsoch commented 5 years ago

So the user is required to assemble the sequence? You don't think that's a lot to ask?

So - linking to the parts and direction isn't a sure-fire formula to actually reproduce a sequence for a composite part - I might know parts A,B, and C, and directions all forward >>> but the sequence that they create doesn't necessarily mean A.sequence + B.sequence + C.sequence.

If that's the case, why do we store the connection and direction of the parts to begin with?

vsoch commented 5 years ago

Just for clarity about what I've implemented so far - you start with selecting some number of part ids, and use the client function to create a composite:

name = "Another Dinosaur Part"
part_ids = ['4c3f7727-567f-44eb-a3e7-c9be20109e4f',
            '4001a70e-9451-4608-97fc-f6fb6da54ba3',
            '6e8841c6-b867-41ae-afee-e9388123a027']

composite_part = client.create_composite_part(name=name, part_ids=part_ids)

If the response is successful (response 201) you'll get the complete part back, including the assembled sequence:

# composite_part
{'uuid': '57ab6a75-e20b-444f-823b-3d17fc51ce82',
 'time_created': '2019-09-27T13:47:21.028694-05:00',
 'time_updated': '2019-09-27T13:47:21.028752-05:00',
 'name': 'Dinosaur Part',
 'description': None,
 'composite_id': None,
 'composite_type': None,
 'sequence': 'ATGCAGCGGGCGCGACCCACGCTCTGGGCCGCTGCGCTGACTCTGCTGGTGCTGCTCCGCGGGCCGCCGGTGGCGCGGGCTGGCGCGAGCTCGGCGGGCTTGGGTCCCGTGGTGCGCTGCGAGCCGTGCGACGCGCGTGCACTGGCCCAGTGCGCGCCTCCGCCCGCCGTGTGCGCGGAGCTGGTGCGCGAGCCGGGCTGCGGCTGCTGCCTGACGTGCGCACTGAGCGAGGGCCAGCCGTGCGGCATCTACACCGAGCGCTGTGGCTCCGGCCTTCGCTGCCAGCCGTCGCCCGACGAGGCGCGACCGCTGCAGGCGCTGCTGGACGGCCGCGGGCTCTGCGTCAACGCTAGTGCCGTCAGCCGCCTGCGCGCCTATCTGCTGCCAGCGCCGCCAGCTCCAGGTGAGCCGCCCGCGCCAGGAAATGCTAGTGAGTCGGAGGAGGACCGCAGCGCCGGCAGTGTGGAGAGCCCGTCAGTCTCCAGCACGCACCGGGTGTCTGATCCCAAGTTCCACCCCCTCCATTCAAAGATAATCATCATCAAGAAAGGGCATGCTAAAGACAGCCAGCGCTACAAAGTTGACTACGAGTCTCAGAGCACAGATACCCAGAACTTCTCCTCCGAGTCCAAGCGGGAGACAGAATATGGTCCCTGCCGTAGAGAAATGGAGGACACACTGAATCACCTGAAGTTCCTCAATGTGCTGAGTCCCAGGGGTGTACACATTCCCAACTGTGACAAGAAGGGATTTTATAAGAAAAAGCAGTGTCGCCCTTCCAAAGGCAGGAAGCGGGGCTTCTGCTGGTGTGTGGATAAGTATGGGCAGCCTCTCCCAGGCTACACCACCAAGGGGAAGGAGGACGTGCACTGCTACAGCATGCAGAGCAAGTGAATGTTGCTGCCTCTCCCTCCCCAGTCCTCAAAGCCTGTCCCGAAGAAAAGTCAAAGTTCCAAAATCGTGCCCTCACATCACGATCCATCCGAGAAAACCGGAGAGAACTGCCAGACAAAAATATCTCCTAGTTCACTTCAGGAAAGTCCCAGTAGCCTGCAGGGAGCACTCAAGAAACGCAGTGCATTTGAAGATCTTACCAATGCATCACAGTGTCAGCCCGTCCAGCCAAAGAAAGAAGCAAACAAAGAGTTCGTAAAGGTGGTGTCCAAGAAAATAAATCGCAACACGCACGCCCTTGGGCTGGCTAAGAAGAACAAACGAAATCTTAAATGGCACAAACTGGAGGTAACCCCAGTGGTCGCCAGTACTACCGTCGTGCCAAACATAATGGAGAAGCCGCTGATTCTGGATATCTCCACGACATCTAAGACGCCAAACACCGAGGAGGCAAGTCTTTTCAGGAAACCTTTGGTGTTGAAGGAGGAACCGACAATTGAAGATGAAACGCTCATCAATAAGTCCCTGTCTTTGAAAAAGTGCAGCAACCACGAGGAAGTCTCCCTCCTCGAGAAATTGCAACCTCTCCAAGAAGAATCTGATTCAGACGATGCCTTCGTAATAGAACCCATGACTTTTAAAAAGACGCACAAGACTGAAGAGGCAGCCATCACGAAGAAAACCCTTAGTCTCAAAAAGAAGATGTGTGCATCTCAGCGAAAACAATCTTGCCAGGAGGAAAGCCTCGCTGTACAGGACGTAAACATGGAAGAGGATTCCTTTTTTATGGAAAGTATGTCTTTCAAGAAAAAGCCTAAGACTGAGGAGAGCATACCGACACACAAATTGTCCAGCCTTAAAAAGAAGTGCACGATCTATGGTAAAATATGTCATTTCAGAAAACCTCCCGTTCTGCAAACGACAATTTGTGGCGCCATGTCTAGCATTAAGAAACCGACCACTGAGAAGGAAACGCTGTTCCAAGAACTCTCAGTTCTCCAAGAGAAACACACCACAGAGCATGAGATGTCCATTCTGAAAAAGTCATTGGCACTGCAAAAGACTAATTTCAAAGAGGACTCTCTCGTTAAG
GAGTCTCTCGCGTTCAAGAAGAAGCCTTCAACCGAGGAAGCGATAATGATGCCCGTAATCCTTAAGGAACAGTGCATGACTGAGGGAAAGCGGTCCCGCTTGAAGCCGCTTGTACTGCAGGAGATTACGAGTGGCGAAAAGAGCCTTATAATGAAGCCACTCAGCATAAAAGAGAAACCTTCAACTGAAAAGGAGTCATTTAGCCAAGAGCCGTCCGCGCTCCAAAAGAAACATACCACGCAAGAAGAAGTTTCTATCTTGAAGGAGCCTAGCTCACTCCTTAAAAGTCCCACGGAAGAGAGTCCCTTCGATGAGGCTTTGGCTTTTACGAAGAAATGTACGATAGAGGAGGCGCCACCCACCAAGAAACCACTGATCCTGAAGCGAAAACACGCGACCCAAGGTACTATGTCACATCTGAAAAAGCCATTGATCCTGCAAACTACCTCCGGGGAGAAGAGTCTGATTAAGGAACCACTTCCTTTTAAGGAAGAGAAGGTAAGCCTTAAGAAAAAGTGCACAACTCAAGAAATGATGTCAATTTGCCCGGAACTGCTCGATTTCCAAGACATGATAGGCGAAGATAAAAACAGCTTCTTTATGGAACCGATGTCCTTCCGCAAAAACCCAACCACTGAGGAAACAGTATTGACTAAGACCTCATTGTCACTCCAGGAGAAGAAAATTACTCAAGGAAAAATGTCCCATTTGAAAAAGCCGTTGGTACTTCAGAAGATTACGAGCGAGGAGGAGTCCTTTTACAAGAAACTTCTCCCGTTCAAGATGAAGTCAACGACTGAAGAGAAATTTCTGAGTCAGGAACCTAGTGCGTTGAAGGAAAAGCACACAACCCTCCAGGAAGTTTCATTGAGTAAGGAGAGCCTTGCAATCCAGGAGAAAGCGACAACTGAAGAAGAGTTTAGCCAAGAGCTTTTCAGCCTCCATGTAAAACACACAAATAAAAGCGGGAGTCTTTTTCAAGAGGCACTTGTTTTGCAGGAAAAAACGGATGCGGAGGAGGACTCACTTAAAAATCTCCTGGCGCTCCAGGAGAAGTCAACTATGGAAGAAGAAAGTCTGATTAACAAGCTTCTCGCTCTTAAAGAAGAACTGTCAGCGGAGGCCGCCACCAATATCCAGACGCAACTGTCATTGAAGAAAAAATCAACCAGTCACGGTAAGGTCTTTTTCCTGAAGAAGCAACTCGCTCTCAATGAAACCATCAACGAAGAAGAGTTCTTGAACAAACAGCCACTGGCATTGGAGGGTTACCCCAGTATAGCGGAGGGCGAAACGCTCTTTAAAAAGTTGCTTGCAATGCAAGAAGAACCGTCTATTGAAAAGGAGGCAGTTTTGAAGGAACCGACAATTGATACAGAAGCTCACTTCAAGGAGCCACTTGCATTGCAAGAGGAACCGTCAACGGAAAAAGAGGCGGTCCTCAAGGAACCATCTGTAGATACAGAGGCTCACTTTAAGGAAACGCTCGCTCTCCAAGAAAAACCCTCAATTGAACAAGAAGCCCTGTTTAAAAGGCATAGTGCACTGTGGGAGAAACCCTCAACTGAAAAGGAAACCATCTTCAAAGAGTCTCTGGACCTTCAAGAGAAACCCAGCATTAAGAAAGAAACCCTCCTTAAGAAACCCCTTGCGCTGAAAATGTCCACCATTAACGAAGCTGTATTGTTCGAGGACATGATAGCGCTGAACGAGAAGCCTACCACGGGTAAGGAGCTTAGTTTCAAAGAACCCTTGGCTTTGCAAGAATCTCCGACCTACAAAGAAGATACATTTCTTAAAACGCTGCTCGTACCTCAGGTTGGGACTTCTCCGAACGTCAGCTCCACCGCACCTGAAAGTATCACGTCCAAATCTAGTATCGCAACCATGACTTCCGTTGGAAAAAGCGGAACCATTAACGAAGCGTTCCTGTTTGAAGATATGATTACTCTGAATGAAAAACCTACCACGGGGAAAGAGCTCAGCTTCAAGGAACCATT
GGCGCTGCAGGAATCACCAACTTGCAAGGAAGATACCTTTCTCGAAACCTTTCTCATTCCTCAAATCGGTACCTCACCATACGTATTTTCTACCACACCGGAGAGTATCACCGAAAAGTCCAGCATAGCCACAATGACTAGTGTAGGGAAAAGTAGAACAACCACGGAATCCAGCGCGTGCGAGAGTGCATCAGATAAACCTGTCTCTCCTCAGGCAAAAGGAACACCGAAAGAGATCACCCCACGCGAGGATATAGACGAGGATTCCAGTGATCCCAGCTTCAATCCCATGTACGCGAAGGAAATCTTCTCCTATATGAAGGAACGAGAAGAACAGTTTATACTCACTGATTATATGAATCGGCAGATCGAAATTACCTCTGATATGAGAGCGATTCTCGTAGATTGGCTCGTAGAGGTCCAAGTGAGTTTCGAAATGACCCACGAAACACTCTATCTCGCTGTTAAGCTGGTTGATTTGTATCTCATGAAAGCCGTCTGTAAAAAGGACAAGCTGCAGCTCTTGGGAGCTACGGCATTTATGATCGCGGCGAAATTCGAGGAACACAACTCCCCGAGAGTTGATGACTTTGTATACATTTGCGACGACAATTACCAGAGAAGTGAAGTCCTTTCCATGGAAATAAATATTTTGAACGTACTCAAATGTGATATAAACATCCCGATAGCCTATCATTTTCTCCGCCGCTACGCCCGGTGCATCCACACCAATATGAAAACCCTGACGCTGAGTCGATATATCTGCGAGATGACACTTCAGGAGTACCACTATGTTCAGGAAAAAGCTAGTAAGCTCGCCGCGGCGAGCTTGCTTCTCGCGCTGTACATGAAAAAGTTGGGCTACTGGGTCCCTTTTCTCGAGCACTATAGTGGATACTCCATCTCTGAACTTCATCCTCTTGTAAGGCAGTTGAACAAGCTTCTTACCTTCTCTTCCTATGACTCCTTGAAAGCAGTTTACTATAAATACAGCCACCCGGTATTTTTTGAGGTCGCGAAAATACCCGCCCTTGATATGCTTAAACTCGAGGAAATTCTGAACTGCGATTGTGAAGCTCAAGGACTGGTCCTCTGAATGAATCCGGCCCTAGGCAACCAGACGGACGTGGCGGGCCTGTTCCTGGCCAACAGCAGCGAGGCGCTGGAGCGAGCCGTGCGCTGCTGCACCCAGGCGTCCGTGGTGACCGACGACGGCTTCGCGGAGGGAGGCCCGGACGAGCGTAGCCTGTACATAATGCGCGTGGTGCAGATCGCGGTCATGTGCGTGCTCTCACTCACCGTGGTATTCGGCATCTTCTTCCTCGGCTGCAATCTGCTCATCAAGTCCGAGGGCATGATCAACTTCCTCGTGAAGGACCGGAGGCCGTCTAAGGAGGTGGAGGCGGTGGTCGTGGGGCCCTACTGA',
 'parts': ['4c3f7727-567f-44eb-a3e7-c9be20109e4f',
  '4001a70e-9451-4608-97fc-f6fb6da54ba3',
  '6e8841c6-b867-41ae-afee-e9388123a027'],
 'label': 'compositepart'}

But you are saying you want the user to provide the sequence (required) so it would be like this:

name = "Another Dinosaur Part"
part_ids = ['4c3f7727-567f-44eb-a3e7-c9be20109e4f',
            '4001a70e-9451-4608-97fc-f6fb6da54ba3',
            '6e8841c6-b867-41ae-afee-e9388123a027']
sequence = "GTGTG...."

composite_part = client.create_composite_part(name=name, part_ids=part_ids, sequence=sequence)

Which of the above is correct?

vsoch commented 5 years ago

And if the user provides the sequence, what purpose does direction_string serve, if we can't validate anything?

vsoch commented 5 years ago

Anyhoo - this create function is almost ready to go - as soon as we confirm these details I can do PRs for the server and freegenes-python, and point you at some docs to get started!

I'm taking a short break to make some cocoa (been going at it since 8am and it's almost 3pm) and I'll be back after that!

vsoch commented 5 years ago

hey @Koeng101 I'm doing a dummy test: I've put together two optimized sequences (one forward and one reverse) and I get the right answer. However, two additional optimized sequences are found, and the reason is that they are really short (length 3) - so we find both TGA and AGT:

parts['129e6622-1f82-4de0-a24d-427d513f005d']['optimized_sequence']                                         
 'TGA'

We run into trouble because we need to define some ordering, but in this case we have two long sequences parsed together (like ><) and then this third part overlaps somewhere in there. What would be correct to do?

vsoch commented 5 years ago

okay I think I have a pretty cool solution! We first cache ALL parts from the API on the client - this is run once and takes maybe 20 seconds, but then you don't need to do it again for the session. After that, we search through all parts and look to see if the optimized sequence appears in the new sequence in either the forward or the reversed direction. That gives us a list of contender parts, and we also store the direction, start, and end index.

We then model it as a scheduling problem - so we sort the found parts into a queue based on the total length (where the longest is at the end) and then pop the longest from the end off the queue. While we have entries in the queue, we pop off the next entry, and add it to the final list of sequences only given that there isn't overlap. Given that it's sorted, this should give us the (likely) best match (meaning the longest sequences found that have no overlap).

import re

def derive_parts(self, sequence):
    '''based on a sequence, search all freegenes parts for the sequence,
       forward and backwards. This is done by the client (and not on the
       server) as to not tax the server. We cache the parts request to
       not make the same one over and over.

       Algorithm:
       =========
       1. Cache all parts from the API (one call)
       2. Find all forward and reverse substrings that match
       3. Model as interval scheduling problem

       If the user is interested in ALL possible combinations of parts,
       we would want to remove the "best solution" parts (the first part)
       from the list and try again.
    '''
    self._cache_parts()

    # Parts found to match
    coords = []

    for uuid, part in self.cache['parts'].items():

        # Only use parts with optimized sequences
        if part.get('optimized_sequence'):
            forward = part['optimized_sequence']
            reverse = forward[::-1]

            # Case 1: we found the forward sequence
            # (re.escape keeps the sequence a literal pattern)
            if forward in sequence:
                for match in re.finditer(re.escape(forward), sequence):
                    coords.append((part.get('uuid'), ">", match.start(), match.end()))

            # Case 2: we found the reverse sequence
            if reverse in sequence:
                for match in re.finditer(re.escape(reverse), sequence):
                    coords.append((part.get('uuid'), "<", match.start(), match.end()))

    # Make a queue sorted by how long they are (end - start), shortest first
    queue = sorted(coords, key=lambda tup: tup[3]-tup[2])

    def overlaps_with(selected_sequences, element):
        '''determine if an element overlaps with any current elements in the
           list
        '''
        for selected in selected_sequences:

            # The element starts inside a selected interval [start, end)
            if (element[2] >= selected[2]) and (element[2] < selected[3]):
                return True

            # The element ends inside a selected interval (start, end]
            if (element[3] > selected[2]) and (element[3] <= selected[3]):
                return True

        return False

    selected_sequences = []

    while queue:

        # Pop the longest element ( the last )
        element = queue.pop()

        # If there is no overlap add
        if not overlaps_with(selected_sequences, element):
            selected_sequences.append(element)

    # Need to sort them again, by the start
    selected_sequences = sorted(selected_sequences, key=lambda tup: tup[2])

    return selected_sequences
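To try the selection step without the API client, here is a standalone toy run of the same greedy idea. It is a sketch and not the freegenes-python implementation: `find_matches` and `select_longest` are hypothetical names, the parts are made-up sequences, and the overlap test is the symmetric form for half-open intervals rather than the two checks above:

```python
import re


def find_matches(sequence, parts):
    """Return (part_id, direction, start, end) for every forward or
    reversed occurrence of each part in sequence."""
    coords = []
    for part_id, forward in parts.items():
        for direction, pattern in ((">", forward), ("<", forward[::-1])):
            for match in re.finditer(re.escape(pattern), sequence):
                coords.append((part_id, direction, match.start(), match.end()))
    return coords


def select_longest(coords):
    """Greedy selection: longest match first, skip anything overlapping."""
    selected = []
    for element in sorted(coords, key=lambda t: t[3] - t[2], reverse=True):
        # Half-open intervals are disjoint when one ends at or before
        # the point where the other starts
        if all(element[3] <= s[2] or element[2] >= s[3] for s in selected):
            selected.append(element)
    # Report the chosen parts in sequence order
    return sorted(selected, key=lambda t: t[2])


# Made-up parts: two long ones and a 3-letter one that sits inside the first
toy_parts = {"long1": "ATGCCTGAAC", "long2": "GGATCCAAT", "stop": "TGA"}
sequence = toy_parts["long1"] + toy_parts["long2"][::-1]

print(select_longest(find_matches(sequence, toy_parts)))
# [('long1', '>', 0, 10), ('long2', '<', 10, 19)]
```

The short `stop` part matches at positions 5 to 8 but is dropped because it overlaps the longer `long1` match, which is the behaviour described above.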

If we want to find ALL possible sets of parts that have overlap, we could take an additional step to remove the longest from the queue, and run it again. For example, for my testing case the optimal solution (and the right one) was the two part sequences that I manually put together, one forward and one reverse:

# uuid, direction, start, end
[('81a92bdc-2b71-48de-bdfc-fafcf9bf26ed', '>', 0, 1023),
 ('e7d46d00-e32e-417b-8628-0f5287d55840', '<', 1023, 1866)]

But if I were to remove the second at random and try again, I'd get the first with a bunch of tiny sequences (the TGA, AGT ones):

[('81a92bdc-2b71-48de-bdfc-fafcf9bf26ed', '>', 0, 1023),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1023, 1026),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1052, 1055),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1088, 1091),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1124, 1127),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1127, 1130),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1148, 1151),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1193, 1196),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1225, 1228),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1322, 1325),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1355, 1358),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1394, 1397),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1418, 1421),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1477, 1480),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1496, 1499),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1501, 1504),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1571, 1574),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1595, 1598),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1637, 1640),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1690, 1693),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 1753, 1756),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1760, 1763),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1778, 1781),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1832, 1835),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1841, 1844),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 1853, 1856)]

And if I were to remove the first, I'd get a different answer:

[('129e6622-1f82-4de0-a24d-427d513f005d', '<', 19, 22),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 104, 107),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 182, 185),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 205, 208),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 226, 229),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 250, 253),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 284, 287),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 310, 313),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 326, 329),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 475, 478),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 505, 508),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 515, 518),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 529, 532),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 622, 625),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 632, 635),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 715, 718),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 731, 734),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 736, 739),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 748, 751),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 752, 755),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 772, 775),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '<', 817, 820),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 919, 922),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 989, 992),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 1004, 1007),
 ('129e6622-1f82-4de0-a24d-427d513f005d', '>', 1020, 1023),
 ('e7d46d00-e32e-417b-8628-0f5287d55840', '<', 1023, 1866)]

So - intuition is telling me that the first is the "best" answer and if we start trying to account for some of these alternate solutions that combine tiny parts (length 3) the database is going to get messy. Maybe in the future there might be some way to do this query and find smaller parts in the larger ones, but for the definition of a composite part, it seems most correct to include the largest parts.

This can be run on the client, and then the part ids and directions are submitted to the server, and the server just needs to validate them, which is totally reasonable.
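What that server-side validation could look like, very roughly: check that each submitted part, in its given direction, occurs in the submitted sequence in the stated order without overlapping the previous one. This is an assumption about what "validate" would mean, not the actual server code; `validate_composite` is a hypothetical name and a dict stands in for the parts table. Gaps between parts are allowed, since unannotated regions came up earlier:

```python
def validate_composite(sequence, part_ids, direction_string, parts):
    """Return True if every part, in its given direction, occurs in
    sequence in the stated order without overlapping the previous part.

    parts: dict mapping part id -> optimized sequence (stand-in lookup).
    """
    if len(part_ids) != len(direction_string):
        return False

    cursor = 0  # the next part must start at or after this index
    for part_id, direction in zip(part_ids, direction_string):
        oriented = parts.get(part_id)
        if not oriented or direction not in (">", "<"):
            return False
        if direction == "<":
            oriented = oriented[::-1]

        # Take the leftmost occurrence at or after the cursor
        start = sequence.find(oriented, cursor)
        if start == -1:
            return False
        cursor = start + len(oriented)
    return True


toy_parts = {"a": "ATGC", "b": "GGTA"}
sequence = "ATGC" + "TT" + "ATGG"  # a forward, a gap, b reversed

print(validate_composite(sequence, ["a", "b"], "><", toy_parts))  # True
print(validate_composite(sequence, ["b", "a"], "<>", toy_parts))  # False
```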

vsoch commented 5 years ago

hey @Koeng101 this is done, as a logged in user, you can see the create endpoint here. You should first update your local freegenes-python to be version 0.0.13:

# do this until it tells you it's not installed
$ pip uninstall freegenes

Then do

$ pip install freegenes==0.0.13

and the function for the client to take a name and sequence, build a local cache, and figure out directions and parts is documented here. Also notice the linked issue above that we would eventually want unit tests for the algorithm - I thought this might be something @shea256 might want to tackle. I'm going to close the issue here, and please re-open for small tweaks / fixes to it.

shea256 commented 5 years ago

@vsoch wrote:

So - intuition is telling me that the first is the "best" answer and if we start trying to account for some of these alternate solutions that combine tiny parts (length 3) the database is going to get messy. Maybe in the future there might be some way to do this query and find smaller parts in the larger ones, but for the definition of a composite part, it seems most correct to include the largest parts.

Curious what @koeng101 thinks here but it sounds like it depends what the purpose is.

I am out of my league on the biology here but with my limited understanding...

I could imagine a composite part comprised of a coding sequence and a promoter and a terminator for example, and then another larger composite that includes these items and other additional items.

In this case it would be more useful to see all of the base level constituent parts rather than the composite parts within the higher order composite part.

The counter to this is I could imagine certain coding sequences being subsets of other coding sequences that make up entirely different proteins. Or even promoters that are subsets of other promoters.

Once again, I’m a novice on the biology here but this is just some logic I’m trying to throw against the wall and see if it has a basis in reality.

@koeng101 @vsoch thoughts? Am I making sense here?

vsoch commented 5 years ago

@shea256 yes it could definitely be the case that the list of parts includes some higher level parts, and then smaller constituent parts. The question (probably for @Koeng101) is which should be represented for the Composite Part. Or if more than one should (this gets messy fast). Regardless of this decision, the path forward is fairly logical, here is what I would do:

1. first review the algorithm with @Koeng101 - I came up with it without much biology background and a short conversation with Keoni, so it definitely could need some tweaking. For example, as we just discussed, it could be that the longest parts aren't the "best" but some other metric is.
2. work with @Koeng101 to develop a set of "gold standard" test cases. I've provided the base part data to add with the repository for this case.

Given the test cases and the list of "gold standard" criteria, this should be enough to guide development, and then verify if everything works as expected with the unit tests. If there are any changes needed to the server (e.g., if you decide to maintain all possibilities for a composite part from a new sequence, what does that mean for the post?) please outline with the changes so I can make them and test against your updated freegenes-python locally. Of course you could do this too! Thanks for your thoughts @shea256! Could you please move discussion to https://github.com/vsoch/freegenes-python/issues/8 since this is where development will be happening?

shea256 commented 5 years ago

@vsoch Thanks for the response, moving the conversation to https://github.com/vsoch/freegenes-python/issues/8.

Koeng101 commented 5 years ago

parts['129e6622-1f82-4de0-a24d-427d513f005d']['optimized_sequence']

That's a data error on my part! I'll fix that. My thought there is that we could only have sequences analyzed that are in our parts list.

So - intuition is telling me that the first is the "best" answer and if we start trying to account for some of these alternate solutions that combine tiny parts (length 3) the database is going to get messy

For example, as we just discussed, it could be that the longest parts aren't the "best" but some other metric is.

We could default to requiring that a part be over 20 base pairs to be automatically annotated. That's approximately the smallest size of relevant parts. Other than that, yea length is a good heuristic (not perfect, but good enough for this!)
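A sketch of that default as a filter applied to the cached parts before matching. The cache shape (uuid mapped to a dict with an `optimized_sequence` key) follows the `derive_parts` code earlier in the thread; `annotatable_parts` and `MIN_PART_LENGTH` are hypothetical names:

```python
MIN_PART_LENGTH = 20  # threshold suggested above, in base pairs


def annotatable_parts(parts, min_length=MIN_PART_LENGTH):
    """Keep only parts whose optimized_sequence is longer than min_length."""
    return {
        uuid: part
        for uuid, part in parts.items()
        # Parts with a missing or empty sequence count as length 0
        if len(part.get("optimized_sequence") or "") > min_length
    }


toy_cache = {
    "short": {"optimized_sequence": "TGA"},
    "long": {"optimized_sequence": "ATGC" * 6},  # 24 bp
    "empty": {"optimized_sequence": None},
}
print(sorted(annotatable_parts(toy_cache)))  # ['long']
```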

Awesome way to annotate parts btw!

(Going to vsoch/freegenes-python#8 to continue conversation)

vsoch / freegenes

CompositePart API integration #63