Open annb-lab opened 1 year ago
Hi @annb-lab, thanks for opening this. I've been experimenting to try to figure this one out. Quick question: you said
"The optimized sequence SD is incorrect"
Can you verify this for me? I am finding that the AD sequence is the incorrect one. I'm also finding that setting both weights to 2 does not solve the problem. Thanks!
Both sequences are wrong for me.
AD: ATGGGATCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGTTTAGTTCCAAGGGGTTCTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGTGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGTGCAGGTAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGTGCATCTAGTTTAGAAACTGTTAAAGTTGGAAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACTCAGAAAAATGTTTTAATTGAAGTTAATCCACAGACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAGTTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGACCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGTACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTTGTTGTTGGAGCATTCGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA
SD: ATGGGATCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGATTAGTTCCAAGGGGTTCTAGTCATATGGGATCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGAGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGTGCAGGTAATAAAATTGGAGGTAGGAGGTTAATTGTTGTTTTAGAAGGTGCATCTAGTTTAGAAACTGTTAAAGTTGGTAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGAAGGGATCCAGGTGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACTCAGAAAAATGTTTTAATTGAAGTTAATCCACAGACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAGTTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGTCCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGTACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTCGTTGTTGGAGCATTTGCACATGGAAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTCGAAGAAGTTTGGGGAGTTATTTAA
I found a reduced example:
Select E. coli and Yeast as before
Use S as the sequence
Both AD and SD give: TCTAGTTAA
And yes, setting both weights to 2 in the reduced example also doesn't fix the problem. I might have made a mistake when I debugged that earlier, since I was playing with a lot of things.
I am using the web interface, optimizing for both E. coli and Yeast, equally weighted at 1. I am getting the same issue, with the S sometimes being doubled. Both the AD and SD sequences translate incorrectly, even though the protein sequence is correct. The issue repeats itself with different sequences and occurs whether starting from a protein or a DNA input sequence.
Thanks so much for your effort in putting this together. This is such a useful and powerful tool.
Have a great day!
@malott3
Hello! Thank you for this information! I'm working hard to wrap up a project and hope to focus my efforts on this very soon. It can be tough since the code is >5 years old and was only half-written by me.
Promise to keep looking into it, and I appreciate any information you can provide!
@nleroy917
Thanks again! I wish you good luck on your project! Yeah, it sounds difficult.
Not sure if this is helpful information or not. Yesterday I tried clearing the cache and restarting Microsoft Edge, to no avail. I tried again today and it seems to be working correctly. I cannot see what is different. I will let you know if it happens again.
Have a great afternoon,
Thomas
Just an update as promised. I know this is not terribly helpful, but it seems to be time-dependent. It was working great, and then a sequence had all of the S's doubled. I waited 15 minutes, and it worked just fine. I did several other sequences without problems. Now, half an hour later, it is doubling every S again for every sequence. I still do not see a correlation with anything useful. Sorry about that. For now, I will just time it or manually remove the extra S's. Still a great tool. Thanks!
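As a stopgap, the manual clean-up described above could be scripted roughly like this. It is only a sketch and not part of the tool: `strip_extra_codons` is a hypothetical helper, the codon table is the standard genetic code, and it assumes the faulty output only ever contains extra codons (never missing or wrong ones). It keeps the first codon that matches each expected residue, so the codon choices may differ from what a correct optimization would have picked.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def strip_extra_codons(dna, protein):
    """Drop codons that do not line up with the expected protein.

    Hypothetical helper: walks the DNA codon by codon and keeps a codon
    only when it encodes the next expected residue. A stop codon found
    after the last residue is kept as well.
    """
    kept = []
    pos = 0  # index of the next expected residue in `protein`
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        aa = CODON_TABLE[codon]
        if pos < len(protein) and aa == protein[pos]:
            kept.append(codon)
            pos += 1
        elif pos == len(protein) and aa == "*":
            kept.append(codon)
            break
        # Anything else is treated as an extra codon and skipped.
    if pos != len(protein):
        raise ValueError("DNA does not contain the expected protein in order")
    return "".join(kept)


# Reduced example from this thread: input S, tool output TCTAGTTAA.
print(strip_extra_codons("TCTAGTTAA", "S"))
```

On the reduced example this drops the second serine codon and keeps the stop codon.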
Enjoy your day!
I have another observation which might help. The tool does not seem to be optimizing properly when optimizing for both Yeast and E. coli. I noticed that some amino acids were always using the same codon even when there was no obvious reason for it. I pulled the S. cerevisiae and E. coli codon tables and did a weighted average, zeroing everything <10%. The attached image shows a comparison of the codon distribution for 15 optimized genes (5,270 codons in total) using the web tool. As you can see, the distribution is not even.
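For anyone who wants to repeat this kind of check, here is a rough sketch of the comparison. Several things are assumptions on my part rather than what the tool or the comparison above actually does: the averaged table is thresholded after averaging and then renormalized, the function names are made up, and the numbers are toy placeholders, not real E. coli or S. cerevisiae codon usage.

```python
from collections import Counter


def combine_codon_tables(tables, weights, threshold=0.10):
    """Weighted average of per-amino-acid codon fractions.

    `tables` is a list of dicts mapping the synonymous codons of ONE
    amino acid to their fractions. Codons whose averaged fraction falls
    below `threshold` are zeroed, and the rest are renormalized
    (the renormalization step is an assumption).
    """
    total_weight = sum(weights)
    averaged = {
        codon: sum(w * t[codon] for t, w in zip(tables, weights)) / total_weight
        for codon in tables[0]
    }
    kept = {c: (f if f >= threshold else 0.0) for c, f in averaged.items()}
    norm = sum(kept.values())
    return {c: f / norm for c, f in kept.items()} if norm else kept


def observed_fractions(dna, codons):
    """Fraction of each listed codon among those codons in `dna` (in frame)."""
    counts = Counter(dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    total = sum(counts[c] for c in codons)
    return {c: (counts[c] / total if total else 0.0) for c in codons}


# Toy isoleucine fractions for two hypothetical organisms (placeholders).
toy_a = {"ATT": 0.50, "ATC": 0.44, "ATA": 0.06}
toy_b = {"ATT": 0.46, "ATC": 0.46, "ATA": 0.08}

expected = combine_codon_tables([toy_a, toy_b], weights=[1, 1])
observed = observed_fractions("ATTATTATC", list(expected))
for codon in expected:
    print(codon, round(expected[codon], 3), round(observed[codon], 3))
```

A large gap between the expected and observed columns for an amino acid, across many optimized genes, is the kind of skew described above.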
Thanks again for this tool. Not trying to critique, just offering feedback to help identify the possible error.
Have a great day,
Thomas
Super interesting! Thanks so much for this data.
A little lore/background: the original algorithm was written back in 2019 by my colleague Caleigh. She did an amazing job documenting it all. However, that implementation has remained largely untouched for 5 years... I've gone through it all myself, and there isn't much that appears incorrect. But it is 5-year-old Python code without type annotations, and it can be quite dense at times (three-layer nested for loops everywhere).
I'm currently trying to go through our paper and reimplement the algorithm in Rust; slowly but surely. The new Rust implementation should be safer, faster, and enable the optimization to be done in-browser.
Anyway, the fact that there is such an enrichment for a specific codon for amino acids like arginine or isoleucine makes me think that the table is being calculated incorrectly.
Output DNA sequence does not always translate to the correct input protein to be optimized.
Reproducible steps:
Click Protein
Select E. coli, and select Yeast
Paste sequence: MGSSHHHHHHSSGLVPRGSHMGSMAAPSDGFKPRERSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASLETVKVGKTYELLNCDKHKSILLKNGRDPGEARPDITHQSLLMLMDSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSVRAADGPQKLLKVIKNPVSDHFPVGCMKVGTSFSIPVVSDVRELVPSSDPIVFVVGAFAHGKVSVEYTEKMVSISNYPLSAALTCAKLTTAFEEVWGVI
Optimize!
The optimized sequence SD is incorrect:
ATGGGTTCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGATTAGTTCCAAGGGGTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGAGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGAGCAGGAAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGAGCATCTAGTTTAGAAACTGTTAAAGTTGGTAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACACAGAAAAATGTTTTAATTGAAGTTAATCCACAAACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAATTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGTCCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGAACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTCGTTGTTGGTGCATTTGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA
Which translates to:
MGSSSSHHHHHHSSSSGLVPRGSHMGSSMAAPSSDGFKPRERSSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASSLETVKVGKTYELLNCDKHKSSILLKNGRDPGEARPDITHQSSLLMLMDSSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSSVRAADGPQKLLKVIKNPVSSDHFPVGCMKVGTSSFSSIPVVSSDVRELVPSSSSDPIVFVVGAFAHGKVSSVEYTEKMVSSISSNYPLSSAALTCAKLTTAFEEVWGVI
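The translation check can be reproduced independently of the tool with a few lines of Python. This is a minimal sketch that assumes the standard genetic code; `translate` and `matches_protein` are names chosen here for illustration and are not part of the tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string; stop codons become '*'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def matches_protein(dna, protein):
    """True if the DNA encodes exactly `protein` (trailing stops ignored)."""
    return translate(dna).rstrip("*") == protein


# First seven codons of the SD output above, and the start of the input.
prefix = "ATGGGTTCTAGTTCTAGTCAT"
print(translate(prefix))
print(matches_protein(prefix, "MGSSH"))
```

Running `matches_protein` on the full optimized sequence and the pasted protein gives a quick pass/fail check for each output.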
Hopefully this helps you debug the program!
A few notes. Changing the 'weight' to 2 for each species yields the correct sequence. Choosing E. coli and Human yields the correct sequence.