nleroy917 / optipyzer

Multi-Species Codon Optimization Engine
https://optipyzer.com
Apache License 2.0

Output DNA sequence does not code for correct input protein #57

Open annb-lab opened 1 year ago

annb-lab commented 1 year ago

Output DNA sequence does not always translate to the correct input protein to be optimized.

Reproducible steps:

Click Protein

Select E. coli, and select Yeast

Paste sequence: MGSSHHHHHHSSGLVPRGSHMGSMAAPSDGFKPRERSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASLETVKVGKTYELLNCDKHKSILLKNGRDPGEARPDITHQSLLMLMDSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSVRAADGPQKLLKVIKNPVSDHFPVGCMKVGTSFSIPVVSDVRELVPSSDPIVFVVGAFAHGKVSVEYTEKMVSISNYPLSAALTCAKLTTAFEEVWGVI

Optimize!

The optimized sequence SD is incorrect:

ATGGGTTCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGATTAGTTCCAAGGGGTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGAGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGAGCAGGAAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGAGCATCTAGTTTAGAAACTGTTAAAGTTGGTAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACACAGAAAAATGTTTTAATTGAAGTTAATCCACAAACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAATTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGTCCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGAACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTCGTTGTTGGTGCATTTGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA

Which translates to:

MGSSSSHHHHHHSSSSGLVPRGSHMGSSMAAPSSDGFKPRERSSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASSLETVKVGKTYELLNCDKHKSSILLKNGRDPGEARPDITHQSSLLMLMDSSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSSVRAADGPQKLLKVIKNPVSSDHFPVGCMKVGTSSFSSIPVVSSDVRELVPSSSSDPIVFVVGAFAHGKVSSVEYTEKMVSSISSNYPLSSAALTCAKLTTAFEEVWGVI
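For anyone who wants to reproduce the check without the web tool, here is a minimal Python sketch that translates the first 36 nucleotides of the reported output with the standard genetic code and compares the result to the start of the input protein. The helper name `translate` and the variable names are my own, not part of optipyzer.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string; '*' marks a stop codon."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


# First 10 residues of the input protein and first 36 nt of the reported output.
input_prefix = "MGSSHHHHHH"
reported_prefix = "ATGGGTTCTAGTTCTAGTCATCATCATCATCATCAT"

translated = translate(reported_prefix)
print(translated)                  # MGSSSSHHHHHH
print(translated == input_prefix)  # False
```

The mismatch shows up right away: the two serines of the input are encoded by four serine codons (TCT AGT TCT AGT) in the output.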

Hopefully this helps you debug the program!

A few notes. Changing the 'weight' to 2 for each species yields the correct sequence. Choosing E. coli and Human yields the correct sequence.

nleroy917 commented 1 year ago

Hi @annb-lab thanks for opening this. I've been playing around with things trying to figure this one out. Quick question: you said

The optimized sequence SD is incorrect

Can you verify this for me? I am finding that the AD sequence is the incorrect one. I'm also finding that setting both of the weights to 2 does not solve the problem. Thanks!

annb-lab commented 1 year ago

Both sequences are wrong for me.

AD: ATGGGATCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGTTTAGTTCCAAGGGGTTCTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGTGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGTGCAGGTAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGTGCATCTAGTTTAGAAACTGTTAAAGTTGGAAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACTCAGAAAAATGTTTTAATTGAAGTTAATCCACAGACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAGTTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGACCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGTACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTTGTTGTTGGAGCATTCGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA

SD: ATGGGATCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGATTAGTTCCAAGGGGTTCTAGTCATATGGGATCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGAGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGTGCAGGTAATAAAATTGGAGGTAGGAGGTTAATTGTTGTTTTAGAAGGTGCATCTAGTTTAGAAACTGTTAAAGTTGGTAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGAAGGGATCCAGGTGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACTCAGAAAAATGTTTTAATTGAAGTTAATCCACAGACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAGTTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGTCCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGTACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTCGTTGTTGGAGCATTTGCACATGGAAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTCGAAGAAGTTTGGGGAGTTATTTAA

I found a reduced example: select E. coli and Yeast as before, and use S as the sequence. Both AD and SD give: TCTAGTTAA

And yes, setting both weights to 2 in the reduced example also doesn't fix the problem. I might have made a mistake in that debugging, but I was playing with a lot of things.
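The reduced example can also be caught with a plain length check, no codon table needed. This is only a sketch: the function names are mine, and it assumes the tool appends exactly one stop codon (the outputs in this thread all end in TAA).

```python
def split_codons(dna: str) -> list[str]:
    """Split an in-frame DNA string into consecutive triplets."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def expected_dna_length(protein: str, include_stop: bool = True) -> int:
    """One codon per residue, plus one stop codon if the tool adds it (assumption)."""
    return 3 * (len(protein) + (1 if include_stop else 0))


protein = "S"
reported = "TCTAGTTAA"  # output reported for both AD and SD

print(split_codons(reported))                       # ['TCT', 'AGT', 'TAA']
print(len(reported), expected_dna_length(protein))  # 9 6
```

A single serine should yield 6 nucleotides (one codon plus the stop), but the output has 9: two different serine codons back to back, then the stop.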

malott3 commented 3 months ago

I am using the web interface, optimizing for both E. coli and Yeast, equally weighted at 1. I am getting the same issue, with the S sometimes being doubled. Both the AD and SD sequences translate incorrectly; however, the protein sequence is correct. The issue repeats itself with different sequences and occurs whether starting from a protein or a DNA input sequence.

Thanks so much for your effort in putting this together. This is such a useful and powerful tool.

Have a great day!

nleroy917 commented 3 months ago

@malott3

Hello! Thank you for this information! I'm working hard to wrap up a project and hope to focus my efforts on this very soon. It can be tough since the code is >5 years old and was only half-written by me 😭

Promise to keep looking into it, and I appreciate any information you can provide!

malott3 commented 3 months ago

@nleroy917

Thanks again! I wish you good luck on your project! Yeah, it sounds difficult.

Not sure if this is helpful information or not. Yesterday I tried clearing the cache and restarting Microsoft Edge, to no avail. I tried again today and it seems to be working correctly. I cannot see a difference. I will let you know if it happens again.

Have a great afternoon,

Thomas

malott3 commented 3 months ago

Just an update as promised. I know this is not terribly helpful, but it seems time-dependent. It was working great, and then a sequence had all of its S's doubled. I waited 15 minutes, and it worked just fine. I did several other sequences just fine. Now, half an hour later, it is doubling every S again for every sequence. I still do not see a correlation to anything useful. Sorry about that. For now, I will just time it or manually remove the extra S's. Still a great tool. Thanks!

Enjoy your day!

malott3 commented 2 months ago

I have another observation which might help. The tool does not seem to be optimizing properly when optimizing for both Yeast and E. coli. I noticed that some amino acids were always using the same codon even if there was no obvious reason. I pulled the S. cerevisiae and E. coli codon tables and did a weighted average, zeroing everything <10%. The attached image shows a comparison of the codon distribution for 15 optimized genes (a total of 5,270 codons) using the web tool. As you can see, the distribution is not even.

[Image: Codon distribution]
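A rough Python sketch of the expected-distribution calculation described above, for one amino acid at a time. Several things here are assumptions rather than anything taken from optipyzer or from the actual spreadsheet: the function name `combined_usage`, the order of operations (average first, then drop codons below the cutoff, then renormalize), and the frequencies, which are round placeholder numbers and not real E. coli or S. cerevisiae values.

```python
def combined_usage(tables, weights, cutoff=0.10):
    """Weighted average of per-species codon fractions for ONE amino acid.

    tables  : list of {codon: fraction} dicts, one per species
    weights : one weight per species
    cutoff  : averaged fractions below this are dropped (assumed order of steps)
    """
    total_weight = sum(weights)
    codons = set().union(*tables)
    averaged = {
        c: sum(w * t.get(c, 0.0) for t, w in zip(tables, weights)) / total_weight
        for c in codons
    }
    kept = {c: f for c, f in averaged.items() if f >= cutoff}
    norm = sum(kept.values())
    if norm == 0:
        raise ValueError("every codon fell below the cutoff")
    # Renormalize so the surviving codons sum to 1.
    return {c: f / norm for c, f in kept.items()}


# Placeholder fractions for the three isoleucine codons (NOT real data).
species_a = {"ATT": 0.50, "ATC": 0.45, "ATA": 0.05}
species_b = {"ATT": 0.50, "ATC": 0.40, "ATA": 0.10}

expected = combined_usage([species_a, species_b], weights=[1, 1])
print(sorted(expected))  # ['ATC', 'ATT']  (ATA averages 0.075 and is dropped)
```

If the optimizer sampled codons from a table like `expected`, both surviving codons should show up in roughly these proportions across many genes, rather than one codon being used every time.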

Thanks again for this tool. Not trying to critique, just offering feedback to help identify the possible error.

Have a great day,

Thomas

nleroy917 commented 2 months ago

Super interesting! Thanks so much for this data.

A little lore/background: the original algorithm was written back in 2019 by my colleague Caleigh. She did an amazing job documenting it all. However, that implementation has remained largely untouched for 5 years... I've gone through it all myself, and there isn't much that appears incorrect. But it is 5-year-old Python without type annotations, and it can be quite dense at times (three-layer nested for loops everywhere).

I'm currently trying to go through our paper and reimplement the algorithm in Rust; slowly but surely. The new Rust implementation should be safer, faster, and enable the optimization to be done in-browser.

Anyways, the fact that there is such an enrichment for a specific codon for amino acids like arginine or isoleucine makes me think that the table is being calculated incorrectly.
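To put numbers on that kind of enrichment, one option is to tally which codon is used for each amino acid in the tool's output. The sketch below is not optipyzer code; `codon_usage` is a hypothetical helper, and the fragment is just the first 63 nucleotides of the SD sequence from the original report, far too short to draw conclusions from on its own.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_usage(dna: str) -> dict:
    """Count codon occurrences per amino acid in an in-frame DNA string."""
    dna = dna.upper()
    usage = defaultdict(Counter)
    # Ignore any trailing partial codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        usage[CODON_TABLE[codon]][codon] += 1
    return usage


# First 21 codons of the SD output from the original report.
fragment = (
    "ATGGGTTCTAGTTCTAGT"
    "CATCATCATCATCATCAT"
    "TCTAGTTCTAGT"
    "GGATTAGTTCCAAGG"
)

usage = codon_usage(fragment)
for aa in sorted(usage):
    print(aa, dict(usage[aa]))
```

Run over a full set of optimized genes, a table like this makes it easy to see amino acids where one codon takes nearly all of the counts. In this fragment, all six histidines use CAT and the serines split evenly between TCT and AGT.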