mt1022 / cubar

R Package for Codon Usage Bias Analysis
https://mt1022.github.io/cubar/

Codon optimisation #7

Closed maltesemike closed 3 months ago

maltesemike commented 5 months ago

Thank you for the great package and easy to follow instructions.

I've managed to run through the tutorial with my own non-model genome now.

Is there a tool to automatically codon-optimise a desired sequence based on the optimal codons cubar calculates?

mt1022 commented 5 months ago

Currently, cubar has no option for codon optimization. Do you want to replace each codon in a CDS with the optimal synonymous one? I could consider adding such a function.

However, I have to mention that making every codon optimal does not mean the whole CDS is optimal. Some non-optimal codons are used on purpose, for example, to slow elongation and allow correct co-translational folding of the nascent peptide chain.

maltesemike commented 4 months ago

That was my plan, yes. I am trying to introduce a fluorescent protein transgene into our model system and my fear is that codon usage might differ, so I was hoping to optimise it to suit our system as best as possible.

A tool to optimise codon usage would be really useful, although I had not considered your point about slowing the rate of elongation. This also makes sense. Is there a way to account for this too?

I am guessing that improving the CAI of a particular CDS to match that of the most highly expressed genes in the genome would be a good start already.
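For readers unfamiliar with the metric, CAI is the geometric mean of per-codon weights, where each weight is a codon's count in a reference gene set divided by the count of the most frequent synonymous codon. The Python sketch below illustrates that definition only; it is not cubar's R implementation. The function names, the 0.5 pseudocount for unobserved codons, and the choice to skip single-codon amino acids (Met, Trp) are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(cds):
    """Split a CDS into complete codons, dropping any trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_cds, pseudocount=0.5):
    """Weight of each codon: its count over the top count in its synonymous family."""
    counts = Counter(c for cds in reference_cds for c in split_codons(cds))
    weights = {}
    for aa in set(AMINO_ACIDS) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        family_max = max(counts[c] for c in family)
        # Skip single-codon families and families absent from the reference set.
        if len(family) < 2 or family_max == 0:
            continue
        for codon in family:
            # Assumed pseudocount keeps unobserved codons from giving log(0).
            weights[codon] = max(counts[codon], pseudocount) / family_max
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of all scored codons in the CDS."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

Here `reference_cds` would be the coding sequences of the highly expressed genes, so a transgene whose CAI approaches 1 uses mostly the codons those genes prefer.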

mt1022 commented 4 months ago

Hi, I added a new function called codon_optimize, which replaces each codon with its optimal counterpart. Please try it out, and any feedback would be greatly appreciated.
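As a rough illustration of what "replacing each codon with its optimal counterpart" means, here is a minimal Python sketch. It is not the code behind `codon_optimize`; the function names and the `optimal` mapping (amino acid to preferred codon) are hypothetical, and leaving stop codons and unrecognised codons unchanged is a choice made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a CDS; codons not in the table (e.g. containing N) become 'X'."""
    cds = cds.upper()
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3))


def optimize_cds(cds, optimal):
    """Swap every codon for the preferred synonymous codon of its amino acid.

    `optimal` maps one-letter amino acids to codons. Codons whose amino acid
    has no entry (including stop codons here) are copied through unchanged.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        aa = CODON_TABLE.get(codon)  # None for ambiguous codons
        out.append(optimal.get(aa, codon))
    return "".join(out)


# Hypothetical preferences for two amino acids only.
optimal = {"A": "GCC", "K": "AAG"}
optimized = optimize_cds("GCTAAAGCGTAA", optimal)
# The protein sequence must be identical before and after.
assert translate(optimized) == translate("GCTAAAGCGTAA")
```

Because only synonymous codons are exchanged, checking that the translation is unchanged is a cheap sanity test for any optimised sequence.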

> A tool to optimise codon usage would be really useful, although I had not considered your point about slowing the rate of elongation. This also makes sense. Is there a way to account for this too?

I am afraid there is no simple rule for this kind of optimization.

maltesemike commented 4 months ago

Thanks for adding this feature, it is extremely useful! A few things I've noticed:

  1. The optimised CDS does not have the original stop codon appended, but seems to add an "NA" dinucleotide instead.

  2. What rule is used to optimise codons? After inspecting my optimised sequence, I clearly see a large increase in CAI, to a value on the right-hand side of the CAI bell curve for highly expressed genes. On closer inspection, however, the codons are not always changed to the best one from the differential usage analysis (i.e. based on the HEG vs LEG analysis). In some cases, they are changed to a more "poorly" used codon (i.e. OR value less than 1). I am guessing this is by design? What is the rule for the change?

mt1022 commented 4 months ago

Hi, thanks for the feedback.

  1. The optimization function does not optimize stop codons, which should be removed beforehand. I should have included this in the documentation.
  2. Not by design :( Currently, the optimal codons are determined with a binomial regression described here. Basically, it finds the codon that is most likely to be used in genes with high codon usage bias (i.e., low ENC). Thus, optimal codons determined this way might differ from those preferentially used in highly expressed genes, although I guess the two are consistent in most cases. After the optimal codons are obtained, each codon is replaced with the optimal one of the corresponding codon family.
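Point 1 can be handled outside the optimizer by setting the terminal stop codon aside and re-attaching it afterwards. The Python sketch below shows the idea under the assumption of the standard genetic code; the helper names are hypothetical and `optimizer` stands for any function that takes and returns a stop-free CDS string.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_terminal_stop(cds):
    """Return (body, stop): the CDS without its final stop codon, and that stop.

    If the sequence is not in frame or does not end in a stop codon,
    the whole sequence is returned as the body and the stop is "".
    """
    cds = cds.upper()
    if len(cds) % 3 == 0 and cds[-3:] in STOP_CODONS:
        return cds[:-3], cds[-3:]
    return cds, ""


def optimize_keeping_stop(cds, optimizer):
    """Optimize only the coding body, then put the original stop codon back."""
    body, stop = split_terminal_stop(cds)
    return optimizer(body) + stop
```

This keeps the original stop codon in the final construct while ensuring the optimizer never sees a codon it cannot map to an amino acid.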

Besides, I am considering updating this function so that users can determine optimal codons by gene expression or provide a predefined list of optimal codons. The current function is useful when there is no genome-wide expression data.
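One simple way to derive optimal codons from expression data, sketched here in Python purely as an illustration, is to take the most frequent synonymous codon among a set of highly expressed genes. This is neither the binomial regression mentioned above nor a planned cubar interface; the function name and the tie-breaking rule (first codon in table order wins) are assumptions of this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def optimal_codons_from_reference(reference_cds):
    """Map each amino acid to its most frequent codon in the reference CDSs.

    Amino acids that never occur in the reference set get no entry.
    """
    counts = Counter()
    for cds in reference_cds:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            counts[cds[i:i + 3]] += 1
    best = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*" or counts[codon] == 0:
            continue
        # Strict '>' keeps the first codon in table order when counts tie.
        if aa not in best or counts[codon] > counts[best[aa]]:
            best[aa] = codon
    return best
```

The resulting dictionary has the same shape as a hand-written list of preferred codons, so either source could feed the same replacement step.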