nleroy917 / optipyzer

Multi-Species Codon Optimization Engine
https://optipyzer.com
Apache License 2.0
23 stars 5 forks

codon_usage.csv: Where does the data come from? #63

Closed kuainaiyang closed 1 month ago

kuainaiyang commented 1 month ago

I have just started learning biology. I looked at the data in codon_usage.csv for org_id=122563 and compared it with this table: https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606 The two do not match: for GGG, that table gives 16.5 (669768), while codon_usage.csv gives 1204747. Did I misunderstand something?

nleroy917 commented 1 month ago

Hi @kuainaiyang, this is a good question. You're definitely correct that those two numbers are different... They probably don't match because we are using two different sources of information for codon usage stats. The database you linked looks like it contains data from 2007. Our codon usage tables come from a 2019 paper. In addition, those are raw counts, but what really matters is the ratio among the codons that encode the same residue. So the total count shouldn't matter as long as the distribution remains nearly the same.

Moreover, you're looking at Glycine here with GGG, and if you crunch the numbers for Glycine as you've pointed out:

669768 / (669768 + 669873 + 903565 + 437126)

You'll get a preference of ~25.0%. Using the numbers from our database, you'll get:

1204747 / (1204747 + 1341018 + 1558367 + 846367) = 1204747 / 4950499

which is ~24.3%. So, pretty darn close...
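
If you want to reproduce the comparison yourself, here is a minimal Python sketch. It assumes four counts per source, one for each of the four glycine codons, with GGG listed first; the function name `codon_preference` and the list names are made up for illustration and are not part of optipyzer.

```python
def codon_preference(count, synonymous_counts):
    """Return the share of an amino acid's codons that use one codon.

    count             -- raw count for the codon of interest
    synonymous_counts -- raw counts for every codon of that amino acid,
                         including the codon of interest
    """
    total = sum(synonymous_counts)
    if total == 0:
        raise ValueError("no codon counts for this amino acid")
    return count / total


# Glycine counts quoted in this thread. GGG is the first entry in each
# list; the other three entries are the remaining glycine codons.
kazusa_gly = [669768, 669873, 903565, 437126]
optipyzer_gly = [1204747, 1341018, 1558367, 846367]

kazusa_ggg = codon_preference(kazusa_gly[0], kazusa_gly)
optipyzer_ggg = codon_preference(optipyzer_gly[0], optipyzer_gly)

print(f"Kazusa GGG preference:    {kazusa_ggg:.1%}")
print(f"optipyzer GGG preference: {optipyzer_ggg:.1%}")
print(f"difference:               {abs(kazusa_ggg - optipyzer_ggg):.1%}")
```

Dividing by the per-residue total is what makes the two sources comparable even though their raw counts differ by almost a factor of two.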

So, I think it's just different sources of codon usage, but the fact that they are so similar gives me confidence that they are both reasonably correct.