jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0
357 stars 78 forks source link

TPM and RPKM calculation inconsistent #640

Closed mkariush closed 1 year ago

mkariush commented 1 year ago

Hi, I have a question concerning the way TPM and RPKM is calculated. I can reproduce TPM and RPKM values in the 13.projectname.orftable for each ORF. Then, when I take the 12.projectname.kegg.funcover I can reproduce the RPKM but not TPM. Moreover, when I create SQM object, TPM in SQMobject$functions$kegg$tpm is different from that in 12.projectname.kegg.funcover. Can you please give me some advice on this? Thanks! Mariusz

fpusan commented 1 year ago

The TPMs present in the 12.projectname.kegg.funcover table and in the SQMtools object are indeed calculated slightly differently, as follows:

So that hopefully answers part of the question. This is mentioned in passing in the PDF manual, in the section for the sqm2tables.py script, and also in #115, but the full story was undocumented I think.

However it seems weird that you could reproduce RPKMs and TPMs in table 13, and RPKMs in table 12... but not TPMs in table 12. Better to look into it. Can you share your code?

mkariush commented 1 year ago

Hi, thank for the quick response and clarifying the calculations but I have to digest it. In the meantime this is how I calculate TPM and RPKM in tables 13 and 12. Here there are the first three rows of my 13.projectname.orftable (with some columns deleted for the sake of presentation). Now, for the TPM first I'm doing a per kilobase transformation : Raw read count divided by Length NT (in kilobases) and then a per milion transformation, which is the result of a per kilobase transformation 1,000,000 divided by the sum of per kilobase transformations for all ORFs (2810884.359). So 554/1.0411,000,000/2810884.359=189.33

ORF ID | Contig ID | Molecule | Method | Length NT | Length AA | GC perc | Gene name | KEGG ID | COG ID | TPM PL3 | Coverage PL3 | Raw read count PL3 | Raw base count PL3 -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- canu_2_1792294-1793334 | canu_2 | CDS | Prodigal | 1041 | 347 | 57.4 | adh | K00001* | COG1063* | 189.33 | 501.163 | 554 | 521711 canu_2_589954-590454 | canu_2 | CDS | Prodigal | 501 | 167 | 54.5 | adh | K00001* | COG1063* | 249.25 | 327.974 | 351 | 164315 canu_2_590594-591004 | canu_2 | CDS | Prodigal | 411 | 137 | 50.6 | adh | K00001* | COG1063* | 298.63 | 328.883 | 345 | 135171

Here, the first three rows of the 12.projectname.kegg.funcover (some columns deleted). Since in the above table, the first three ORFs were annotated with K00001, the Total length, Total read and Total bases are the sum of theses rows. And I can reproduce RPKM: 1250 (Total reads)*1000000/1167199 (the sum of all Raw read count from the table 13)/1.953(Total length in kilobases)=548.36. Now when I use the same method as described above for the TPM I get 794.36, not 1303.26. That is 1250/1.953*1000000/805730.9508 (the sum of per kilobase transformations for all ORFs).

kegg ID | Sample | Copy number | Total length | Total reads | Total bases | RPKM | Coverage | TPM -- | -- | -- | -- | -- | -- | -- | -- | -- K00001 | PL3 | 3 | 1953 | 1250 | 821197 | 548.356 | 420.48 | 1303.26 K00002 | PL3 | 2 | 2127 | 892 | 864914 | 359.296 | 406.64 | 569.284 K00003 | PL3 | 4 | 1911 | 1567 | 716278 | 702.528 | 374.82 | 2226.231
mkariush commented 1 year ago

Hi fpusan, thanks for all the information. I was very helpful. And when it comes to TPM calculation everything is fine. I have just realised (by looking at the 12.funcover.pl script) that the total length is normalised to the copy number. After that I can reproduce it nicely.

fpusan commented 1 year ago

Ah yes, we use the average length of a feature (e.g. a COG) in our dataset as the basis for RPKM/TPM normalization. Glad everything is consistent now!