TPM and RPKM calculation inconsistent

mkariush commented 1 year ago

Hi, I have a question concerning the way TPM and RPKM is calculated. I can reproduce TPM and RPKM values in the 13.projectname.orftable for each ORF. Then, when I take the 12.projectname.kegg.funcover I can reproduce the RPKM but not TPM. Moreover, when I create SQM object, TPM in SQMobject$functions$kegg$tpm is different from that in 12.projectname.kegg.funcover. Can you please give me some advice on this? Thanks! Mariusz

fpusan commented 1 year ago

The TPMs present in the 12.projectname.kegg.funcover table and in the SQMtools object are indeed calculated slightly differently, as follows:

It so happens that some genes are annotated with more than one KEGG/COG. This is not because they have different hits in the reference database, but because a single protein in the KEGG or COG databases can actually be annotated with more than one KEGG/COG (e.g the annotation would be something like COG0001;COG0002).
So when we are calculating aggregated counts by function and see a read mapping to an ORF annotated with "COG0001;COG0002", we have three options. 1) We can treat COG0001;COG0002 as an entity independent from COG0001 and COG0002, and assign a read to it. The counts assigned are thus COG0001;COG0002=1 COG0001 = 0 COG0002 = 0. This would however not be consistent with COG, which treats them as two different entities (even if they can co-occur in the same protein). 2) We can add one count to COG0001 and another to COG0002. This can be argued to be true, since we indeed saw both COGs once. But comes with the problem that from a single read mapping to a single ORF, we have now added two counts to our count table. 3) We can add half a count to COG0001 and another half to COG0002. This is weird since it involves fractional counts, which are not a real thing. But has the advantage of preserving the total number of counts in the table so it matches the total number of mapped reads.
The main SqueezeMeta pipeline uses option 2), but when developing SQMtools we became aware that count tables from different origins (contigs, orfs, bins, taxonomic aggregates, functional aggregates) were going to be integrated into the same object and explored at the same time. We thought that option 3), which results in all tables having the same number of counts per sample, felt more consistent.

So that hopefully answers part of the question. This is mentioned in passing in the PDF manual, in the section for the sqm2tables.py script, and also in #115, but the full story was undocumented I think.

However it seems weird that you could reproduce RPKMs and TPMs in table 13, and RPKMs in table 12... but not TPMs in table 12. Better to look into it. Can you share your code?

mkariush commented 1 year ago

Hi, thank for the quick response and clarifying the calculations but I have to digest it. In the meantime this is how I calculate TPM and RPKM in tables 13 and 12. Here there are the first three rows of my 13.projectname.orftable (with some columns deleted for the sake of presentation). Now, for the TPM first I'm doing a per kilobase transformation : Raw read count divided by Length NT (in kilobases) and then a per milion transformation, which is the result of a per kilobase transformation 1,000,000 divided by the sum of per kilobase transformations for all ORFs (2810884.359). So 554/1.0411,000,000/2810884.359=189.33

ORF ID | Contig ID | Molecule | Method | Length NT | Length AA | GC perc | Gene name | KEGG ID | COG ID | TPM PL3 | Coverage PL3 | Raw read count PL3 | Raw base count PL3 -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- canu_2_1792294-1793334 | canu_2 | CDS | Prodigal | 1041 | 347 | 57.4 | adh | K00001* | COG1063* | 189.33 | 501.163 | 554 | 521711 canu_2_589954-590454 | canu_2 | CDS | Prodigal | 501 | 167 | 54.5 | adh | K00001* | COG1063* | 249.25 | 327.974 | 351 | 164315 canu_2_590594-591004 | canu_2 | CDS | Prodigal | 411 | 137 | 50.6 | adh | K00001* | COG1063* | 298.63 | 328.883 | 345 | 135171

Here, the first three rows of the 12.projectname.kegg.funcover (some columns deleted). Since in the above table, the first three ORFs were annotated with K00001, the Total length, Total read and Total bases are the sum of theses rows. And I can reproduce RPKM: 1250 (Total reads)*1000000/1167199 (the sum of all Raw read count from the table 13)/1.953(Total length in kilobases)=548.36. Now when I use the same method as described above for the TPM I get 794.36, not 1303.26. That is 1250/1.953*1000000/805730.9508 (the sum of per kilobase transformations for all ORFs).

kegg ID | Sample | Copy number | Total length | Total reads | Total bases | RPKM | Coverage | TPM -- | -- | -- | -- | -- | -- | -- | -- | -- K00001 | PL3 | 3 | 1953 | 1250 | 821197 | 548.356 | 420.48 | 1303.26 K00002 | PL3 | 2 | 2127 | 892 | 864914 | 359.296 | 406.64 | 569.284 K00003 | PL3 | 4 | 1911 | 1567 | 716278 | 702.528 | 374.82 | 2226.231

mkariush commented 1 year ago

Hi fpusan, thanks for all the information. I was very helpful. And when it comes to TPM calculation everything is fine. I have just realised (by looking at the 12.funcover.pl script) that the total length is normalised to the copy number. After that I can reproduce it nicely.

fpusan commented 1 year ago

Ah yes, we use the average length of a feature (e.g. a COG) in our dataset as the basis for RPKM/TPM normalization. Glad everything is consistent now!

jtamames / SqueezeMeta

TPM and RPKM calculation inconsistent #640