jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Calculating the RPKM value for species and phylum? #726

Closed by lisiruisusan 1 year ago

lisiruisusan commented 1 year ago

Dear developers,

Step 11 gives the length and read counts for a given species or phylum, and step 12 gives the RPKM values for annotated genes. How can I calculate RPKM values for species and phyla?

Yours sincerely,

Lisirui

September 9th 2023

fpusan commented 1 year ago

You can't really do it easily. The thing about RPKM or TPM is that you need an estimate of feature length. This is easy to do for genes/functions (e.g. how long is this gene, or what's the average/median length of all the genes belonging to a certain function). But it is not so straightforward for taxa. E.g. you could download all the genomes from a given phylum, calculate their average or median length, and use that for normalization during the RPKM/TPM calculation. But how meaningful is that, really?

It is a bit easier for bins, since at least they are concrete features with a defined length. SQMtools now tracks the coverage per million reads for the different bins, which gives you information roughly similar to RPKM. Otherwise you can just use the percentage of reads mapping to the different phyla or species.
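To make the difference between the two normalizations concrete, here is a rough sketch in plain Python. It is not SqueezeMeta or SQMtools code: the function names `rpkm` and `percent_of_reads` and all the numbers are made up for illustration, and RPKM is assumed to follow the usual definition (reads × 10^9 / (feature length in bp × total mapped reads)).

```python
def rpkm(reads, length_bp, total_reads):
    """Reads Per Kilobase per Million mapped reads for a single feature.

    Requires a feature length, which is well defined for a gene or a bin
    but not for a taxon such as a phylum.
    """
    if length_bp <= 0 or total_reads <= 0:
        raise ValueError("length_bp and total_reads must be positive")
    return reads * 1e9 / (length_bp * total_reads)


def percent_of_reads(read_counts):
    """Percentage of mapped reads assigned to each taxon (no length needed)."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {taxon: 100.0 * n / total for taxon, n in read_counts.items()}


# Hypothetical values, for illustration only.
# A 2,000 bp gene with 500 reads in a sample of 10 million mapped reads.
gene_rpkm = rpkm(reads=500, length_bp=2000, total_reads=10_000_000)

# Hypothetical per-phylum read counts, e.g. taken from a taxonomy table.
phylum_reads = {
    "Proteobacteria": 600_000,
    "Firmicutes": 300_000,
    "Bacteroidota": 100_000,
}
phylum_percent = percent_of_reads(phylum_reads)

print(gene_rpkm)
print(phylum_percent)
```

The first function cannot be applied to a phylum without inventing a length for it, which is the problem described above; the second only needs the read counts per taxon.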

lisiruisusan commented 1 year ago

Thanks so much for the reply.