FPKM values don't sum up to 1mio

MassaBob commented 5 years ago

Hi, I am struggling a little with the FPKM output of StringTie (v1.3.5). From my understanding, which is supported by your response to issue #15 (https://github.com/gpertea/stringtie/issues/15), the sequencing depth normalization is done by dividing by the number of reads that were mapped and assigned to the provided transcript model (in -e mode). Thus, all FPKM values should sum to 1,000,000, regardless of whether some mapped reads could not be assigned to the transcripts provided with -G (btw: the TPM values do indeed sum to 1 mio). I first stumbled across this with a poorly annotated organism (sums between 200,000 and 800,000). Running StringTie transcript assembly and using the resulting GTF for the -e analysis improves this a little by bringing the samples closer together (800,000 to 1.1 mio). Now I have encountered the same thing with a human sample set (sums between 400,000 and 550,000), although the human GTF should be quite complete (I used the GENCODE set for hg38). Do you have an explanation or solution? Is it reasonable to normalize every FPKM again by dividing it by the sum of FPKMs for that sample and multiplying by 1,000,000? Thanks for your help!

sklages commented 5 years ago

Interesting. No comments on this? I would also expect the sum to be 1mio...

MassaBob commented 5 years ago

Any ideas on that?

KC-Lan commented 5 years ago

Hi MassaBob, I'm not sure if this answers your question, but the sum of all gene FPKM/RPKM values in one sample does not need to be 1 million. TPM is the one that always sums to 1 million.

You can check the original formula of FPKM/RPKM and TPM calculation. (Video: "RPKM, FPKM and TPM, clearly explained" https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/)
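To make the difference concrete, here is a minimal sketch of the textbook definitions, FPKM_i = 10^9 * c_i / (l_i * N) and TPM_i = 10^6 * (c_i / l_i) / sum_j (c_j / l_j). The counts and lengths are made-up toy numbers, the function names are hypothetical, and this is not StringTie's actual code (StringTie may estimate coverage and effective length differently).

```python
def fpkm(counts, lengths, total_fragments=None):
    """Textbook FPKM: 1e9 * c_i / (l_i * N).

    N defaults to the sum of the assigned fragment counts (an assumption
    made for this sketch, not a statement about StringTie).
    """
    n = sum(counts) if total_fragments is None else total_fragments
    return [1e9 * c / (l * n) for c, l in zip(counts, lengths)]


def tpm(counts, lengths):
    """Textbook TPM: length-normalize first, then scale the rates to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Toy data (made up): fragment counts and transcript lengths in bases.
counts = [100, 500, 400]
lengths = [500, 2000, 4000]

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)

# Rescaling FPKM so that it sums to 1e6 reproduces TPM.
rescaled = [1e6 * f / sum(fpkm_values) for f in fpkm_values]

print(sum(fpkm_values))  # about 550,000, not 1e6
print(sum(tpm_values))   # about 1e6
```

With these definitions the FPKM sum equals 10^9 * sum_i (c_i / l_i) / N, so it only reaches 1 million when the count-weighted transcript lengths happen to average out to 1 kb. Dividing each FPKM by the per-sample FPKM sum and multiplying by 1,000,000 gives exactly the TPM values.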

One more thing to note is that FPKM/RPKM may not be a good method for estimating gene expression. See: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics. DOI: 10.1093/bib/bbs046.

"The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis."

"Only the DESeq and TMM normalization methods are robust to the presence of different library sizes and widely different library compositions, both of which are typical of real RNA-seq data."

Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences. (2012). DOI: 10.1007/s12064-012-0162-3

MassaBob commented 5 years ago

Thanks for the links. According to https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/, the difference might come from which total number of reads/fragments is used for normalisation. TPM uses only fragments that have been assigned to genes/transcripts (for the RPK calculation). For FPKM, reads that were not assigned to genes may also be counted in the total, so summing the FPKM values does not give 1 mio. Unfortunately, this is not clearly stated in the StringTie manual, or at least I cannot find it.
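The effect of the denominator can be sketched with toy numbers. All transcripts are given a length of 1 kb so that only the choice of N matters. Whether StringTie actually counts unassigned fragments in N is an assumption here and is not confirmed anywhere in this thread; the function name and the data are hypothetical.

```python
def fpkm_sum(counts, lengths, total_fragments):
    """Sum of textbook FPKM values, 1e9 * c_i / (l_i * N), for a given N."""
    return sum(1e9 * c / (l * total_fragments)
               for c, l in zip(counts, lengths))


# Toy data (made up): every transcript is exactly 1 kb long.
counts = [100, 500, 400]
lengths = [1000, 1000, 1000]

assigned = sum(counts)   # fragments assigned to annotated transcripts
unassigned = 1000        # hypothetical mapped-but-unassigned fragments

# N = assigned fragments only: the sum is about 1e6 for 1 kb transcripts.
sum_assigned_only = fpkm_sum(counts, lengths, assigned)

# N = all mapped fragments: the sum shrinks by assigned / (assigned + unassigned).
sum_all_mapped = fpkm_sum(counts, lengths, assigned + unassigned)

print(sum_assigned_only, sum_all_mapped)
```

Under this assumption, a sample in which half of the mapped fragments fall outside the annotation would show an FPKM sum of roughly half of what the assigned-only denominator gives, which would fit the lower sums seen with a poorly annotated organism. The transcript length distribution shifts the sum as well, so both effects can be present at once.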

gpertea / stringtie

FPKM values don't sum up to 1mio #213