jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0
357 stars 78 forks source link

Bin TPM #639

Closed sheaster closed 1 year ago

sheaster commented 1 year ago

Hi There, thanks again for this great set of tools!

I think I may be misunderstanding the output in some of the results files.

I naively assumed that if you read in all Contigs IDs for each bin (from the fasta or from the contigfile) and merge that with the orftable, the sum of TPMs for all orftable ORFs that match to all Contig IDs for a bin in one sample would equal the TPM reported for the specified bin and sample in the bintable.

However, I find different sums for a sample's bins in the bintable TPM, contigtable TPM, and orftable TPM.

I've attached an example for a sample and a bin.

Am I misinterpreting the TPM in each of these files? Should they not be equivalent?

Cheers and thanks squeezemeta_compare_bin_sums.pdf !

sheaster commented 1 year ago

hi again, just wondering if I'm making a simple error and trying to compare completely different TPM totals or something like that?

fpusan commented 1 year ago

Hi! Sorry I missed this The TPM of a bin is not the sum of the TPMs of its constituent contigs (same as the TPM of a contig is not the sum of the TPMs of its constituent ORFs) . Bin TPMs are calculated independently, ignoring the reads not assigned to any bins. If you want to track the abundance of bins across samples, I would rather recommend using coverages per million, accesible through SQMtools.

sheaster commented 1 year ago

No worries! I see. Typically, within one sample, I would look to sum up the >0 TPMs of ORFs within a bin to determine total bin TPM (thus necessitating the matching of bins to contigs to append a bin column to the orftable).

This is because I am interested in relative TPM of specific ORFs within a bin in a sample.

Is there a reason this way of calculating bin TPM (using the orftable) invalid?

also, is the difference between the bin/contig/orf TPMs mentioned in the SqueezeMetaManual? I looked for it but couldn't find it

thanks again

fpusan commented 1 year ago

This is not mentioned in the manual, as we follow the standard definition of TPM. However we elaborate on this in the paper for SQMtools

The TPM (transcripts per million) metric was introduced by Wagner et al. [14] as an improved way to account for gene length and sequencing depth in transcriptomic experiments: we find it equally useful in metagenomics. The TPM of a feature (be it a transcript, a gene or a functional category) is the number of times that we would find that feature when randomly sampling 1 million features, given the abundances of the different features in our sample.

Simply adding the TPMs from the ORF table is not correct, according to the definition of TPM you need to calculate it independently for the different features (orfs, contigs, functions, bins...). For starters, you are not considering the reads that map to non-coding regions of your assembly. Further, by simply adding ORF TPMs you are weighting every ORF equally, instead of giving more weight to longer ORFs.

If you want to know the relative "contribution" of a certain ORF within a bin in a sample, I would use copy numbers instead. Again from the SQMtools paper.

As an alternative to TPM, we also provide copy numbers, calculated as the ratio between the coverage of function of interest and the coverage of the RecA/RadA recombinase universal single-copy gene. [...] Details on the interpretation of those values in the different subsets are shown in ​Table1.

Also check Table 1 in the paper for details interpretations.

So you would need to do

bin = subsetBins(SQM, 'bin_of_interest')
orf.cov = bin$orfs$cov['gene_of_interest',]
RecA.cov = bin$functions$COG$cov[RecA,]
orf.cn = orf.cov / RecA.cov

orf.cn would be a vector with the copy number of that ORF in that bin in the different samples. It would represent how many times the ORF appear per genome of the bin.

If you are interested in the copy_number of a particular function, rather than a single ORF, then this calculation will have been done automatically for you, and the result will be present in bin$functions$COG$copy_number. Just make sure that you are not getting the message RecA is not present in this subset. Will not rescale copy numbers, otherwise you'll have to recalculate copy numbers manually. Check also the SQMtools documentation with ?USiCGs, ?MGOGs and ?MGKOs for more advanced ways of calculating copy numbers.