jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Normalization of abundances #859

Open inej90 opened 4 days ago

inej90 commented 4 days ago

Dear all,

I have generated tables from two analysis modes (sqm_longreads and co-assembly). I have some queries regarding the KO.abund.tsv file:

Are the tables normalized? If they are not normalized, once in R, which method should I use to normalize my data? (I currently use Z-score and I am unsure if this is the correct method or if another method should be applied.)

Thank you in advance.

Ibtissam

fpusan commented 1 day ago

Hi! These files contain raw abundances and are thus not normalized. The co-assembly will have additional results with different normalization methods. The best way to normalize depends on what you want to do. Several statistical packages (e.g. DESeq2 for differential abundance analysis) expect you to provide the raw abundances, since they do their own normalization. For ordinations I've recently been exploring gemelli (https://github.com/biocore/gemelli), which takes raw abundances, normalizes them with a robust CLR transform and then performs a PCA. For general plotting of abundances I would just use percentages. When working with functions I normally try to use copy numbers (those will not be available when analyzing individual reads). Z-scores could also be valid, particularly for visualization purposes, but it may be tricky to run statistics on them. In general, the topic of data normalization in microbiome analysis is not trivial.
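
To make the options above concrete, here is a minimal pure-Python sketch of three of the transforms mentioned: percentages, a CLR transform, and Z-scores. Everything in it is an assumption for illustration: the KO identifiers, sample names and counts are made up, the function names are hypothetical, and the table layout (functions as rows, samples as columns) is assumed rather than taken from KO.abund.tsv. The CLR shown uses a simple pseudocount to handle zeros, which is a common simplification and not the robust CLR that gemelli applies.

```python
import math

# Toy raw-count table: rows are functions (KOs), columns are samples.
# All identifiers and counts below are made up for illustration.
samples = ["S1", "S2", "S3"]
counts = {
    "K00001": [10, 40, 25],
    "K00002": [30, 40, 25],
    "K00003": [60, 120, 50],
}


def to_percentages(table):
    """Scale each sample (column) so that its values sum to 100."""
    n = len(next(iter(table.values())))
    totals = [sum(row[j] for row in table.values()) for j in range(n)]
    return {
        k: [100.0 * row[j] / totals[j] for j in range(n)]
        for k, row in table.items()
    }


def clr(table, pseudocount=1.0):
    """Centered log-ratio per sample; the pseudocount avoids log(0)."""
    n = len(next(iter(table.values())))
    logs = {
        k: [math.log(v + pseudocount) for v in row]
        for k, row in table.items()
    }
    # Mean of the log values within each sample (i.e. log of the geometric mean).
    means = [sum(row[j] for row in logs.values()) / len(logs) for j in range(n)]
    return {k: [row[j] - means[j] for j in range(n)] for k, row in logs.items()}


def zscore_rows(table):
    """Z-score each function (row) across samples, using the population SD."""
    out = {}
    for k, row in table.items():
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        # A constant row has SD 0; report zeros instead of dividing by zero.
        out[k] = [(v - mean) / sd if sd > 0 else 0.0 for v in row]
    return out


if __name__ == "__main__":
    for name, result in [
        ("percent", to_percentages(counts)),
        ("clr", clr(counts)),
        ("zscore", zscore_rows(counts)),
    ]:
        print(name)
        for ko, values in result.items():
            print("  ", ko, [round(v, 3) for v in values])
```

Note that percentages and CLR are computed within each sample, while the Z-score here is computed for each function across samples, so Z-scores alone do not correct for differences in sequencing depth. For tools such as DESeq2 that perform their own normalization, none of these transforms should be applied beforehand.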