What kind of normalization does SUPER-FOCUS perform? #46

adlape95 commented 5 years ago

Hi,

I have just installed and tested SUPER-FOCUS on a metagenomic dataset. Everything ran smoothly and quickly, but when I opened the output I found that the counts assigned to each subsystem were floating-point numbers (I expected integers).

After reading the manual and the paper (including the methodological paper), I found that SUPER-FOCUS performs some kind of normalization, but I do not know for sure how it carries out that normalization.

I want to use the output to run a PCA (I think I will use the relative data to avoid library size bias) and to detect significant features between groups using edgeR or DESeq. In this case, would you suggest using normalized data or raw counts?

Thank you very much in advance.

metageni commented 5 years ago

Hi @adlape95 - this is a common question.

You are not getting integer counts because a read sequence can hit multiple subsystems, and the program normalizes each read's count by the number of subsystems it hits. For example, if "sequence A" hits "subsystem A" and "subsystem B", the program adds 0.5 to "subsystem A" and 0.5 to "subsystem B". It looks like you want integer counts, so you would need to run superfocus with the flag -n 0, which does not normalize each query count based on the number of hits.
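To make the arithmetic concrete, here is a minimal Python sketch of the counting scheme described above. It is not SUPER-FOCUS's actual code: the function `count_subsystems`, its `normalize` parameter, and the `hits` data are hypothetical names made up for illustration, and the unnormalized branch assumes that each hit subsystem simply receives a full count of 1.

```python
def count_subsystems(read_hits, normalize=True):
    """Tally per-subsystem counts from a mapping of read ID -> subsystems hit.

    Hypothetical helper for illustration only, not part of SUPER-FOCUS.
    normalize=True splits each read's single count evenly across the
    subsystems it hits; normalize=False is assumed to mimic `-n 0`,
    giving every hit subsystem a full count of 1.
    """
    counts = {}
    for subsystems in read_hits.values():
        unique = set(subsystems)  # count each subsystem once per read
        if not unique:
            continue  # read with no hits contributes nothing
        weight = 1 / len(unique) if normalize else 1
        for name in unique:
            counts[name] = counts.get(name, 0) + weight
    return counts


# Made-up example data: "sequence A" hits two subsystems, "sequence B" hits one.
hits = {
    "sequence A": ["subsystem A", "subsystem B"],
    "sequence B": ["subsystem A"],
}

normalized = count_subsystems(hits, normalize=True)
raw = count_subsystems(hits, normalize=False)

print(normalized)
print(raw)
```

With normalization, the totals add up to the number of reads that had at least one hit; without it, they add up to the number of read-to-subsystem hits, so every value is an integer.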

Best

adlape95 commented 5 years ago

Perfect. All clear now. Thank you very much.

Best.
