metageni / SUPER-FOCUS

A tool for agile functional analysis of shotgun metagenomic data
GNU General Public License v3.0

Superfocus normalization? #52

Closed Thomieh73 closed 4 years ago

Thomieh73 commented 4 years ago

Hey, I am just trying to wrap my head around the normalization done by SUPER-FOCUS. As I understand it from closed issue #46, if a read matches two subsystems, the hit is split: 0.5 goes to one subsystem and 0.5 to the other. So for that particular read, both subsystems then have exactly the same score.

And to count the abundance of a subsystem, you sum all the matches to that subsystem. For example, if one read matches only subsystem A, it contributes a score of 1, while another read that matches A, B and C contributes only 0.333 to A. The total score for subsystem A is then 1.333.
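To make the arithmetic concrete, here is a minimal sketch of the fractional counting as I understand it. It is not SUPER-FOCUS code: the `read_hits` dictionary and the `fractional_counts` function are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Hypothetical input: each read mapped to the subsystems it matched.
read_hits = {
    "read_1": ["A"],            # matches only A
    "read_2": ["A", "B", "C"],  # matches A, B and C
}

def fractional_counts(read_hits):
    """Split each read's single count equally among the subsystems it hit."""
    totals = defaultdict(float)
    for subsystems in read_hits.values():
        if not subsystems:
            continue  # a read without hits contributes nothing
        share = 1.0 / len(subsystems)
        for subsystem in subsystems:
            totals[subsystem] += share
    return dict(totals)

counts = fractional_counts(read_hits)
print(round(counts["A"], 3))  # 1 + 1/3, prints 1.333
```

With this scheme the scores over all subsystems add up to the number of reads that had at least one hit (2 in this example).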

I tried to find the reason for this way of normalizing the data in your manuscript, but I did not find it. Do you have any references that discuss this way of calculating abundances? And why is this preferred over taking the top hit of a search?

Would love to see a bit more on this.

Why would that be preferred over counting the actual hits?

metageni commented 4 years ago

@Thomieh73 Sorry for the late answer - for some reason I only saw this now.

Short answer: there is no big reason behind the normalization. You can use another flavor of the normalization with the -n parameter.

The rationale is simply that there are different styles of normalizing data. SUPER-FOCUS supports two flavors. The first is the one you described: when one read hits two subsystems, each subsystem gets 0.5.

However, the tool also provides a normalization where each subsystem gets 1 in the counts.
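As a rough illustration of how the two flavors differ, here is a small sketch. It is not the actual SUPER-FOCUS implementation: the `subsystem_counts` function and its `split` flag are hypothetical and do not correspond to specific values of -n.

```python
def subsystem_counts(read_hits, split=True):
    """Count hits per subsystem.

    split=True  -> a read's count is divided among the subsystems it hit.
    split=False -> every subsystem a read hit gets a full count of 1.
    """
    totals = {}
    for subsystems in read_hits.values():
        if not subsystems:
            continue  # reads without hits are ignored
        weight = 1.0 / len(subsystems) if split else 1.0
        for subsystem in subsystems:
            totals[subsystem] = totals.get(subsystem, 0.0) + weight
    return totals

# Hypothetical example: one read hits A only, another hits A, B and C.
hits = {"read_1": ["A"], "read_2": ["A", "B", "C"]}
split_counts = subsystem_counts(hits, split=True)   # A is 1 + 1/3
whole_counts = subsystem_counts(hits, split=False)  # A is 2, B and C are 1
```

In the split flavor the counts sum to the number of reads with hits, while in the other flavor they sum to the number of read-subsystem matches, so multi-hit reads weigh more.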

Please play with the parameter -n.

Sorry again for the late answer.

metageni commented 4 years ago

I guess this was addressed in another thread.