metageni / SUPER-FOCUS

A tool for agile functional analysis of shotgun metagenomic data
GNU General Public License v3.0

Functional abundance and discrepancy between Subsystem 2 and 3 #64

Closed bishav6708 closed 2 years ago

bishav6708 commented 3 years ago

Hello,

1) I was wondering how the abundance is calculated. The output has two columns, '.fna' and '.fna %'.

2) Also, when I ran SUPER-FOCUS on my dataset, I found a discrepancy between Subsystem 2 and 3. I have 3 sets of metagenomic data. Subsystem 2 showed the presence of CO2 fixation in two sets. In Subsystem 3, however, CO2 fixation was absent from two sets and present only in the third.

Thanks!

metageni commented 3 years ago
  1. The program normalizes the count by the number of subsystems a sequence hits. For example, if "sequence A" hits both "subsystem A" and "subsystem B", the program adds 0.5 to "subsystem A" and 0.5 to "subsystem B". This happens when -n 1 is set.

If you don't want to normalize, set "-n 0" when you run the tool (which I think is the default).

  2. Thanks for reporting it. If you can be more specific, I can check it out and consider changing it. I just used the organization from the SEED database as I downloaded it.

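
The fractional counting described in point 1 can be sketched as follows. This is only an illustration of the idea, not SUPER-FOCUS's actual code: the function name `count_subsystems`, the `hits` dictionary, and the assumption that the unnormalized mode adds a full count of 1 to every subsystem hit are all hypothetical.

```python
from collections import defaultdict


def count_subsystems(hits, normalise=True):
    """Tally subsystem counts from a mapping of query sequence -> subsystems hit.

    Hypothetical helper for illustration only.
    """
    counts = defaultdict(float)
    for subsystems in hits.values():
        if not subsystems:
            continue  # sequences without a hit contribute nothing
        # Normalised: each sequence contributes 1.0 in total, split evenly.
        # Unnormalised (assumed behaviour): each subsystem hit receives 1.0.
        weight = 1.0 / len(subsystems) if normalise else 1.0
        for subsystem in subsystems:
            counts[subsystem] += weight
    return dict(counts)


hits = {
    "sequence A": ["subsystem A", "subsystem B"],
    "sequence B": ["subsystem A"],
}
normalised = count_subsystems(hits, normalise=True)
raw = count_subsystems(hits, normalise=False)
print(normalised)
print(raw)
```

With normalization, "sequence A" adds 0.5 to each of its two subsystems, so the total across all subsystems equals the number of sequences that had a hit.
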
bishav6708 commented 3 years ago

Thanks for the reply! I used the default option. I checked, and the default is -n 1 (which normalizes).

I predicted the ORFs from my contigs and used them as the query. I assume the normalized abundance that SUPER-FOCUS reports is calculated based on the 'NUMBER' of ORFs detected, correct? Sorry, I am a novice in bioinformatics. I am trying to determine functional diversity in my sample based on this output, and I am trying to get a handle on what those numbers actually mean.

For instance, let us assume the input file has 100 ORFs. If the normalized abundance reported for nitrogen metabolism is 5%, does that mean that 5 ORFs were classified as nitrogen metabolism (provided there is only 1 hit for each ORF)? I was just wondering if there was any mapping involved for calculating the abundance?

Apologies for the second issue. My computation had a glitch on it. I solved it. No problems on your end.

Thanks again. Really appreciate it!

metageni commented 3 years ago

> I predicted the ORFs from my contigs and used them as the query. I assume the normalized abundance that SUPER-FOCUS reports is calculated based on the 'NUMBER' of ORFs detected, correct?

Yep, you are right.

> For instance, let us assume the input file has 100 ORFs. If the normalized abundance reported for nitrogen metabolism is 5%, does that mean that 5 ORFs were classified as nitrogen metabolism (provided there is only 1 hit for each ORF)?

The % is relative to the number of ORFs that had a hit, not to the total number of ORFs in the input. For example, if only 60 of your 100 ORFs had a hit, those 5 ORFs would account for 5/60 = 8.33% of the hits rather than 5%.
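
The arithmetic above can be checked with a short sketch. The function name `percent_of_hits` and the variable names are hypothetical, and the numbers are just the ones from this example.

```python
def percent_of_hits(orfs_in_category, orfs_with_hit):
    """Percentage of a category relative to the ORFs that had any hit."""
    if orfs_with_hit == 0:
        return 0.0  # avoid dividing by zero when nothing matched
    return 100.0 * orfs_in_category / orfs_with_hit


total_orfs = 100      # ORFs in the input file
orfs_with_hit = 60    # ORFs that matched something in the database
nitrogen_orfs = 5     # ORFs assigned to nitrogen metabolism

relative_to_hits = percent_of_hits(nitrogen_orfs, orfs_with_hit)
relative_to_input = percent_of_hits(nitrogen_orfs, total_orfs)

print(round(relative_to_hits, 2))   # 8.33
print(round(relative_to_input, 2))  # 5.0
```

The reported percentage corresponds to the first value, so it only equals the share of all input ORFs when every ORF has a hit.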

> I was just wondering if there was any mapping involved for calculating the abundance?

A hit is only counted if it passes the thresholds for e-value (-e), minimum identity (-mi), and minimum alignment length (-ml).
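
As a rough illustration of that kind of filtering, here is a minimal sketch that applies the three thresholds to alignment hits in the standard 12-column BLAST-style tabular format. It is not taken from the SUPER-FOCUS source: the helper names, the sample lines, and the threshold values are assumptions chosen for the example.

```python
def parse_tabular_line(line):
    """Parse one line of 12-column BLAST-style tabular output."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query": fields[0],
        "subject": fields[1],
        "identity": float(fields[2]),   # percent identity
        "length": int(fields[3]),       # alignment length
        "evalue": float(fields[10]),
    }


def passes_filters(hit, max_evalue, min_identity, min_alignment_length):
    """Keep a hit only if it satisfies all three thresholds."""
    return (
        hit["evalue"] <= max_evalue
        and hit["identity"] >= min_identity
        and hit["length"] >= min_alignment_length
    )


# Made-up alignment lines: query, subject, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore
lines = [
    "orf1\tprotA\t85.0\t40\t6\t0\t1\t40\t1\t40\t1e-20\t80.0",
    "orf2\tprotB\t45.0\t50\t27\t0\t1\t50\t1\t50\t1e-10\t50.0",  # identity too low
    "orf3\tprotC\t90.0\t10\t1\t0\t1\t10\t1\t10\t1e-08\t30.0",   # alignment too short
    "orf4\tprotD\t70.0\t30\t9\t0\t1\t30\t1\t30\t0.5\t20.0",     # e-value too high
]

# Illustrative thresholds, playing the roles of -e, -mi and -ml
kept = [
    hit["query"]
    for hit in map(parse_tabular_line, lines)
    if passes_filters(hit, max_evalue=1e-5, min_identity=60.0, min_alignment_length=15)
]
print(kept)
```

Only ORFs with at least one hit that survives this kind of filter would count toward the denominator of the reported percentages.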