Possible inconsistencies in pathway completeness percentages.

avandieren commented 1 month ago

Hi,

This is an amazing tool!

I am doing a comparative metabolic analysis among several species of symbiotic bacteria. When I look at the individual all-Pathways.tbl files, the percent completeness based on reactions found is above the 66% threshold for several pathways, yet the "Prediction" is still labeled "FALSE". Is this a glitch, or could you explain why this happens?

[Image: issue_1]

Secondly, for some pathways a completeness percentage is given, but no corresponding reactions are found. I thought it might have to do with the "vague reactions found", but this result seems inconsistent with the rest. Is this a glitch, or am I missing something?

[Image: issue_2]

Thanks

Calvin2077 commented 1 month ago

In my experience, a likely answer to your first question is that the pathways in your species are missing key enzymes or contain vague reactions (see below).

A possible answer to your second question: 'vague reactions' are the "Number of reactions without available sequences" (from the GapSeq tutorial pages). More specifically: "In cases in which no sequence data is available for specific reactions, the status of the reactions is set to ‘vague’ and these reactions do not count as missing if they account for less than vagueCutoff of the total reactions of a pathway or subsystem. We used a value of 1/3 for this threshold." (GapSeq publication)
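
To make the quoted rule concrete, here is a minimal sketch of how I read it. This is not gapseq's actual code: the function name `vague_excused` and the example counts are my own assumptions, and only the 1/3 cutoff comes from the quote.

```python
from fractions import Fraction


def vague_excused(n_vague, n_total, vague_cutoff=Fraction(1, 3)):
    """Return True if vague reactions would NOT count as missing.

    Literal reading of the quoted sentence: vague reactions are excused
    when they make up less than vague_cutoff of all pathway reactions.
    Hypothetical helper, not part of gapseq.
    """
    if n_total <= 0:
        raise ValueError("a pathway needs at least one reaction")
    # Fractions avoid floating-point surprises right at the cutoff.
    return Fraction(n_vague, n_total) < vague_cutoff


# Hypothetical counts, for illustration only.
print(vague_excused(1, 5))  # 1/5 is below 1/3
print(vague_excused(2, 3))  # 2/3 is not below 1/3
```

Under this reading, a pathway where most reactions lack reference sequences would still be treated as having missing reactions.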

This is just my understanding of the results as a heavy GapSeq user. I am not a developer of this application, but I totally wish I was.

Waschina commented 1 week ago

Hi,

Thanks for reporting this. I can confirm that there is an inconsistency in the completeness calculations.

In principle, the completeness (C) is calculated as:

$$ C = \frac{N_{found} + N_{vague}}{N_{pwy} - N_{spont}} $$

where $N_{found}$ is the number of reactions found, $N_{vague}$ is the number of reactions in the pathway without reference sequences, $N_{pwy}$ is the number of all reactions in the pathway, and $N_{spont}$ is the number of spontaneous reactions in the pathway.

It seems that the number of vague reactions is not consistently used in the calculations.

An example where the calculations are correct (5 reactions in total, 0 found, 1 vague, 0 spontaneous):

            ID                                       Name Prediction Completeness VagueReactions KeyReactions KeyReactionsFound ReactionsFound
1: FUCDEGRA-NP L-Fucose degradation (non-phosphorylating)      FALSE           20              1            0                 0

An example where it did not work (3 reactions in total, 0 found, 2 vague, 0 spontaneous):

                       ID                                             Name Prediction Completeness VagueReactions KeyReactions KeyReactionsFound ReactionsFound
1: INULIN-DEGRADATION-EXO Inulin degradation(11xFru,1xGlc) (extracellular)      FALSE            0              2            1                 0
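
The two rows can be checked against the formula above with a short script. This is only a sketch in Python, not the code gapseq runs: the function name `completeness`, the 1-point tolerance, and the tuple layout are assumptions made for illustration, while the counts and reported values are taken from the two examples.

```python
def completeness(n_found, n_vague, n_pwy, n_spont):
    """Completeness in percent: (N_found + N_vague) / (N_pwy - N_spont)."""
    denominator = n_pwy - n_spont
    if denominator <= 0:
        raise ValueError("pathway has no non-spontaneous reactions")
    return 100.0 * (n_found + n_vague) / denominator


# (pathway ID, found, vague, total, spontaneous, reported completeness)
examples = [
    ("FUCDEGRA-NP", 0, 1, 5, 0, 20),
    ("INULIN-DEGRADATION-EXO", 0, 2, 3, 0, 0),
]

results = {}
for pwy_id, found, vague, total, spont, reported in examples:
    expected = completeness(found, vague, total, spont)
    # Assumed tolerance of 1 percentage point to allow for rounding.
    consistent = abs(expected - reported) < 1
    results[pwy_id] = (expected, consistent)
    print(f"{pwy_id}: formula {expected:.1f}, reported {reported}, consistent: {consistent}")
```

For the first pathway the formula gives 1/5, which matches the reported 20. For the second it gives 2/3, while the table reports 0.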

We are working on a fix.

jotech/gapseq, issue #216