jotech / gapseq

Informed prediction and analysis of bacterial metabolic pathways and genome-scale networks
GNU General Public License v3.0
161 stars 33 forks source link

Where do I find information about subsystems and reactions #100

Closed almutwerner closed 2 years ago

almutwerner commented 2 years ago

Hello gapseq team!

-I read in the paper that gapseq was also assigning the subsystems/superpathways for each pathway and I was wondering where in output or internal database files to find that information.

I also want to look into the order of reactions within each pathway and additionally reconstruct a network based on their equations. While I was looking for this information, I came across this https://raw.githubusercontent.com/jotech/gapseq/master/dat/seed_reactions_corrected.tsv . -Are those all of the reactions, that are also in the output-Reactions.tbl, or are some missing? -What is the difference between the Columns "code" and "equation"? E.g. for rxn00008, one of them contains the H+, one doesn't. Also the arrows are sometimes different. -Which columns can I use to identify the possible directions of this reaction? I assume either "reversibility" or "direction", but I don't quite understand how to interpret them and what the difference is.

Waschina commented 2 years ago

Hi and sorry for the late response!

Subsystem information. In gapseq, the subsystems are the MetaCyc pathways that are tested in gapseq find. Here's a small R-code-snippet to extract the subsystem information from the gapseq model RDS file:

library(sybil)
library(data.table)

mod <- readRDS("model_files/anha.RDS")

tmpind <- which(mod@subSys, arr.ind = T)
subsysDT <- data.table(react.id = mod@react_id[tmpind[,1]],
                       react.name = mod@react_name[tmpind[,1]],
                       subsystem = colnames(mod@subSys)[tmpind[,2]])

The subsystem IDs correspond to the MetaCyc pathway IDs.

Reaction equations. Yes, in principle you will find all reactions that can be part of a gapseq model in the table seed_reactions_corrected.tsv. Yet, I need to add that this file is meant as internal program file to gapseq and some columns in this table, which gapseq doesn't need anymore, are not updated by us anymore. The table was originally derived from ModelSEED's biochemistry DB and we are going to restructure this file in gapseq rather soon. Long story short, I would recommend a different approach to retrieve reaction equations:

Continuing the script above, you can add a new column with the reaction equation to the subsysDT data table. To do so, please load this function in your R-environment: https://gist.github.com/Waschina/6fd9d7ac7509d8060477cc25d4918617

Next, it is a one-liner to add the equations:

subsysDT$equation <- print_reaction(mod, subsysDT$react.id)

I hope this helps :)

Silvio

almutwerner commented 2 years ago

Thank you for your answer! It worked well with my data. I do have some followup questions though:

-How are the subsystem IDs and the pathway IDs from -all-Pathways.tbl related? I noticed at least a few of them are the same, but I haven't checked for all of them. My goal is to assign a superpathway (where present) to all of the pathways in the -all-Pathways.tbl. So is it, that when the IDs are the same, it is already a superpathway, and if not, I can match it using the reaction ids?

-Just so I definitely got it right, the reactions with --> and <-- just work in that direction (so A --> B is the same reaction as B <-- A), the ones with <==> work in both?

Waschina commented 2 years ago

I'm happy to hear it worked! The subsystem-IDs should be a subset of all the pathway IDs in the [...]-all-Pathways.tbl. When you say superpathway, do you mean the Superpathways as defined by MetaCyc or the individual pathways parent classes? In any case, I think @jotech might have a recommendation to achieve the aggregation of pathways to larger groups.

About the arrows: Exactly.

almutwerner commented 2 years ago

I think that will be the parent classes. But any ways to aggregate pathways into groups would be appreciated really.

almutwerner commented 2 years ago

The react.id is the same as the ids in column dbhit of -all-Reactions.tbl, right? But some reactions from the .tbl don't have dbhits, or the ids don't match with the one from the RDS, or they have multiple. Is there a way to determine the reaction equation for each reaction of -all-Reactions.tbl?

Waschina commented 2 years ago

Yes, react.id directly corresponds to the reaction ids listed in the dbhit column. But please note, that not all reactions in the dbhit column will also occur in the final model. This is because gapseq is not using the entire model seed reaction database for model construction, but a curated subset.

Reaction equation for each reaction in *-all-Reactions.tbl: Each row in this table refers to a metaCyc reaction (with a few exceptions of gapseq-custom pathways) and the their reaction equations can probably retrieved from metaCyc.

jotech commented 2 years ago

Hi @almutwerner my answer comes a bit late, sorry! I just want to add to what Silvio wrote. You asked about the pathways and how they can be grouped.

My recommendation would be to use the metacyc pathway ontology. The corresponding subgroups for each pathway can be found in the column named hierarchy found in the table gapseq/dat/meta_pwy.tbl. The subgroups can then be used to combine pathways and have an integrated look.

almutwerner commented 2 years ago

Thanks, I'll give it a try!

jotech commented 2 years ago

cool! let us know if there is still something unclear :)