jotech / gapseq

Informed prediction and analysis of bacterial metabolic pathways and genome-scale networks
GNU General Public License v3.0
150 stars 32 forks source link

Exploit gapfilled model for metabolic comparison #219

Open mgabriell1 opened 3 months ago

mgabriell1 commented 3 months ago

Hi, First of all thanks for developing and maintaining Gapseq! From what I've understood the -Pathways.tbl file summarise the completeness of the MetaCYC pathways found in the genome analyzed. After that gapfilling is performed to (as said in the name ;)) fill the gaps potential arising from incompleteness or missed genes. If someone would want to look at the diversity among different metabolic profiles (e.g., as done here: 10.1038/s43705-023-00221-z, incompleteness (either observed or hidden) might be skew the results. Given this, I was thinking that creating another version of -Pathways.tbl after gapfilling would address the issue as this in theory would take care of incompleteness in the pathways completion percentages. Do you think this makes sense? If so, what would be the best method to do so? Thanks again!

jotech commented 3 months ago

Hi @mgabriell1

Thank you for your interest in gapseq!

What you are proposing is indeed an interesting application! My first guess would be to check out the reaction attributes table of the final model after gap-filling. Using that table, you can trace the origin of each reaction (added because of sequence homology, gap filling, etc.), which can then be linked back to the pathway level. More details can be found here: https://gapseq.readthedocs.io/en/latest/tutorials/traceability.html

Let me know if this can work for you.

mgabriell1 commented 2 months ago

Hi, Thanks for the quick reply!

That table seems to contain all the info that I need, but I am not completely sure on how to link it to meta metaCYC pathways. Using the ecoli.RDS test model I see in the reaction attribute table there are 3029 reactions which are distributed in 560 pathways, while in the all-Pathways.tbl file there are 2900 pathways. So in the model a lot are excluded (makes sense), but I would also like to know which pathways are not complete.

Looking at the source file I found the file meta_rea_pwy-gapseq.tbl which seems to contain the mapping between pathways and reactions which could be used to estimate the pathway completeness, but I'm having troubles linking that file with the reactions attribute table. For example, \|PWY-7184\| (a random pwy) has a completeness of 88% in all-Pathways.tbl and the rxn DCDPKIN-RXN is among the ones found. However, reaction attribute table this rxn ID is not present in the column rxn, but rather in the column biocyc ID. So if I would use the column rxn I would not detect this. Conversely, column biocyc ID is often empty and so I cannot use that column as well. Other rxn ids are present in the column gapseq in meta_rea_pwy-gapseq.tbl, but often these are empty and so again not useful to map the two tables. Given that the information seems to be linked across different columns, I'm not sure if I'm using the correct information. Could you confirm that this seems to be the right track? If so, I guess I can check the different columns and then check whether any of them has a match.

Another question that I have is: gapseq estimates pwy completeness just by dividing the number of genes present over the total number of genes? I'm asking this to understand whether I wonder to consider "parallel" pwy routes or that has already be taken into account in metaCYC.

I hope this was clear enough. Thanks again for the support!

Waschina commented 2 months ago

Hi @mgabriell1

I would not use the meta_rea_pwy-gapseq.tbl. Instead, the problem could be addressed for example like this:

library(sybil)
library(data.table)

# read model data
mod_filled <- readRDS("yogurt/ldel.RDS")
mod_draft  <- readRDS("yogurt/ldel-draft.RDS")
gs_findR   <- fread("yogurt/ldel-all-Reactions.tbl")
gs_findP   <- fread("yogurt/ldel-all-Pathways.tbl")

# read and prepare pathway DB
pwyDB <- rbind(fread("~/Software/gapseq/dat/meta_pwy.tbl"),
               fread("~/Software/gapseq/dat/custom_pwy.tbl"))
pwyDB <- pwyDB[!duplicated(id)]
pwyDB <- pwyDB[, .(id, name, spont, reaId)]
pwyDB <- pwyDB[id %in% gs_findP$ID]
pwyDB[, reaNr := length(unlist(strsplit(reaId, ","))), by = id]
pwyDB[, spontNr := length(unlist(strsplit(spont, ","))), by = id]

# identify which reactions were added during gapfilling
rxnGF <- mod_filled@react_id[!(mod_filled@react_id %in% mod_draft@react_id)]
rxnGF <- rxnGF[!grepl("^EX_",rxnGF)] # exclude exchange reactions
rxnGF <- gsub("_c0$","",rxnGF)

# get BioCyc-IDs of added reactions
newBC <- lapply(rxnGF, function(x) gs_findR[grepl(x, dbhit), rxn])
newBC <- unique(unlist(newBC))

# get Pathways, in which die newBC participate
newBC_pwys <- lapply(newBC, function(x) gs_findR[rxn == x, unique(pathway)])
names(newBC_pwys) <- newBC

# add new reaction to pathway table
gs_findP$newReactionsFound <- ""
for(rxni in newBC) {
  for(pwyi in newBC_pwys[[rxni]]) {
    gs_findP[ID == pwyi, newReactionsFound := paste0(newReactionsFound,rxni, sep = " ")]
  }
}
gs_findP[, newReactionsFound := gsub("^ | $","",newReactionsFound)] # remove trailing spaces

# merge and recalc completeness
gs_findP <- merge(gs_findP, pwyDB, by.x = "ID", by.y = "id")
gs_findP[,Nold := length(unlist(strsplit(ReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew := length(unlist(strsplit(newReactionsFound, " "))), by = "ID"]

gs_findP[,C_old := (Nold + VagueReactions)/(reaNr - spontNr)*100]
gs_findP[,C_new := (Nold + Nnew + VagueReactions)/(reaNr - spontNr)*100]

I hope the code comments make it clear what happens at each step. The examples use gapseq data from here: https://github.com/Waschina/gapseq.tutorial.data/tree/master/yogurt

I noticed that there seem to be some inconsistencies in the way the pathway completeness is calculated in the ...-all-Pathways.tbl. I'll investigate this.

Waschina commented 2 months ago

Small addition: With the example above, I noticed that there is a small inconsistency in how the completeness reported in the ...-all-pathways.tbl is calculated. See issue #216 . We are working on a fix

mgabriell1 commented 2 months ago

Hi Silvio, Thanks a lot for the code! I've tried it with the tutorial file and one of my genomes (attached testData.zip) and I've noticed that a few pathways have a new completeness estimation above 100%. I found a couple of reasons while that was the case:

With these edits I've managed to reduce the number of pathways more than 100% complete, but in few cases this still occurred. All these pathways present vague reactions and their number (derived from -all-Pathways.tbl) is higher than the number of reactions in pwyDB (e.g., "|PWY-7695|" is indicated with 12 vague reactions, but only 7 reactions in pwyDB). Could it be an issue with the reactions database?

Here is the edited code (not the prettiest, but it seems to do the job):


###
library(sybil)
library(data.table)

# read model data
# mod_filled <- readRDS("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel.RDS")
# mod_draft  <- readRDS("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel-draft.RDS")
# gs_findR   <- fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel-all-Reactions.tbl")
# gs_findP   <- fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel-all-Pathways.tbl")

mod_filled <- readRDS("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_models/Legio00005_manualCheck-contigs.RDS")
mod_draft  <- readRDS("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_models/model_drafts/Legio00005_manualCheck-contigs-draft.RDS")
gs_findR   <- fread("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_reactions/Legio00005_manualCheck-contigs-all-Reactions.tbl.gz")
gs_findP   <- fread("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_pathways/Legio00005_manualCheck-contigs-all-Pathways.tbl")

# read and prepare pathway DB
pwyDB <- rbind(fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/dat/meta_pwy.tbl"),
               fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/dat/custom_pwy.tbl"))
pwyDB <- pwyDB[!duplicated(id)]
pwyDB <- pwyDB[, .(id, name, spont, reaId)]
pwyDB <- pwyDB[id %in% gs_findP$ID]
pwyDB[, reaNr := length(unlist(strsplit(reaId, ","))), by = id]
pwyDB[, spontNr := length(unlist(strsplit(spont, ","))), by = id]

# identify which reactions were added during gapfilling
rxnGF <- mod_filled@react_id[!(mod_filled@react_id %in% mod_draft@react_id)]
rxnGF <- rxnGF[!grepl("^EX_",rxnGF)] # exclude exchange reactions
rxnGF <- gsub("_c0$","",rxnGF)

# get BioCyc-IDs of added reactions
newBC <- lapply(rxnGF, function(x) gs_findR[grepl(x, dbhit), rxn])
newBC <- unique(unlist(newBC))

# get Pathways, in which die newBC participate
newBC_pwys <- lapply(newBC, function(x) gs_findR[rxn == x, unique(pathway)])
names(newBC_pwys) <- newBC

# add new reaction to pathway table
gs_findP$newReactionsFound <- ""
for(rxni in newBC) {
  for(pwyi in newBC_pwys[[rxni]]) {
    gs_findP[ID == pwyi, newReactionsFound := paste0(newReactionsFound,rxni, sep = " ")]
  }
}
gs_findP[, newReactionsFound := gsub("^ | $","",newReactionsFound)] # remove trailing spaces

# # Remove duplicated entries and ones not present in pwyDB
for (i in 1:nrow(gs_findP)){
  #print(c(i, gs_findP[i, "newReactionsFound"]))

  newReacts_tmp <- unlist(strsplit(as.character(gs_findP[i, "newReactionsFound"]), " "))
  oldReacts_tmp <- unlist(strsplit(as.character(gs_findP[i, "ReactionsFound"]), " "))
  gs_findP[i, "newReactionsFound"] <- paste0(newReacts_tmp[!(newReacts_tmp %in% oldReacts_tmp)], collapse  = " ")

  allReacts_tmp <- unlist(strsplit(pwyDB[ id == as.character(gs_findP[i, ID]), reaId], ","))
  gs_findP[i, "ReactionsFound"] <- paste0(oldReacts_tmp[(oldReacts_tmp %in% allReacts_tmp)], collapse  = " ")
}

# merge and recalc completeness
gs_findP <- merge(gs_findP, pwyDB, by.x = "ID", by.y = "id")
gs_findP[,Nold := length(unlist(strsplit(ReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew := length(unlist(strsplit(newReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew_spont := length(Reduce(intersect, list(unlist(strsplit(newReactionsFound, " ")), unlist(strsplit(spont, ","))))), by = "ID"]

gs_findP[,C_old := (Nold + VagueReactions)/(reaNr - spontNr)*100]
gs_findP[,C_new := (Nold + Nnew + VagueReactions)/(reaNr - spontNr)*100]

gs_findP[,diff_C := C_new - C_old]
gs_findP[,C_new_corr := (Nold + Nnew - Nnew_spont + VagueReactions)/(reaNr - spontNr)*100] # Remove new spontaneous reactions

Thanks again for the help!