MinPath (Minimal set of Pathways) is a parsimony approach for biological pathway reconstructions using protein family predictions, achieving a more conservative, yet more faithful, estimation of the biological pathways for a query dataset.
Different results using similar methods on a model organism (E. coli) #9
I want to run this tool on ~100k de novo genomes, but first I want to make sure I understand the results and how to interpret them.
I've tested it on E. coli in the following scenarios:
enzymes (mode=-ec) - I created a SmartTable in EcoCyc to get all of the enzymes (included in Archive.zip).
enzymes-from-kegg (mode=-ec) - I ran KofamScan against all of the proteins from the NCBI reference assembly and kept only the KOs with enzyme hits. I then converted each KO to the EC numbers listed in its KO description.
kegg-enzymes (mode=-ko) - I ran KofamScan against all of the proteins from the NCBI reference assembly and kept only the KOs with enzyme hits.
kegg-full (mode=-ko) - I ran KofamScan against all of the proteins from the NCBI reference assembly and kept all KOs with hits.
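A minimal sketch of the KO-to-EC conversion described for the enzymes-from-kegg scenario, assuming the KO descriptions carry EC numbers in KEGG's usual `[EC:...]` suffix. The function name `ko_definition_to_ec` and the example dictionary are illustrative, not part of MinPath or KofamScan.

```python
import re

# KEGG-style EC annotation inside a KO description, e.g. "[EC:2.7.1.1 2.7.1.2]".
# Assumption: every EC number for a KO sits inside such a bracketed block.
EC_BLOCK = re.compile(r"\[EC:([^\]]+)\]")


def ko_definition_to_ec(definition):
    """Return the EC numbers listed in a KO description (empty list if none)."""
    ecs = []
    for block in EC_BLOCK.findall(definition):
        ecs.extend(block.split())
    return ecs


# Illustrative descriptions only; real input would come from the KofamScan KO list.
examples = {
    "K00844": "hexokinase [EC:2.7.1.1]",
    "K00001": "alcohol dehydrogenase [EC:1.1.1.1]",
    "K02030": "polar amino acid transport system substrate-binding protein",
}

for ko, definition in examples.items():
    print(ko, ko_definition_to_ec(definition))
```

A KO without an `[EC:...]` block yields an empty list, so it drops out of the EC-based input. One KO can also expand to several EC numbers, which is one reason the EC-derived and KO-derived inputs may not be equivalent.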
I checked for the number of pathways that passed minpath thresholds:
import glob

import pandas as pd

# Count the pathways that pass MinPath in each scenario's report.txt.
for fp in glob.glob("/Users/jolespin/Cloud/Informatics/Development/Forks/MinPath/test/e-coli/*/report.txt"):
    scenario = fp.split("/")[-2]  # the directory name identifies the scenario
    data = []
    with open(fp, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                # Everything after " name " is the free-text pathway name.
                left, right = line.split(" name ", 1)
                fields = left.split()
                reconstruction_available = fields[3]
                row = [
                    fields[1],
                    fields[2],
                    reconstruction_available if reconstruction_available != "n/a" else False,
                    bool(int(fields[5])),  # naive flag; assumes the report writes 0/1
                    bool(int(fields[7])),  # minpath flag; assumes the report writes 0/1
                    fields[9],
                    fields[11],
                    right.strip(),
                ]
                data.append(row)
    df = pd.DataFrame(data, columns=["id_minpath", "database", "reconstruction_available", "naive_reconstructed", "minpath_passed", "number_of_families_in_reference_pathway", "number_of_families_annotated", "name"])
    df = df.set_index("id_minpath")
    print(scenario, df["minpath_passed"].sum())
I got the following results:
What I'm unclear on is why the number of pathways that pass differs so much between these methods.
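One way to dig into the difference, beyond comparing totals, might be to check which pathway IDs pass in each scenario and how much the sets overlap. The sketch below is a hypothetical helper (`compare_scenarios` is not part of MinPath), and the toy IDs are made up. It assumes the scenarios being compared report pathways from the same database, which the `database` column of each report should confirm; if two modes use different pathway IDs, their sets cannot be compared directly.

```python
from itertools import combinations


def compare_scenarios(passed):
    """Compare every pair of scenarios.

    passed: dict mapping scenario name -> set of pathway IDs that passed MinPath.
    Returns a list of (a, b, n_shared, n_only_a, n_only_b) tuples.
    """
    rows = []
    for a, b in combinations(sorted(passed), 2):
        shared = passed[a] & passed[b]
        rows.append((a, b, len(shared), len(passed[a] - shared), len(passed[b] - shared)))
    return rows


# Made-up pathway IDs for illustration; real sets would come from each report,
# e.g. set(df.index[df["minpath_passed"]]) for the DataFrame built per scenario.
toy = {
    "kegg-enzymes": {"00010", "00020", "00030"},
    "kegg-full": {"00010", "00020", "02010", "03010"},
}

for row in compare_scenarios(toy):
    print(row)
```

Listing the pathways found by only one scenario should show whether the extra hits in a larger input are non-enzymatic pathways or additional metabolic ones.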
What method would you recommend for de novo genomes?
Archive.zip