mgtools / MinPath

MinPath (Minimal set of Pathways) is a parsimony approach for biological pathway reconstructions using protein family predictions, achieving a more conservative, yet more faithful, estimation of the biological pathways for a query dataset.

Different results using similar methods on a model organism (E. coli) #9

Open jolespin opened 1 month ago

jolespin commented 1 month ago

I want to run this tool on ~100k de novo genomes, but first I want to make sure I understand the results and how to interpret them.

I've tested it on E. coli in several scenarios, one output directory per scenario (the directory names appear in the results below).

For each scenario, I counted the pathways that passed the MinPath threshold:

import glob

import pandas as pd

for fp in glob.glob("/Users/jolespin/Cloud/Informatics/Development/Forks/MinPath/test/e-coli/*/report.txt"):
    # The parent directory name identifies the scenario
    run_id = fp.split("/")[-2]
    data = list()
    with open(fp, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                # Everything before "  name  " is whitespace-separated key/value fields
                left, right = line.split("  name  ", 1)
                fields = left.split()
                reconstruction_available = fields[3]

                row = [
                    fields[1],
                    fields[2],
                    reconstruction_available if reconstruction_available != "n/a" else False,
                    bool(int(fields[5])),  # naive flag, assumed to be written as 0/1
                    bool(int(fields[7])),  # minpath flag, assumed to be written as 0/1
                    fields[9],
                    fields[11],
                    right,
                ]
                data.append(row)
    df = pd.DataFrame(data, columns=["id_minpath", "database", "reconstruction_available", "naive_reconstructed", "minpath_passed", "number_of_families_in_reference_pathway", "number_of_families_annotated", "name"])
    df = df.set_index("id_minpath")
    print(run_id, df["minpath_passed"].sum())

I got the following results:

enzymes 388
enzymes-from-kegg 507
kegg-full 105
kegg-enzymes 94

What I'm unclear on is why the number of pathways that pass MinPath differs so much between these methods.
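One way to dig into the difference is to compare which pathways passed in each run, not just how many. Below is a minimal sketch, assuming the same report line layout that the parsing code above relies on. The helper names (`passed_pathways`, `pairwise_overlap`) and the sample report lines are hypothetical and made up for illustration; they are not taken from the attached archive.

```python
from itertools import combinations


def passed_pathways(report_lines):
    """Return the set of (database, pathway_id) pairs whose minpath flag is 1."""
    passed = set()
    for line in report_lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: key/value fields, then "  name  ", then the pathway name
        left, _name = line.split("  name  ", 1)
        fields = left.split()
        if int(fields[7]):  # field after the "minpath" key, assumed 0/1
            passed.add((fields[2], fields[1]))
    return passed


def pairwise_overlap(runs):
    """Map each pair of run names to (shared, only_in_first, only_in_second) counts."""
    summary = {}
    for a, b in combinations(sorted(runs), 2):
        summary[(a, b)] = (
            len(runs[a] & runs[b]),
            len(runs[a] - runs[b]),
            len(runs[b] - runs[a]),
        )
    return summary


# Hypothetical report lines, for illustration only
run_a_lines = [
    "path 00010 kegg n/a  naive 1  minpath 1  fam0  56  fam-found  20  name  Pathway A",
    "path 00020 kegg n/a  naive 1  minpath 1  fam0  30  fam-found  12  name  Pathway B",
    "path 00030 kegg n/a  naive 1  minpath 0  fam0  25  fam-found  3  name  Pathway C",
]
run_b_lines = [
    "path 00010 kegg n/a  naive 1  minpath 1  fam0  56  fam-found  18  name  Pathway A",
    "path 00030 kegg n/a  naive 1  minpath 1  fam0  25  fam-found  9  name  Pathway C",
]

runs = {
    "run_a": passed_pathways(run_a_lines),
    "run_b": passed_pathways(run_b_lines),
}
overlap = pairwise_overlap(runs)
print(overlap)
```

With the real data, `runs` could be built by reading each report.txt and keying the result by its directory name. Pathways are keyed by (database, id) here, so runs that report against different reference databases would be expected to share few or no entries, which would make their raw counts hard to compare directly.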

What method would you recommend for de novo genomes?

Archive.zip