morgannprice / PaperBLAST

PaperBLAST: find papers about a protein or its homologs
http://papers.genomics.lbl.gov
GNU General Public License v3.0

How to interpret results? #18

Open tpellegrinetti opened 5 days ago

tpellegrinetti commented 5 days ago

Hi, thanks for creating this amazing tool!

I have 90 nearly complete MAGs, and I'm looking to identify amino acid auxotrophy using GapMind. Since the web version is challenging to use with so many genomes, I'm using the command-line version.

Following the tutorial, I noticed it generates several tables:

  - aa.hits
  - aa.revhits
  - aa.sum.cand
  - aa.sum.rules
  - aa.sum.steps
  - orgs.faa
  - orgs.org

I’m finding this output a bit confusing. Could you clarify:

  1. Which table should I check to identify candidates with high, medium, and low confidence?

  2. What does each table represent?

Thanks in advance!

morgannprice commented 5 days ago

The *.sum.cand file lists all the potential candidates for each step in each pathway, along with a score (2 for high confidence, 1 for medium confidence, 0 for low confidence).
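As a rough illustration of the 2/1/0 scoring, here is a minimal Python sketch that attaches a confidence label to each row of a *.sum.cand table. It assumes the file is tab-delimited with a header line and that the score lives in a column named `score`; check both against your own output. The function name `label_candidates` and the small example table are made up for illustration.

```python
import csv
import io

# Score-to-label mapping as described above: 2 = high, 1 = medium, 0 = low.
CONFIDENCE = {"2": "high", "1": "medium", "0": "low"}

def label_candidates(handle, score_col="score"):
    """Yield each row of a *.sum.cand table with an added 'confidence' label.

    Assumes a tab-delimited table with a header line; score_col is an
    assumed column name and may need adjusting.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        row["confidence"] = CONFIDENCE.get(row.get(score_col), "unknown")
        yield row

# Tiny made-up table for illustration only (not real GapMind output).
example = io.StringIO(
    "orgId\tpathway\tstep\tscore\n"
    "org1\targ\tstepA\t2\n"
    "org1\targ\tstepB\t0\n"
)
for row in label_candidates(example):
    print(row["step"], row["confidence"])
```

For a real run, you would pass `open("aa.sum.cand", newline="")` instead of the in-memory example.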

The *.sum.steps file lists all the steps in each pathway, along with the best candidate, its score (if there is a candidate), and whether or not this step is on the best path.

The *.sum.rules file lists the number of high-, medium-, and low-confidence steps for each rule. I usually look only at the rows with rule="all" (that is, the totals for the entire biosynthetic pathway), especially when analyzing many genomes.
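For many genomes, the rule="all" filtering could be scripted along these lines. This is only a sketch: it assumes a tab-delimited *.sum.rules file with a header line, and the column names `pathway` and `nLo` (the count of low-confidence steps) are assumptions to verify against your file. A pathway flagged here is a candidate gap worth inspecting in *.sum.steps, not proof of auxotrophy.

```python
import csv
import io
from collections import defaultdict

def pathways_with_gaps(handle, org_col="orgId", pathway_col="pathway",
                       low_col="nLo"):
    """Map each genome to the pathways whose rule="all" row reports
    at least one low-confidence step.

    pathway_col and low_col are assumed column names.
    """
    gaps = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["rule"] != "all":
            continue  # skip per-rule rows, keep only pathway totals
        if int(row[low_col]) > 0:
            gaps[row[org_col]].append(row[pathway_col])
    return dict(gaps)

# Made-up rows for illustration only (not real GapMind output).
example = io.StringIO(
    "orgId\tpathway\trule\tnHi\tnMed\tnLo\n"
    "org1\targ\tall\t7\t0\t1\n"
    "org1\targ\tsome-subrule\t3\t0\t0\n"
    "org1\this\tall\t9\t0\t0\n"
    "org2\this\tall\t6\t1\t2\n"
)
print(pathways_with_gaps(example))
```

With 90 MAGs in a single run, the result gives one entry per genome that has any flagged pathway, which is easier to scan than the full table.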

Depending on how you set up your run, the orgId or gid (genome id) values in that table may be hash-based strings, and the orgs.org table may explain what they mean.
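If the ids do turn out to be hash-based, a lookup built from orgs.org can translate them back to readable names. The sketch below assumes orgs.org is tab-delimited with a header line and has a name column called `genomeName`; that column name is an assumption, so check the header of your own file and adjust `name_col` accordingly.

```python
import csv
import io

def load_org_names(handle, id_col="orgId", name_col="genomeName"):
    """Build an {orgId: name} lookup from an orgs.org table.

    name_col is an assumed column name; adjust it to match your file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: row[name_col] for row in reader}

# Made-up rows for illustration only (not real GapMind output).
example = io.StringIO(
    "orgId\tgenomeName\n"
    "file_abc123\tMAG_001\n"
    "file_def456\tMAG_002\n"
)
names = load_org_names(example)
# Fall back to the raw id when it is missing from the lookup.
print(names.get("file_abc123", "file_abc123"))
```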
