merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
415 stars 142 forks source link

Report gene and model coverage from hmmsearch #2186

Open mschecht opened 7 months ago

mschecht commented 7 months ago

I think users will find it valuable to quickly explore the distribution of gene and model coverage values of anvi-run-hmms --domain-hits-table. To do this I implemented anvi-script-filter-hmm-hits-table --report-gene-and-model-coverage which will add two columns to the hmmsearch domtblout (model_coverage and gene_coverage) and report it as hmm_domtabl_alignment_coverage.tsv in the same directory as the domtblout.

You can test the --report-gene-and-model-coverage and see the output file here:

rm -rf metagenomics-full-test; anvi-self-test --suite metagenomics-full -o metagenomics-full-test

 head metagenomics-full-test/hmm_domtabl_alignment_coverage.tsv
meren commented 7 months ago

Hey @mschecht,

While testing this I run into a problem related to the parsing of the dom table. Here is a reproducible workflow using this contigs-db:

# run hmms
anvi-run-hmms -c P_MARINUS_MIT9301-contigs.db \
              -I Bacteria_71 \
              --domain-hits-table \
              --just-do-it \
              --hmmer-output-dir HMM_OUTPUT

# filter stuff using the dom output table and get an error
anvi-script-filter-hmm-hits-table -c P_MARINUS_MIT9301-contigs.db \
                                  --domain-hits-table HMM_OUTPUT/hmm.domtable \
                                  --hmm-source Bacteria_71 \                                  
                                  --report-gene-and-model-coverage \
                                  --min-model-coverage 0.9

HMM profiles .................................: 9 sources have been loaded: Bacteria_71 (71 genes, domain: bacteria), Archaea_76 (76 genes, domain: archaea), Ribosomal_RNA_23S (2
                                                genes, domain: None), Ribosomal_RNA_28S (1 genes, domain: None), Ribosomal_RNA_5S (5 genes, domain: None), Ribosomal_RNA_16S (3
                                                genes, domain: None), Ribosomal_RNA_12S (1 genes, domain: None), Protista_83 (83 genes, domain: eukarya), Ribosomal_RNA_18S (1
                                                genes, domain: None)
Database Path ................................: P_MARINUS_MIT9301-contigs.db
Domtblout Path ...............................: HMM_OUTPUT/hmm.domtable

Config Error: Doesn't look like a --domtblout... anvi'o can't even... Please look at this
              error message to find out what happened: Error tokenizing data. C error:
              Expected 23 fields in line 2, saw 25

This is happening because of these lines in import_domtblout:

        colnames_coltypes_list = list(zip(*self.dom_table_columns))
        colnames_coltypes_dict = dict(zip(colnames_coltypes_list[0], colnames_coltypes_list[1]))

The first line in dom table has 23 columns, but the next one has 25 if you only consider spaces to split due to changes in the length of the description (i.e., GrpE vs Ribosome-binding factor A):

GrpE                 PF01025.19   166 15                   -            239   4.5e-45  145.5   6.9   1   1     8e-47   5.7e-45  145.1   6.4     6   165    48   208    35   209 0.96 GrpE
RBFA                 PF02033.18   105 125                  -            130   4.2e-26   83.7   1.8   1   1   7.1e-28     5e-26   83.5   1.8     1   105     6   111     6   111 0.94 Ribosome-binding factor A
UPF0054              PF02130.17   128 199                  -            179   2.3e-34  110.2   0.1   1   1   4.2e-36     3e-34  109.8   0.1     8   128    42   178    36   178 0.87 Uncharacterized protein family UPF0054
tRNA-synt_1d         PF00750.19   349 204                  -            604  1.3e-120  394.6   1.9   1   1  2.3e-122  1.7e-120  394.3   1.2    19   349   132   474   122   474 0.92 tRNA synthetases class I (R)
PGK                  PF00162.19   378 212                  -            402  5.7e-163  534.4   0.2   1   1  9.2e-165  6.5e-163  534.2   0.2     2   378    10   391     9   391 0.97 Phosphoglycerate kinase
Ribosomal_L1         PF00687.21   204 220                  -            235   9.5e-56  180.9   0.0   1   1   1.7e-57   1.2e-55  180.6   0.0     2   204    30   219    29   219 0.99 Ribosomal protein L1p/L10e family
SecE                 PF00584.20    55 223                  -             80   1.7e-14   45.8   0.1   1   1   2.8e-16     2e-14   45.5   0.1     2    54    26    78    25    79 0.91 SecE/Sec61-gamma subunits of protein translocation complex
Chorismate_synt      PF01264.21   327 242                  -            365  1.2e-144  473.2   0.1   1   1  1.9e-146  1.3e-144  473.0   0.1     1   327     9   354     9   354 0.96 Chorismate synthase
Pept_tRNA_hydro      PF01195.19   171 269                  -            198   6.2e-55  178.0   0.4   1   1   1.1e-56   7.5e-55  177.7   0.4     2   165     6   186     5   194 0.89 Peptidyl-tRNA hydrolase
AICARFT_IMPCHas      PF01808.18   296 284                  -            517  7.9e-114  372.2   2.2   1   1  1.8e-115  1.3e-113  371.5   2.1     1   295   134   450   134   451 0.94 AICARFT/IMPCHase bienzyme

If you change the description for GrpE with GrpE x x manually in the table output, then you get this error, which shows that this is due to improper parsing of the table:

anvi-script-filter-hmm-hits-table -c P_MARINUS_MIT9301-contigs.db \
                                  --domain-hits-table HMM_OUTPUT/hmm.domtable \
                                  --hmm-source Bacteria_71 \                                  
                                  --report-gene-and-model-coverage \
                                  --min-model-coverage 0.9

HMM profiles .................................: 9 sources have been loaded: Bacteria_71 (71 genes, domain: bacteria), Archaea_76 (76 genes, domain: archaea), Ribosomal_RNA_23S (2
                                                genes, domain: None), Ribosomal_RNA_28S (1 genes, domain: None), Ribosomal_RNA_5S (5 genes, domain: None), Ribosomal_RNA_16S (3
                                                genes, domain: None), Ribosomal_RNA_12S (1 genes, domain: None), Protista_83 (83 genes, domain: eukarya), Ribosomal_RNA_18S (1
                                                genes, domain: None)
Database Path ................................: P_MARINUS_MIT9301-contigs.db
Domtblout Path ...............................: HMM_OUTPUT/hmm.domtable

Config Error: Doesn't look like a --domtblout... anvi'o can't even... Please look at this
              error message to find out what happened: Error tokenizing data. C error:
              Expected 25 fields in line 3, saw 26

The other instances that work with dom table knows that this is a stupid output file format with a variable number of length for the description column, and uses maxsplit argument to capture the table structure -- like here in the same file as an example:

 with open(self.domtblout) as dom_table:
            dom_table_entries = [l.strip('\n').split(maxsplit=len(self.dom_table_columns) - 1) for l in dom_table.readlines()]

FYI :)