Open mschecht opened 7 months ago
Hey @mschecht,
While testing this I run into a problem related to the parsing of the dom table. Here is a reproducible workflow using this contigs-db:
# run hmms
anvi-run-hmms -c P_MARINUS_MIT9301-contigs.db \
-I Bacteria_71 \
--domain-hits-table \
--just-do-it \
--hmmer-output-dir HMM_OUTPUT
# filter stuff using the dom output table and get an error
anvi-script-filter-hmm-hits-table -c P_MARINUS_MIT9301-contigs.db \
--domain-hits-table HMM_OUTPUT/hmm.domtable \
--hmm-source Bacteria_71 \
--report-gene-and-model-coverage \
--min-model-coverage 0.9
HMM profiles .................................: 9 sources have been loaded: Bacteria_71 (71 genes, domain: bacteria), Archaea_76 (76 genes, domain: archaea), Ribosomal_RNA_23S (2
genes, domain: None), Ribosomal_RNA_28S (1 genes, domain: None), Ribosomal_RNA_5S (5 genes, domain: None), Ribosomal_RNA_16S (3
genes, domain: None), Ribosomal_RNA_12S (1 genes, domain: None), Protista_83 (83 genes, domain: eukarya), Ribosomal_RNA_18S (1
genes, domain: None)
Database Path ................................: P_MARINUS_MIT9301-contigs.db
Domtblout Path ...............................: HMM_OUTPUT/hmm.domtable
Config Error: Doesn't look like a --domtblout... anvi'o can't even... Please look at this
error message to find out what happened: Error tokenizing data. C error:
Expected 23 fields in line 2, saw 25
This is happening because of these lines in import_domtblout
:
colnames_coltypes_list = list(zip(*self.dom_table_columns))
colnames_coltypes_dict = dict(zip(colnames_coltypes_list[0], colnames_coltypes_list[1]))
The first line in dom table has 23 columns, but the next one has 25 if you only consider spaces to split due to changes in the length of the description (i.e., GrpE
vs Ribosome-binding factor A
):
GrpE PF01025.19 166 15 - 239 4.5e-45 145.5 6.9 1 1 8e-47 5.7e-45 145.1 6.4 6 165 48 208 35 209 0.96 GrpE
RBFA PF02033.18 105 125 - 130 4.2e-26 83.7 1.8 1 1 7.1e-28 5e-26 83.5 1.8 1 105 6 111 6 111 0.94 Ribosome-binding factor A
UPF0054 PF02130.17 128 199 - 179 2.3e-34 110.2 0.1 1 1 4.2e-36 3e-34 109.8 0.1 8 128 42 178 36 178 0.87 Uncharacterized protein family UPF0054
tRNA-synt_1d PF00750.19 349 204 - 604 1.3e-120 394.6 1.9 1 1 2.3e-122 1.7e-120 394.3 1.2 19 349 132 474 122 474 0.92 tRNA synthetases class I (R)
PGK PF00162.19 378 212 - 402 5.7e-163 534.4 0.2 1 1 9.2e-165 6.5e-163 534.2 0.2 2 378 10 391 9 391 0.97 Phosphoglycerate kinase
Ribosomal_L1 PF00687.21 204 220 - 235 9.5e-56 180.9 0.0 1 1 1.7e-57 1.2e-55 180.6 0.0 2 204 30 219 29 219 0.99 Ribosomal protein L1p/L10e family
SecE PF00584.20 55 223 - 80 1.7e-14 45.8 0.1 1 1 2.8e-16 2e-14 45.5 0.1 2 54 26 78 25 79 0.91 SecE/Sec61-gamma subunits of protein translocation complex
Chorismate_synt PF01264.21 327 242 - 365 1.2e-144 473.2 0.1 1 1 1.9e-146 1.3e-144 473.0 0.1 1 327 9 354 9 354 0.96 Chorismate synthase
Pept_tRNA_hydro PF01195.19 171 269 - 198 6.2e-55 178.0 0.4 1 1 1.1e-56 7.5e-55 177.7 0.4 2 165 6 186 5 194 0.89 Peptidyl-tRNA hydrolase
AICARFT_IMPCHas PF01808.18 296 284 - 517 7.9e-114 372.2 2.2 1 1 1.8e-115 1.3e-113 371.5 2.1 1 295 134 450 134 451 0.94 AICARFT/IMPCHase bienzyme
If you change the description for GrpE
with GrpE x x
manually in the table output, then you get this error, which shows that this is due to improper parsing of the table:
anvi-script-filter-hmm-hits-table -c P_MARINUS_MIT9301-contigs.db \
--domain-hits-table HMM_OUTPUT/hmm.domtable \
--hmm-source Bacteria_71 \
--report-gene-and-model-coverage \
--min-model-coverage 0.9
HMM profiles .................................: 9 sources have been loaded: Bacteria_71 (71 genes, domain: bacteria), Archaea_76 (76 genes, domain: archaea), Ribosomal_RNA_23S (2
genes, domain: None), Ribosomal_RNA_28S (1 genes, domain: None), Ribosomal_RNA_5S (5 genes, domain: None), Ribosomal_RNA_16S (3
genes, domain: None), Ribosomal_RNA_12S (1 genes, domain: None), Protista_83 (83 genes, domain: eukarya), Ribosomal_RNA_18S (1
genes, domain: None)
Database Path ................................: P_MARINUS_MIT9301-contigs.db
Domtblout Path ...............................: HMM_OUTPUT/hmm.domtable
Config Error: Doesn't look like a --domtblout... anvi'o can't even... Please look at this
error message to find out what happened: Error tokenizing data. C error:
Expected 25 fields in line 3, saw 26
The other instances that work with dom table knows that this is a stupid output file format with a variable number of length for the description column, and uses maxsplit
argument to capture the table structure -- like here in the same file as an example:
with open(self.domtblout) as dom_table:
dom_table_entries = [l.strip('\n').split(maxsplit=len(self.dom_table_columns) - 1) for l in dom_table.readlines()]
FYI :)
I think users will find it valuable to quickly explore the distribution of gene and model coverage values of
anvi-run-hmms --domain-hits-table
. To do this I implementedanvi-script-filter-hmm-hits-table --report-gene-and-model-coverage
which will add two columns to the hmmsearchdomtblout
(model_coverage
andgene_coverage
) and report it ashmm_domtabl_alignment_coverage.tsv
in the same directory as the domtblout.You can test the
--report-gene-and-model-coverage
and see the output file here: