Hey @milot-mirdita,

Thanks for a great resource for finding remote homologs! I am interested in finding homologs of an E. coli gene in other Proteobacteria. The literature shows that this gene is conserved only in closely related strains, so I am using HH-suite to look for remote homologs in other Proteobacteria samples.

I would like some advice on whether the way I am using HH-suite makes sense, and on how to tell whether the output contains false positives or false negatives.

1. I run `hhblits` to collect all sequences similar to the E. coli gene of interest from the UniRef30 database (the successor to Uniclust30):
   `hhblits -cpu 4 -i ${IN_DIR}/ytfI_ecoli.fasta -d ${DB2}/UniRef30_2023_02 -oa3m ${OUT_DIR}/ytfI_ECOLI_uniclust.a3m -all`
   The idea behind step 1 is to gather remote homologs of the E. coli gene first, because running hmmsearch with only the single E. coli gene gives no results.
2. The resulting `.a3m` file was converted back to a FASTA alignment using the `reformat.pl` script.
3. The `hmmbuild` command was used to build a profile HMM from this MSA.
4. I run `hmmsearch` with the profile from step 3 against the Proteobacteria protein sequences.

Unfortunately, this does not give a "significant" hit, i.e. no hit has an E-value below 1e-3.
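
For reference, steps 2 to 4 can be sketched roughly as below. This is only a minimal sketch of how I understand the workflow: every file name except the `.a3m` from step 1 is a hypothetical placeholder, and the `run` wrapper and its `DRY_RUN` switch are my own helper, not part of HH-suite or HMMER. By default it only prints the commands; setting `DRY_RUN=0` executes them.

```shell
#!/bin/sh
# Sketch of steps 2-4. Paths other than the .a3m are hypothetical placeholders.
set -eu

A3M="${OUT_DIR:-.}/ytfI_ECOLI_uniclust.a3m"          # hhblits output from step 1
MSA="${OUT_DIR:-.}/ytfI_ECOLI_uniclust.fasta"        # hypothetical output name
HMM="${OUT_DIR:-.}/ytfI_ECOLI_uniclust.hmm"          # hypothetical output name
HITS="${OUT_DIR:-.}/ytfI_hits.tbl"                   # hypothetical output name
TARGETS="${TARGETS:-proteobacteria_proteins.fasta}"  # hypothetical target proteins

# Helper (not part of any tool): print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

pipeline() {
    # Step 2: convert the A3M alignment to aligned FASTA.
    run reformat.pl a3m fas "$A3M" "$MSA"
    # Step 3: build a profile HMM from the alignment.
    run hmmbuild "$HMM" "$MSA"
    # Step 4: search the profile against the target proteins and write a
    # tabular hit list. No -E is given, so hmmsearch keeps its default
    # reporting threshold and borderline hits stay visible.
    run hmmsearch --cpu 4 --tblout "$HITS" "$HMM" "$TARGETS"
}

pipeline
```

Writing the hits with `--tblout` keeps one line per target sequence, which makes it easier to look at hits just above the 1e-3 cutoff instead of only seeing an empty result.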

I am also comparing the Proteobacteria sequences with the E. coli gene of interest using Foldseek's `easy-search` command. There, too, I find no "significant" hit, i.e. no hit has an E-value below 1e-3.

So I would like to understand what could be considered a reasonable remote homolog of this gene, and whether the two methods I am using make sense.

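
To make the "E-value below 1e-3" criterion concrete, here is a small sketch of the filter I have in mind. It assumes the E-value sits in column 5 of an `hmmsearch --tblout` file (full-sequence E-value) and in column 11 of Foldseek's default tabular output; the function name `filter_by_evalue` and the file names in the usage comments are hypothetical.

```shell
#!/bin/sh
# filter_by_evalue COLUMN THRESHOLD FILE
# Print the lines of FILE whose value in COLUMN is numerically below THRESHOLD.
filter_by_evalue() {
    awk -v col="$1" -v thr="$2" '
        /^#/ { next }                     # skip comment lines (hmmsearch tables)
        ($col + 0) < (thr + 0) { print }  # force a numeric comparison
    ' "$3"
}

# Usage (hypothetical file names):
#   filter_by_evalue 5  1e-3 ytfI_hits.tbl        # hmmsearch --tblout
#   filter_by_evalue 11 1e-3 foldseek_result.m8   # Foldseek default output
```

Rerunning the same filter with a looser threshold (for example 1 instead of 1e-3) lists the borderline candidates that the strict cutoff hides.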
Regards, Jigyasa