soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

Using HH-suite to find an E. coli gene in other Proteobacteria <advice on the way I am using the method> #369

Open Jigyasa3 opened 9 months ago

Jigyasa3 commented 9 months ago

Hey @milot-mirdita ,

Thanks for a great resource for finding remote homologs! I am interested in finding an E. coli gene in other Proteobacteria. The literature shows that this gene is conserved only in closely related strains, so I am using HH-suite to find remote homologs of this gene in other Proteobacteria samples.

I would like some advice on whether the way I am using HH-suite makes sense, and on how to tell whether the output is a false positive or a false negative.

  1. I ran hhblits to collect all sequences similar to the E. coli gene of interest from the UniRef30 database (the successor of Uniclust30): hhblits -cpu 4 -i ${IN_DIR}/ytfI_ecoli.fasta -d ${DB2}/UniRef30_2023_02 -oa3m ${OUT_DIR}/ytfI_ECOLI_uniclust.a3m -all

The idea behind step 1 is to build an alignment of remote homologs of the E. coli gene of interest, because an hmmsearch that uses only the single E. coli gene doesn't give any results!

  2. The resulting .a3m file was converted to an aligned FASTA file using the reformat.pl script.
  3. The hmmbuild command was used to build a profile HMM from the MSA.
  4. I ran hmmsearch with the profile HMM from step 3 against the Proteobacteria protein sequences.
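For reference, the four steps above can be sketched as one script. This is only a minimal sketch under assumptions: the default directories, the target file name `proteobacteria_proteins.fasta`, the output file names, and the `DRY_RUN` switch are all placeholders I made up for illustration, and only the hhblits call is taken verbatim from step 1. With `DRY_RUN=1` (the default here) the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of steps 1-4. All paths are placeholders (assumptions); adjust to your setup.
IN_DIR=${IN_DIR:-input}
OUT_DIR=${OUT_DIR:-output}
DB2=${DB2:-databases}
TARGETS=${TARGETS:-proteobacteria_proteins.fasta}   # hypothetical target FASTA
DRY_RUN=${DRY_RUN:-1}                               # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

find_homologs() {
    a3m="$OUT_DIR/ytfI_ECOLI_uniclust.a3m"
    fas="$OUT_DIR/ytfI_ECOLI_uniclust.fas"
    hmm="$OUT_DIR/ytfI_ECOLI_uniclust.hmm"

    # Step 1: build an MSA of the query and its UniRef30 hits.
    run hhblits -cpu 4 -i "$IN_DIR/ytfI_ecoli.fasta" -d "$DB2/UniRef30_2023_02" -oa3m "$a3m" -all
    # Step 2: convert the A3M alignment to aligned FASTA.
    run reformat.pl a3m fas "$a3m" "$fas"
    # Step 3: build a profile HMM from the alignment.
    run hmmbuild "$hmm" "$fas"
    # Step 4: search the profile against the target proteins;
    # the per-sequence hit table is written via --tblout.
    run hmmsearch --cpu 4 -o "$OUT_DIR/hmmsearch.out" --tblout "$OUT_DIR/hits.tbl" "$hmm" "$TARGETS"
}

find_homologs
```

Setting `DRY_RUN=0` runs the commands for real, provided hhblits, reformat.pl, hmmbuild and hmmsearch are on the `PATH`.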

Unfortunately, this does not give any "significant" hit, i.e. no hit has an E-value below 1e-3.
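To make the significance check explicit, here is a small sketch of how the hits could be filtered by E-value. It assumes the search was run with hmmsearch's `--tblout` option, where comment lines start with `#` and the fifth column is the full-sequence E-value; the function name `filter_hits` and the file name `hits.tbl` are hypothetical.

```shell
#!/bin/sh
# Print target name and full-sequence E-value for every hit below a cutoff.
# $1: hmmsearch --tblout file, $2: E-value cutoff (defaults to 1e-3).
filter_hits() {
    awk -v max="${2:-1e-3}" '
        /^#/ { next }                                  # skip comment/header lines
        NF >= 5 && ($5 + 0) < (max + 0) { print $1, $5 }
    ' "$1"
}

# Usage (hypothetical file name):
#   filter_hits hits.tbl 1e-3
```

An empty result means that no target sequence passed the cutoff, which matches what I am seeing.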

I am also comparing the Proteobacteria sequences with the E. coli gene of interest using Foldseek's easy-search command. There, too, I find no "significant" hit, i.e. no hit with an E-value below 1e-3.
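The Foldseek comparison can be sketched in the same way. Foldseek compares structures, so this assumes that structure files (experimental or predicted) exist for both the E. coli protein and the Proteobacteria proteins. It also assumes the default tabular output of `easy-search`, where the second column is the target and the eleventh column is the E-value. The file names and the function name `foldseek_hits` are hypothetical.

```shell
#!/bin/sh
# The search itself, shown as a comment because the paths are placeholders:
#   foldseek easy-search ytfI_ecoli.pdb proteobacteria_structures/ aln.m8 tmp

# Print target and E-value for every alignment below a cutoff.
# $1: tab-separated easy-search result file, $2: E-value cutoff (defaults to 1e-3).
foldseek_hits() {
    awk -F '\t' -v max="${2:-1e-3}" '
        NF >= 12 && ($11 + 0) < (max + 0) { print $2, $11 }
    ' "$1"
}

# Usage (hypothetical file name):
#   foldseek_hits aln.m8 1e-3
```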

So I would like to understand what could be considered a reasonable remote homolog of the gene, and whether the two methods I am using make sense.

Regards, Jigyasa