soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
527 stars, 132 forks

Eukaryote only sequence database #168

Open BrennicaMarlow opened 5 years ago

BrennicaMarlow commented 5 years ago

I want to use hhblits to build a multiple sequence alignment using only eukaryote sequences. Is there a way to get only the eukaryote sequences from the uniclust database?

danyilgrybchuk commented 4 years ago

> I want to use hhblits to make a multiple sequence alignment using only eukaryote sequences. Is there a way to get only the eukaryote sequences from the uniclust database.

Hi BrennicaMarlow and all who are reading this,

Actually, I want to do the same with proteins from dsDNA viruses. The best (partial) answer I have so far is to use the ffindex_get utility (it ships with hhsuite-3.2.0) to look up entries in UniRef30_2020_02_a3m.ffdata by their index and retrieve the alignments that correspond to a specific organism. Something like this:

$ ffindex_get UniRef30_2020_02_a3m.ffdata UniRef30_2020_02_a3m.ffindex 110848668 110849024 110850663 110850770 11085238

Then recalculate the HMMs and context states for these sub-alignments with hhmake and cstranslate, respectively, and otherwise follow the User Guide instructions for building customized databases from MSAs.

The problem, however, is that there is no direct correspondence between an entry in the ffindex file (the 110848668 110849024 110850663 110850770 11085238 in the command above) and a taxonomic group. I imagine it is possible to write a script that establishes this correspondence: if you check the sequence headers in UniRef30_2020_02_a3m.ffdata, you will notice that they contain TaxID="NCBI Taxonomy ID". But maybe the developers can suggest a better way to solve this problem.
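A minimal sketch of such a script in Python, under several assumptions that are mine and not confirmed by the developers: each ffindex line is `name<TAB>offset<TAB>length`, each entry in the ffdata file is plain-text A3M terminated by a null byte, and headers carry the taxon as `TaxID=<number>`. The function names (`read_ffindex`, `entry_taxids`, `select_entries`) are hypothetical helpers, not part of HH-suite.

```python
import re

# Assumption: headers contain the NCBI taxon as "TaxID=<digits>".
TAXID_RE = re.compile(rb"TaxID=(\d+)")


def read_ffindex(index_path):
    """Yield (name, offset, length) for each line of an ffindex file."""
    with open(index_path, "r") as handle:
        for line in handle:
            if not line.strip():
                continue
            name, offset, length = line.rstrip("\n").split("\t")
            yield name, int(offset), int(length)


def entry_taxids(data_path, index_path):
    """Map each entry name to the set of TaxIDs found in its header lines."""
    result = {}
    with open(data_path, "rb") as data:
        for name, offset, length in read_ffindex(index_path):
            data.seek(offset)
            # Entries are assumed to end with a null byte; strip it.
            entry = data.read(length).rstrip(b"\0")
            taxids = set()
            for line in entry.split(b"\n"):
                if line.startswith(b">"):
                    match = TAXID_RE.search(line)
                    if match:
                        taxids.add(int(match.group(1)))
            result[name] = taxids
    return result


def select_entries(taxids_by_entry, wanted_taxids):
    """Return names of entries with at least one sequence from a wanted taxon."""
    return [
        name
        for name, taxids in taxids_by_entry.items()
        if taxids & wanted_taxids
    ]
```

The names returned by `select_entries` could then be passed to ffindex_get as in the command above. The set `wanted_taxids` has to be built separately, for example from the NCBI taxonomy dump by collecting every TaxID below Eukaryota (TaxID 2759); that step is not shown here. Note that this selects whole clusters that contain at least one matching sequence, so the retrieved alignments may still include sequences from other taxa.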

Best regards, Danyil