steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Can I search for protein structures similar to those in my query database and pick out representatives based on clustering? #204

Closed by dengvicki 11 months ago

dengvicki commented 11 months ago

I want to find protein structures similar to those in my query database, keep only the hits with a TM-score above 0.6, and then find cluster representatives for each protein in my query database.

I ran the following commands:

foldseek createdb example/ targetDB
foldseek createdb example/ queryDB
foldseek search queryDB targetDB aln tmpFolder -a  # Is there a way to filter results here by E-value?
foldseek aln2tmscore queryDB targetDB aln aln_tmscore  # Is there a way to filter results here by TM-score?
foldseek createtsv queryDB targetDB aln_tmscore aln_tmscore.tsv
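
For illustration, one way to apply a TM-score cutoff after the fact is to filter the exported TSV. This is only a minimal sketch: `filter_min_score` is a hypothetical helper, not part of Foldseek, and the column that holds the TM-score in `aln_tmscore.tsv` is an assumption that should be checked first (for example with `head aln_tmscore.tsv`).

```shell
# Hypothetical helper (not a Foldseek command): keep only the rows whose
# value in column COL is greater than or equal to MIN.
# Reads tab-separated rows from stdin and writes the kept rows to stdout.
# Usage: filter_min_score COL MIN < input.tsv > output.tsv
filter_min_score() {
  # "+ 0" forces a numeric comparison, so values such as 1.5e-3 are
  # compared as numbers rather than as strings.
  awk -F'\t' -v col="$1" -v min="$2" '($col + 0) >= (min + 0)'
}
```

Assuming the TM-score turned out to be in the third column, the call would look like `filter_min_score 3 0.6 < aln_tmscore.tsv > aln_tmscore.filtered.tsv`, where the output file name is also just a placeholder.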

I was going to use the following for clustering:

foldseek createdb example/ db
foldseek search db db aln tmpFolder -c 0.8
foldseek clust db aln clu  # Is the db here the queryDB or the targetDB?
foldseek createtsv db db clu clu.tsv

I see that this workflow runs a new alignment search and keeps only the hits with at least 80% coverage. This is close to what I'm looking for, except that instead of a coverage cutoff I want an E-value or TM-score cutoff.
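
To make the goal of "a representative for each protein" concrete, here is a rough sketch of how `clu.tsv` could be read once it exists. It assumes the file has two tab-separated columns, representative first and member second. Both function names are hypothetical helpers, not Foldseek commands.

```shell
# Hypothetical helper: print the representative of the cluster that
# contains MEMBER. Reads a two-column TSV (representative <TAB> member)
# from stdin. Usage: representative_of MEMBER < clu.tsv
representative_of() {
  # Appending "" forces a string comparison, so numeric-looking IDs
  # are matched exactly.
  awk -F'\t' -v m="$1" '($2 "") == (m "") { print $1; exit }'
}

# Hypothetical helper: print each representative with the number of
# members in its cluster, sorted by representative name.
# Usage: cluster_sizes < clu.tsv
cluster_sizes() {
  awk -F'\t' '{ n[$1]++ } END { for (r in n) print r "\t" n[r] }' | sort
}
```

For example, `representative_of my_protein < clu.tsv` would print the representative of the cluster containing the placeholder ID `my_protein`, or nothing if that ID is not in the file.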

Problems I'm running into: