steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0
693 stars, 91 forks

Easy search pulls same accession ID multiple times? #279

Open bregman3 opened 1 month ago

bregman3 commented 1 month ago

Hello, I used the easy-search function against the AlphaFold database with the exhaustive search option, and I noticed my results were greatly inflated: the same protein ID was hit multiple times. My query structure was a PDB file containing a single chain. The hit was also a single chain, and I checked that it occurs only once in both the AFDB and the UniProt DB. I also noticed that my downloaded file for the afdb (I used the GitHub) is significantly smaller than what the AFDB website says that database should be. Does anyone else have these issues?

milot-mirdita commented 1 month ago

Could you please post the full command line call and terminal output? Additionally, please post an excerpt of the result file. This doesn't sound like something that should happen without explicitly requesting some parameters (e.g. --alt-ali).

bregman3 commented 4 weeks ago

Hello, here is the command line call:

foldseek easy-search --exhaustive-search --max-seqs 10000 5sxy.pdb $BIODB/afdb/afdb aln4 FoldSeek

I realize --max-seqs is not useful here because the exhaustive search skips the prefilter. My input is a single-chain PDB. I'm a little confused because the output, which I've copied and pasted into an Excel sheet so I could highlight the repeated hit, also lists different models for my query protein despite the single input.

[Screenshot: FoldSeek_results_screenshot]
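The same check can be done without a spreadsheet. The following is a minimal sketch, assuming the result file is tab-separated with the query ID in the first column and the target ID in the second. The function name and the sample IDs are hypothetical and not taken from the actual result file.

```python
from collections import defaultdict


def queries_per_target(lines):
    """Map each target ID to the set of query IDs that hit it.

    Assumes tab-separated rows with the query ID in column 1
    and the target ID in column 2.
    """
    hits = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hits[fields[1]].add(fields[0])
    return dict(hits)


# Hypothetical excerpt for illustration only; real rows have more columns.
sample = [
    "query_MODEL_1_A\ttarget_one\t0.30",
    "query_MODEL_2_A\ttarget_one\t0.30",
    "query_MODEL_1_A\ttarget_two\t0.25",
]

for target, queries in sorted(queries_per_target(sample).items()):
    print(target, len(queries), sorted(queries))
```

For a real file, the lines could come from `open("aln4")` instead of `sample`. A target that is reported by several distinct query IDs points to multiple queries rather than duplicated database entries.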

milot-mirdita commented 4 weeks ago

That's an NMR structure. Each model becomes another query, which likely results in exactly the same result list for each query.

NMR structures are a bit of a footgun with foldseek.
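One possible workaround is to reduce the NMR file to its first model before searching. The sketch below assumes a standard PDB-format file in which each model is wrapped in MODEL and ENDMDL records. The function name `keep_first_model` and the output file name are hypothetical, and this is not a Foldseek feature, just a preprocessing step.

```python
def keep_first_model(pdb_text: str) -> str:
    """Return the PDB text with every model after the first removed.

    Lines outside MODEL/ENDMDL blocks (header, trailing END, etc.) are kept.
    A file without MODEL records is returned unchanged in content.
    """
    out = []
    models_seen = 0
    inside_later_model = False
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # PDB record name occupies columns 1-6
        if record == "MODEL":
            models_seen += 1
            inside_later_model = models_seen > 1
        if not inside_later_model:
            out.append(line)
        if record == "ENDMDL":
            inside_later_model = False
    return "\n".join(out) + "\n"


# Tiny made-up two-model example, not real coordinates from 5sxy.
example = (
    "HEADER    EXAMPLE\n"
    "MODEL        1\n"
    "ATOM      1  CA  ALA A   1       0.000   0.000   0.000  1.00  0.00           C\n"
    "ENDMDL\n"
    "MODEL        2\n"
    "ATOM      1  CA  ALA A   1       1.000   1.000   1.000  1.00  0.00           C\n"
    "ENDMDL\n"
    "END\n"
)

single = keep_first_model(example)
print(single)
```

Writing the result to a new file (for example a hypothetical `5sxy_model1.pdb`) and passing that file as the query in the same easy-search call should then give one result list instead of one per model.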

bregman3 commented 3 weeks ago

oh okay thank you so much!