steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

easy-search hangs on scop40 test #257

Open rcedgar opened 3 months ago

rcedgar commented 3 months ago

I'm trying to implement the SCOP40 test using the latest foldseek. The createdb command completes; the easy-search command runs for a while but then hangs indefinitely. Advice is welcome on the best way to implement this for measuring foldseek speed and accuracy. Thanks for any help!

# foldseek Version: 915ef7ddce1bd77080208eff8a434c0985ae7492

foldseek createdb \
  ../scop40pdb/pdb \
  scop40

/bin/time -v -o foldseek.time \
foldseek easy-search \
  ../scop40pdb/pdb \
  scop40 \
  --format-output "query,target,pident,evalue,alntmscore" \
  hits.txt

rcedgar commented 3 months ago

Update: I was able to work around the problem by removing alntmscore from the --format-output option. I'm guessing that computing the TM alignment is much slower than the S-W 3Di alignment and is not needed to calculate the E-value.
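
A minimal sketch of that workaround, assuming the same paths as the original command. The `tmp` scratch directory argument and the `DRY_RUN` switch are additions for illustration and were not in the original post; with `DRY_RUN=1` the command is only printed, not executed.

```shell
#!/bin/sh
# Sketch only: the same search as above, without the alntmscore column.
# Assumptions (not from the original post): the "tmp" scratch directory
# and the DRY_RUN switch, which prints the command instead of running it.

run_search() {
  fields="query,target,pident,evalue"   # alntmscore removed
  set -- foldseek easy-search ../scop40pdb/pdb scop40 hits.txt tmp \
    --format-output "$fields"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"        # show what would be run
  else
    "$@"             # run the search
  fi
}
```

Calling `DRY_RUN=1 run_search` shows the command line; calling `run_search` on its own runs it and can be wrapped in `/bin/time -v` as in the original post.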

12047019 commented 3 months ago

Where did you get the SCOP40 (or SCOP35) files that you pass to createdb? I need to run a SCOP comparison against my own sets of protein structures, but I couldn't find the files to build the database.

rcedgar commented 3 months ago

https://wwwuser.gwdg.de/~compbiol/foldseek

martin-steinegger commented 3 months ago

I don't recommend using this; it's quite an old version. It makes sense to use the latest version for annotation or benchmarking: https://scop.berkeley.edu/

rcedgar commented 3 months ago

Noted, thanks. I will do that for anything written up, but for preliminary work it's helpful that the expensive DALI and TMalign computations are included in the downloads for the foldseek paper.

12047019 commented 3 months ago

Thanks @rcedgar @martin-steinegger, got it. It would be very kind of you to preassemble it and add it like the other databases in foldseek, @martin-steinegger.

rcedgar commented 3 months ago

Hi @martin-steinegger, with --format-output "query,target,evalue" foldseek completes SCOP40 quickly, but the sensitivity is lower than reported in the paper. Presumably I need to tweak some options such as --max-seqs and --exhaustive-search, but I don't see the command line in Methods or Supp Data. What are the recommended options for comparative validation? Thanks!
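
For trying such options side by side, a small wrapper along these lines might help. This is a sketch under stated assumptions: the `tmp` directory, the `DRY_RUN` switch, the output file names, and the option values in the usage note are all illustrative and are not the authors' recommended settings.

```shell
#!/bin/sh
# Sketch only: run the same SCOP40 search with extra options appended.
# Usage: search_with RESULT_FILE [extra foldseek options...]
# Assumptions (not from the thread): the "tmp" scratch directory and the
# DRY_RUN switch, which prints the command instead of running it.

search_with() {
  result="$1"
  shift
  set -- foldseek easy-search ../scop40pdb/pdb scop40 "$result" tmp \
    --format-output "query,target,evalue" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `search_with hits.maxseqs.txt --max-seqs 2000` and `search_with hits.exhaustive.txt --exhaustive-search 1` would write two result files that can then be scored separately; the value 2000 and the `1` argument are placeholders to check against `foldseek easy-search -h`.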

martin-steinegger commented 3 months ago

All of our benchmarking scripts are here: https://github.com/steineggerlab/foldseek-analysis

rcedgar commented 3 months ago

Much better! Accuracy seems to be getting close to DALI now. Is there any explanation of the improvements in the algorithm?

12047019 commented 1 week ago

Hello @rcedgar, regarding https://wwwuser.gwdg.de/~compbiol/foldseek

Can you please tell me the version of this scop40? Is it SCOPe 2.08?

rcedgar commented 1 week ago

Hello @12047019, sorry, I don't know. This is a question for the foldseek authors. I couldn't figure out the exact version myself, so I had to use the scop_lookup.fix.tsv file in their repo to assign families to domains.
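
A rough sketch of that family assignment with awk. It assumes the lookup file has two tab-separated columns (domain identifier, then SCOP family identifier) and that the first two columns of the hits file are query and target identifiers spelled exactly as in the lookup; neither assumption is confirmed in this thread, and the function name `annotate_hits` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: append query family, target family and a same/different
# label to every hit line.
# Usage: annotate_hits LOOKUP_TSV HITS_TSV
# Assumed layout: LOOKUP_TSV = domain<TAB>family,
#                 HITS_TSV   = query<TAB>target<TAB>...

annotate_hits() {
  awk -F '\t' -v OFS='\t' '
    NR == FNR { fam[$1] = $2; next }   # first file: load domain -> family
    {
      q = ($1 in fam) ? fam[$1] : "NA" # family of the query, if known
      t = ($2 in fam) ? fam[$2] : "NA" # family of the target, if known
      label = "different"
      if (q != "NA" && q == t) label = "same_family"
      print $0, q, t, label
    }
  ' "$1" "$2"
}
```

Something like `annotate_hits scop_lookup.fix.tsv hits.txt > hits.annotated.tsv` would then give a table from which true and false positives can be counted. Identifiers missing from the lookup are marked `NA` and never count as a family match.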

milot-mirdita commented 5 days ago

Clustering SCOPe 2.01 at 40% sequence identity yielded 11,211 non-redundant protein sequences (SCOPe40).

From the paper.