steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0
693 stars 91 forks

An observed discrepancy between alignment-type 3Di+AA / 3Di #289

Open Wangchentong opened 2 weeks ago

Wangchentong commented 2 weeks ago

Expected Behavior

Thanks for your amazing tool! I am clustering a high-confidence subset of AFDB with two alignment types, 3Di+AA and 3Di. My intuition is that 3Di should give more non-singleton clusters than 3Di+AA, because very diverse sequences that share the same structure would be assigned to the same cluster in 3Di mode but to different clusters in 3Di+AA mode.

Current Behavior

I tested the cluster command with the two alignment types on the same database (a subset containing 4 million AFDB structures): --alignment-type 0 (3Di) gives me 470715 singletons, and --alignment-type 1 (3Di) gives me 759500 singletons.
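For anyone wanting to reproduce singleton counts like these, here is a minimal sketch. It assumes the cluster result has first been exported to a two-column, tab-separated file (representative, member), for example with foldseek createtsv. The function name and the toy data are illustrative and not part of Foldseek.

```python
from collections import Counter
import io


def count_singletons(lines):
    """Return (singleton clusters, total clusters) for a cluster TSV.

    Assumes each non-empty line is "<representative>\t<member>", with
    one line per cluster member (the representative lists itself too).
    """
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        representative = line.split("\t", 1)[0]
        sizes[representative] += 1
    singletons = sum(1 for n in sizes.values() if n == 1)
    return singletons, len(sizes)


# Toy input: cluster A has 2 members, C has 1, D has 3.
toy = io.StringIO("A\tA\nA\tB\nC\tC\nD\tD\nD\tE\nD\tF\n")
singletons, total = count_singletons(toy)
print(f"{singletons} singleton(s) out of {total} cluster(s)")
```

For a real run, the same function can be given an open file handle instead of the in-memory toy data.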

This is the cluster command I use: foldseek cluster afdb50_new afdb50_new_clust_v2 tmp --remove-tmp-files --alignment-type 0 (or --alignment-type 2)

Is this the expected result? It looks like clustering based solely on 3Di tokens works worse than 3Di+AA. What is your suggestion if I want to cluster on structure without AA tokens?

Any help would be appreciated!

milot-mirdita commented 2 weeks ago

Not using the amino-acid information will likely result in a less biologically meaningful result. Foldseek was optimized for remote homology detection, and for this I would recommend sticking to 3Di+AA. We were thinking of dropping the 3Di-only mode completely, as we don't think there are many applications where it's really meaningful.

I don't know to what end you are clustering, but I would recommend focusing more on your clustering criteria, such as the coverage, sequence identity, and E-value at which you still accept cluster members.
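As a rough illustration of what tuning these criteria could look like, the sketch below assembles a foldseek cluster command line with explicit coverage, E-value, and sequence-identity thresholds. It assumes that foldseek cluster accepts the MMseqs2-style options -c, --cov-mode, -e, and --min-seq-id (worth confirming with foldseek cluster -h). The helper name and all threshold values are placeholders chosen for illustration, not recommendations from this thread.

```python
import shlex


def build_cluster_cmd(db, out, tmp, *, alignment_type=2, coverage=0.9,
                      cov_mode=0, evalue=0.01, min_seq_id=0.0):
    """Assemble an argument list for `foldseek cluster`.

    All default thresholds are illustrative placeholders; choose values
    that match your own definition of "same cluster".
    """
    return [
        "foldseek", "cluster", db, out, tmp,
        "--alignment-type", str(alignment_type),  # 2 = 3Di+AA, per the thread
        "-c", str(coverage),             # minimum alignment coverage
        "--cov-mode", str(cov_mode),     # how coverage is measured
        "-e", str(evalue),               # maximum E-value to accept a member
        "--min-seq-id", str(min_seq_id), # minimum sequence identity
    ]


cmd = build_cluster_cmd("afdb50_new", "afdb50_new_clust_v2", "tmp")
print(shlex.join(cmd))
# To actually run it (requires foldseek on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting issues and makes it easy to run the same clustering with several threshold combinations and compare the resulting cluster counts.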

Wangchentong commented 2 weeks ago

Hi Milot @milot-mirdita, thanks for your quick response.

My purpose in clustering is to collect a dataset of highly diverse structures for training deep learning models, so I would like to consider only structural similarity rather than sequence identity.

I will follow your advice and use the 3Di+AA alignment type, but I would also like to know whether there are any other options that would better serve this requirement.