RCSB PDB like Sequence Clustering

soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

https://mmseqs.com

GNU General Public License v3.0


RCSB PDB like Sequence Clustering #452

Open zeynepabali opened 3 years ago

zeynepabali commented 3 years ago

Hi, I am not sure if this is the right place to ask, but is there a set of options that recreates the weekly sequence clustering of the PDB? For example, the clusters in this file: https://cdn.rcsb.org/resources/sequence/clusters/bc-100.out

milot-mirdita commented 3 years ago

AFAIK, the PDB is using an MMseqs2-based workflow, but I don't really know what they are doing. @martin-steinegger added some features at the request of the PDB team; he might be able to put you in contact with the right people.

zeynepabali commented 3 years ago

Thank you very much. I will try to get in contact with him.

martin-steinegger commented 3 years ago

I was in contact quite some time ago with Zukang Feng (https://www.rcsb.org/pages/team) from the PDB. I am actually not sure exactly what parameters they use at the moment. Maybe it would be good to contact him.

However, I remember that they replaced blastclust. blastclust uses connected component clustering, so you need to use --cluster-mode 1 in mmseqs:

mmseqs cluster pdb_seq_pr pdb_seq_pr_clu_s8_maxseqs1000 tmp_clu7 --cov-mode 0 -c 0.90 --min-seq-id 0.3 -s 7 --max-seqs 1000 --cluster-mode 1 -a
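For readers unfamiliar with the term, here is a minimal sketch of what connected component clustering means: any two sequences joined by a chain of pairwise hits end up in the same cluster, even if they do not hit each other directly. This is a toy union-find illustration, not MMseqs2's actual implementation; the function name, the IDs, and the edge list are all made up for the example, and the edges stand in for alignments that already passed the identity and coverage thresholds.

```python
from collections import defaultdict


def connected_component_clusters(ids, edges):
    """Group ids so that any two linked by a chain of edges share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two components of every accepted pairwise hit.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy input: A~B and B~C pass the thresholds, A~C does not.
ids = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C")]
clusters = connected_component_clusters(ids, edges)
print(clusters)  # [['A', 'B', 'C'], ['D']]
```

Note that A and C land in the same cluster only through B. This transitive behaviour is what distinguishes connected component clustering from modes that require every member to hit the cluster representative.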

ZanHP commented 2 years ago

Hello, have you maybe figured this out?

josemduarte commented 2 years ago

This is what is used internally at RCSB PDB (with a few different thresholds for sequence identity):

mmseqs easy-cluster pdb_protein_sequence.fasta-A.gz session --min-seq-id 0.3 -c 0.9 -s 8 --max-seqs 1000 --cluster-mode 1
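Since the command is run with several identity thresholds, a small script can generate one invocation per threshold. The sketch below only builds and prints the command lines; it does not run MMseqs2. The flags are the ones quoted above. The threshold list, the input file name, the output prefixes, and the temporary directory are placeholders chosen for illustration, not values confirmed by RCSB PDB. The positional order (input, output prefix, temporary directory) follows the MMseqs2 usage text for easy-cluster as I understand it.

```python
import shlex


def easy_cluster_command(fasta, prefix, tmp_dir, min_seq_id):
    """Build an mmseqs easy-cluster argument list for one identity threshold."""
    return [
        "mmseqs", "easy-cluster", fasta, prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", "0.9",             # coverage threshold, as in the command above
        "-s", "8",               # sensitivity, as in the command above
        "--max-seqs", "1000",
        "--cluster-mode", "1",   # connected component clustering
    ]


# Assumed thresholds; the actual set used by RCSB PDB is not stated in this thread.
thresholds = [0.3, 0.5, 0.7, 0.9, 1.0]

commands = [
    easy_cluster_command("input.fasta.gz", f"clu_{round(t * 100)}", "tmp", t)
    for t in thresholds
]

for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the commands, each list could be passed to `subprocess.run(cmd, check=True)` on a machine where `mmseqs` is installed.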