Profile-profile database clustering

ksteczk commented 6 years ago

I know that could be quite unpopular clustering strategy, but did you consider clustering based on comparing profiles? That would require adding additional step in cascade clustering for clustering within PSI-BLAST distances - I wondered if you have worked out some strategy on that. I'm going to cluster the whole nr database in that way so I'm exploring the possibilities. I expect that expanding -e option to, let it be 0.05 won't give results comparable to evalue's of e.g. 0.005 but with PSI-BLAST (or profile searches with MMseqs2, which I'm going to use).

martin-steinegger commented 6 years ago

Sorry for the late answer @ksteczk. What do you mean by PSI-BLAST distances?

The following workflow is a profile consensus clustering.

As a first step I would remove redundancy from your set:

 mmseqs linclust NR NR_95_90_clu /tmp -c 0.95 --min-seq-id 0.95
 mmseqs createsubdb  NR_95_90_clu  NR NR_95_90_repseq

Create profiles by iterative search:

 mmseqs search NR_95_90_repseq NR_95_90_repseq NR_res /tmp --num-iterations 3 
 mmseqs result2profile NR_95_90_repseq NR_95_90_repseq NR_res NR_profile

Cluster profiles:

 mmseqs search NR_profile NR_consensus NR_pp_res /tmp  # Add your cluster criteria here
 mmseqs clust NR_profile NR_pp_res NR_pp_clu

ksteczk commented 6 years ago

Thank you for the workflow. I'll try that. For now as I explore the software a little bit (I'm amazed with its speed and scalability) I cluster smaller database.

A part of my workflow is generating profiles for pdb_05_rep database based on nr70_rep database:

mmseqs search pdb_05_rep nr70_rep pdb_05_res tmp --num-iterations 3 
mmseqs result2profile pdb_05_rep nr70_rep pdb_05_res pdb_05_profile

The second command however blasts my bash with these warnings:

Warning in /opt/install/MMseq2/src/alignment/MsaFilter.cpp:244: filter: Filtering removed all sequences in alignment. Inserting back first sequence.

Is that OK? Or something went wrong with the procedure?

And further I run:

mmseqs search pdb_05_profile pdb_05_rep pdb_05_pp_res tmp  --threads 110 -e 0.05
mmseqs clust pdb_05_profile pdb_05_pp_res pdb_05_pp_clu  --threads 110 --cluster-mode 1 --similarity-type 1

and the second procedure outputs many warnings like:

ERROR: Sequence ! ERROR: Sequence 19899 does not contain any sequence for key 933920594 does not contain any sequence for key 21608! ERROR: Sequence 20595 does not contain any sequence for key 21655! ERROR: Sequence 20596 does not contain any sequence for key 21670! ERROR: Sequence 2059723896 does not contain any sequence for key ! ERROR: Sequence 23698 does not contain any sequence for key 7521! ERROR: Sequence 23699 does not contain any sequence for key 7526! does not contain any sequence for key 21683!

to give only few.

martin-steinegger commented 6 years ago

I assume the message comes from alignments that find --max-seqs similiar sequences. So the filter would remove all sequences but then add back the query sequence. I would recommend to work on a redundancy reduced database. E.g. Uniclust90 (uniclust.mmseqs.com).

I forgot, you need to add the flat --add-self-matches to your search.

 mmseqs search pdb_05_profile pdb_05_rep pdb_05_pp_res tmp  --threads 110 -e 0.05 --add-self-matches
 mmseqs clust pdb_05_profile pdb_05_pp_res pdb_05_pp_clu  --threads 110 --cluster-mode 1 --similarity-type 1

Do you use MMseqs2 on a computer with 110 core?

ksteczk commented 6 years ago

Thanks for the hint. I'll try that.

I'm using redundancy reduced database - pdb_05 is PDB Seqres database clustered with e-value 0.05, and the profiles are generated for pdb_05 sequences on nr70 database, which is nr clustered to 70% sequence identity.

As to computer - yes, I'm using 120 cores single machine (it has four Xeons E7-4890 v2). The software scales up amazingly. Now I'm computing 6it profiles for 31M sequences and that will be accomplished in few days (prefiltering scores calculation took 65 h 52 m 33s). I see that when calculating S-W alignments the soft crunches the data using all given threads but after finishing million of sequences (or leaning towards finishing that million) it runs gradually on less and lass cores (perhaps some cores were given more time consuming alignments so the program might estimate times and balance threads more efficiently, but this is very minor and would be noticed for users working on big databases only), which might be a speed bottleneck for big dataset such as nr database processed on big machines.

In few days I'll start tests also with MPI run, so that mmseqs2 will be ran on 1600 CPUs on cluster (here I wish the software had used GPUs as the cluster I'm using has additionally some Firepros to use - I write it smiling as MMseqs2 is light years ahead of cd-hit, which I was using to date). I'll give you a feedback on the speed soon then.

martin-steinegger commented 6 years ago

Thanks for your detailed report! The issue you are pointing out is indeed a barrier. I noticed this myself. We remap the mmaped prefilter result file every 1 million queries.

I did this since I noticed a huge slow down if the memory of the machine is not enough to hold the whole alignment file in memory. It seems that the kernel paging is quite slow.

I maybe should add a check if the machine has enough memory to keep the whole file in memory.

ksteczk commented 6 years ago

I maybe should add a check if the machine has enough memory to keep the whole file in memory.

That would be awesome! - I have a machine with 6TB of RAM where I run this clustering so it could handle all of that in memory.

milot-mirdita commented 6 years ago

Martin's fix in db458278e0a42beefe93cb2df4eebfd13cf67d1c should fix the slowdown issue on your server.

martin-steinegger commented 6 years ago

I close this issue now. Please reopen if you still have issues with the profile-profile clustering.

soedinglab / MMseqs2

Profile-profile database clustering #66