soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

How should I efficiently cluster multiple DBs? #519

Open daron-m-standley opened 2 years ago

daron-m-standley commented 2 years ago

Hi, I am in the process of building a searchable database of antibody and T cell receptor repertoires (here, a "repertoire" is a set of antibody or TCR sequences from a single blood sample from a single donor). Searches are performed using mmseqs, with each repertoire stored as an mmseqs DB. So far, the search function is working nicely. Next, I'd like to implement a clustering option. My idea is to let the user select a set of repertoire DBs and cluster them together using linclust. My questions are:

  1. Can either mergedbs or concatdbs be used to combine a set of DBs for clustering by linclust?
  2. Is there a more efficient strategy than combining the individual DBs?

Each DB typically holds tens of thousands of sequences or more, with a typical length of ~40 amino acids (i.e. just the three CDR regions concatenated, not the full-length protein). Thanks in advance for your help!
-Daron

khb7840 commented 1 year ago

I had a similar question when choosing between concatdbs and mergedbs. The difference I found is that mergedbs merges the entries from the DBs by default.
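
To make the distinction in the comment above concrete, here is a toy Python sketch. It is not MMseqs2 code and does not call mmseqs. It assumes a "DB" can be modelled as a dict from integer keys to entry text, that concatenation appends the second DB's entries under fresh keys, and that merging joins entries that share a key. The helper names `concat_dbs` and `merge_dbs` and the example sequences are hypothetical, chosen only for illustration.

```python
# Toy model only: a "DB" is a dict mapping integer keys to entry text.
# This is NOT MMseqs2 code; it only mimics the behaviour described above.


def concat_dbs(db_a, db_b):
    """Append db_b's entries after db_a's, assigning them fresh keys."""
    out = dict(db_a)
    next_key = max(out, default=-1) + 1
    for _, entry in sorted(db_b.items()):
        out[next_key] = entry
        next_key += 1
    return out


def merge_dbs(db_a, db_b):
    """Join entries that share a key into one combined entry."""
    out = {}
    for key in sorted(set(db_a) | set(db_b)):
        parts = [db[key] for db in (db_a, db_b) if key in db]
        out[key] = "".join(parts)
    return out


# Two hypothetical repertoires that both number their entries from 0.
rep1 = {0: "CARDYW\n", 1: "CASSLG\n"}
rep2 = {0: "CAKGGY\n", 1: "CSARDR\n"}

concatenated = concat_dbs(rep1, rep2)  # four separate entries, keys 0..3
merged = merge_dbs(rep1, rep2)         # two entries, each holding two sequences

print(len(concatenated), len(merged))
```

Under these assumptions, concatenation keeps every sequence as its own entry, which is the shape a clustering input needs, while merging by key would fuse unrelated sequences from different repertoires that happen to share a key. Check `mmseqs concatdbs -h` and `mmseqs mergedbs -h` for the exact behaviour of your installed version.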