soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

How should I efficiently cluster multiple DBs? #519

Open daron-m-standley opened 2 years ago

daron-m-standley commented 2 years ago

Hi, I am building a searchable database of antibody and T cell receptor (TCR) repertoires (here, a "repertoire" is a set of antibody or TCR sequences from a single blood sample from a single donor). Searches are performed using mmseqs, with each repertoire stored as its own mmseqs DB. So far, the search function is working nicely. Next, I'd like to implement a clustering option. My idea is to let users select a set of repertoire DBs and cluster them together using linclust. My questions are:

  1. Can either mergedbs or concatdbs be used to combine a set of DBs for clustering with linclust?
  2. Is there a more efficient strategy than combining the individual DBs?

Each DB typically contains tens of thousands of sequences or more, each about 40 amino acids long (i.e. just the three CDR regions concatenated, not the full-length protein). Thanks in advance for your help!
-Daron

khb7840 commented 2 years ago

I ran into a similar question when choosing between concatdbs and mergedbs. The difference I found is that mergedbs merges the entries from the input DBs by default.
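
For the "combine, then cluster" route from the question, here is a minimal sketch of how the command sequence could be planned. It assumes that `mmseqs concatdbs` takes two input DBs and one output DB per call, and that `mmseqs linclust` takes an input DB, an output DB and a temporary directory. The function name, the `.partN` intermediate names and the example DB names are hypothetical. The sketch only builds the command lines and does not run them, and it does not handle the accompanying header DBs.

```python
from typing import List


def plan_concat_and_linclust(
    dbs: List[str], combined: str, clusters: str, tmp_dir: str
) -> List[List[str]]:
    """Return mmseqs command lines (not executed) that fold the DBs together
    pairwise with concatdbs and then cluster the result with linclust.

    Assumption: concatdbs accepts exactly two input DBs per call, so N DBs
    need N-1 calls. Intermediate names ("<combined>.partN") are hypothetical.
    """
    if not dbs:
        raise ValueError("need at least one DB")

    commands: List[List[str]] = []
    current = dbs[0]
    for i, nxt in enumerate(dbs[1:], start=1):
        is_last = i == len(dbs) - 1
        # Only the final concatenation gets the requested output name.
        out = combined if is_last else f"{combined}.part{i}"
        commands.append(["mmseqs", "concatdbs", current, nxt, out])
        current = out

    # With a single DB there is nothing to concatenate; cluster it directly.
    commands.append(["mmseqs", "linclust", current, clusters, tmp_dir])
    return commands


if __name__ == "__main__":
    for cmd in plan_concat_and_linclust(["repA", "repB", "repC"], "combined", "clu", "tmp"):
        print(" ".join(cmd))
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)` once the arguments have been checked against the installed MMseqs2 version.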