soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Issues integrating GPU-accelerated search in colabfold alignment protocol #904

Open clami66 opened 13 hours ago

clami66 commented 13 hours ago

I am trying to integrate the new GPU-accelerated search in colabfold_search. From what I can see, only search and easy-search are GPU-accelerated. However, the colabfold_search alignment protocol also includes an expandaln step (among others).

Unfortunately, it seems like expandaln is incompatible with the padded sequence DB generated and indexed for GPU, as running mmseqs expandaln on this database will cause it to crash. I think this is because the database .idx.index file lacks rows 24-25, i.e. ALNINDEX, ALNDATA as defined here: https://github.com/soedinglab/MMseqs2/blob/266c894c117a9bd650450974747424ce51124bf5/src/prefiltering/PrefilteringIndexReader.cpp#L33C1-L34C52
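
To check this without reading the file by eye, a minimal Python sketch along the following lines could report which of the two keys are absent. It assumes that each line of the `.idx.index` file holds three whitespace-separated columns with the integer key first, which is what the `tail` output in the reproduction steps suggests. The function names are hypothetical and not part of MMseqs2; the key numbers and names are the ones quoted above.

```python
from pathlib import Path

# Key numbers and names as quoted in this issue (ALNINDEX, ALNDATA).
ALN_KEYS = {24: "ALNINDEX", 25: "ALNDATA"}


def read_index_keys(index_path):
    """Return the set of integer keys found in the first column of an index file."""
    keys = set()
    for line in Path(index_path).read_text().splitlines():
        fields = line.split()
        if fields:  # skip empty lines
            keys.add(int(fields[0]))
    return keys


def missing_alignment_keys(index_path, required=ALN_KEYS):
    """Return the subset of `required` whose keys do not appear in the index file."""
    present = read_index_keys(index_path)
    return {key: name for key, name in required.items() if key not in present}
```

Calling `missing_alignment_keys("uniref30_2302_db_gpu.idx.index")` would return an empty dict if both rows exist, and a dict naming the missing entries otherwise.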

I thought that this was due to using the --index-subset 2 flag when running mmseqs createindex as recommended in the guide, but even using --index-subset 0 doesn't fix the issue for me.

Now I am wondering if the whole alignment protocol should change (e.g. by removing expandaln altogether) or perhaps there is something I am doing incorrectly when setting the database up? Thanks for any help on this!

## Steps to Reproduce (for bugs)

  1. Generate the padded DB: `mmseqs makepaddedseqdb uniref30_2302_db uniref30_2302_db_gpu`

  2. Generate the index (either with --index-subset 0 or --index-subset 2)

    
    $ mmseqs createindex uniref30_2302_db_gpu tmp --split 0 --index-subset 0
    ...
    Write VERSION (0)
    Write META (1)
    Write SCOREMATRIXNAME (2)
    Write SPACEDPATTERN (23)
    Write GENERATOR (22)
    Write DBR1INDEX (5)
    Write DBR1DATA (6)
    Write HDR1INDEX (18)
    Write HDR1DATA (19)
    Write SCOREMATRIX3MER (4)
    Write SCOREMATRIX2MER (3)
    ...
    Write ENTRIES (9)
    Write ENTRIESOFFSETS (10)
    Write SEQINDEXDATASIZE (15)
    Write SEQINDEXSEQOFFSET (16)
    Write SEQINDEXDATA (14)
    Write ENTRIESNUM (12)
    Write SEQCOUNT (13)

3. The resulting `.idx.index` file lacks rows 24-25:

    $ tail uniref30_2302_db_gpu.idx.index
    ...
    21 10770190336 105711065
    22 20480 41
    23 16384 1


4. Run `mmseqs expandaln`

    mmseqs expandaln ./example/qdb colabfold_databases/uniref30_2302_db_gpu.idx ./example/res colabfold_databases/uniref30_2302_db_gpu.idx ./res_exp
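
For anyone who wants to script the database setup from steps 1 and 2, a small Python sketch might look like the one below. It only assembles the two command lines exactly as they are written in this issue. The function name and its parameters are hypothetical, and nothing is executed unless the lists are passed to something like `subprocess.run(cmd, check=True)`.

```python
import shlex


def repro_commands(seq_db, gpu_db, tmp_dir="tmp", index_subset=0):
    """Build the argv lists for steps 1 and 2, using only flags shown in this issue."""
    return [
        # Step 1: generate the padded sequence DB for GPU search.
        ["mmseqs", "makepaddedseqdb", seq_db, gpu_db],
        # Step 2: index it; the issue tried --index-subset 0 and 2.
        ["mmseqs", "createindex", gpu_db, tmp_dir,
         "--split", "0", "--index-subset", str(index_subset)],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in repro_commands("uniref30_2302_db", "uniref30_2302_db_gpu"):
        print(shlex.join(cmd))
```

Keeping the commands as lists rather than one shell string avoids quoting problems if a database path contains spaces.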


## MMseqs Output

`expandaln` crashes while attempting to load the index:

    MMseqs Version:             dc7395810db17ec7de8adf32599562452b0c4d78
    Expansion mode              0
    Substitution matrix         aa:blosum62.out,nucl:nucleotide.out
    Gap open cost               aa:11,nucl:5
    Gap extension cost          aa:1,nucl:2
    Max sequence length         65535
    Score bias                  0
    Compositional bias          1
    Compositional bias          1
    E-value threshold           0.001
    Seq. id. threshold          0
    Coverage threshold          0
    Coverage mode               0
    Pseudo count mode           0
    Pseudo count a              substitution:1.100,context:1.400
    Pseudo count b              substitution:4.100,context:5.800
    Expand filter clusters      0
    Use filter only at N seqs   0
    Maximum seq. id. threshold  0.9
    Minimum seq. id.            0.0
    Minimum score per column    -20
    Minimum coverage            0
    Select N most diverse seqs  1000
    Preload mode                0
    Compressed                  0
    Threads                     128
    Verbosity                   3

    Index version: 16
    Generated by:  dc7395810db17ec7de8adf32599562452b0c4d78
    ScoreMatrix:  VTML80.out
    Index version: 16
    Generated by:  dc7395810db17ec7de8adf32599562452b0c4d78
    ScoreMatrix:  VTML80.out
    Invalid database read for database data file=colabfold_databases/uniref30_2302_db_gpu.idx, database index=colabfold_databases/uniref30_2302_db_gpu.idx.index
    getData: local id (4294967295) >= db size (22)
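
The local id in the last line, 4294967295, equals 2^32 - 1, the largest unsigned 32-bit integer. One plausible reading, which is an assumption and not confirmed anywhere in this thread, is that the lookup of a missing key returned a "not found" sentinel, which then failed the bounds check against the 22 entries in the index. The Python sketch below only illustrates that pattern. The function names are hypothetical and this is not MMseqs2 code.

```python
import bisect

UINT32_MAX = 2**32 - 1  # 4294967295, used here as the "not found" sentinel


def get_local_id(sorted_keys, key):
    """Return the position of `key` in `sorted_keys`, or UINT32_MAX if it is absent."""
    pos = bisect.bisect_left(sorted_keys, key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pos
    return UINT32_MAX


def get_data_slot(sorted_keys, key):
    """Look up `key` and bounds-check the resulting local id before using it."""
    local_id = get_local_id(sorted_keys, key)
    if local_id >= len(sorted_keys):
        raise IndexError(
            f"getData: local id ({local_id}) >= db size ({len(sorted_keys)})"
        )
    return local_id
```

With a sorted list of 22 keys that does not contain 24, `get_data_slot(keys, 24)` raises an error whose message has the same shape as the one in the log above.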



## Your Environment

* MMseqs2 commit: dc73958
* Compiled with `-DENABLE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES="75;80;86;89;90"`
* CUDA environment spec: `gcccuda/12.1.1-gcc12.3.0`
* System: NVIDIA SuperPOD/DGX-A100 - Linux

milot-mirdita commented 12 hours ago

Still working on it, we'll likely release the changes to do ColabFold with MMseqs2-GPU this weekend. colabfold_search doesn't actually require any changes directly. The new protocol can be activated with environment variables only, after building GPU databases.

clami66 commented 12 hours ago

Thanks for responding so quickly. I will keep an eye out for the updates.