sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

custom databases with colabfold_search #462

Open seanrjohnson opened 1 year ago

seanrjohnson commented 1 year ago

I'm trying to predict structures for a bunch of sequences from the same family, so I don't need to search against the entire uniref30 or envdb. I just want to build a reference database from the sequences themselves (only a few thousand) and generate the MSAs by searching against it.

Can you recommend a way to do this?

Using a subset of my sequences of interest in test_queries.fasta (in this case, just two sequences, named 1 and 2), I have tried:

mkdir dbs
cp test_queries.fasta dbs
cd dbs
mmseqs createdb test_queries.fasta query_db
mmseqs createindex query_db tmpdir
cd ..
colabfold_search --db1 query_db --use-env 0 --use-templates 0 test_queries.fasta dbs msas

I see the error:

Invalid database read for database data file=dbs/query_db.idx, database index=dbs/query_db.idx.index
getData: local id (4294967295) >= db size (22)
Traceback (most recent call last):
  File "/home/sean/miniconda3/envs/colabfold_1_5_2/bin/colabfold_search", line 8, in <module>
    sys.exit(main())
  File "/home/sean/miniconda3/envs/colabfold_1_5_2/lib/python3.7/site-packages/colabfold/mmseqs/search.py", line 444, in main
    threads=args.threads,
  File "/home/sean/miniconda3/envs/colabfold_1_5_2/lib/python3.7/site-packages/colabfold/mmseqs/search.py", line 86, in mmseqs_search_monomer
    run_mmseqs(mmseqs, ["expandaln", base.joinpath("qdb"), dbbase.joinpath(f"{uniref_db}{dbSuffix1}"), base.joinpath("res"), dbbase.joinpath(f"{uniref_db}{dbSuffix2}"), base.joinpath("res_exp"), "--db-load-mode", str(db_load_mode), "--threads", str(threads)] + expand_param)
  File "/home/sean/miniconda3/envs/colabfold_1_5_2/lib/python3.7/site-packages/colabfold/mmseqs/search.py", line 23, in run_mmseqs
    subprocess.check_call([mmseqs] + params)
  File "/home/sean/miniconda3/envs/colabfold_1_5_2/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '[PosixPath('mmseqs'), 'expandaln', PosixPath('msas/qdb'), PosixPath('dbs/query_db.idx'), PosixPath('msas/res'), PosixPath('dbs/query_db.idx'), PosixPath('msas/res_exp'), '--db-load-mode', '0', '--threads', '64', '--expansion-mode', '0', '-e', 'inf', '--expand-filter-clusters', '1', '--max-seq-id', '0.95']' returned non-zero exit status 1.

Is there a generic version of the a3m pipeline that I can use with an arbitrary reference database?

I first tried with the MMseqs2 from conda. Then, thinking it might be some weird issue with the binaries, I downloaded the source and recompiled, but that didn't help.

kfletcher88 commented 1 year ago

I have been able to generate a custom MSA for one query protein with a prebuilt target database by:

mkdir -p tmp
mmseqs createdb [Query].faa [Query].db
mmseqs search [Query].db [Target].db [Query]x[Target].db ./tmp
mmseqs result2msa [Query].db [Target].db [Query]x[Target].db [Query]x[Target].a3m --msa-format-mode 5
colabfold_batch [Query]x[Target].a3m [Query]x[Target]_out
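
For several queries, the steps above could be wrapped in a small shell function. This is only a sketch: the helper names (run, custom_msa), the DRY_RUN switch, and the output naming scheme are my own assumptions and not part of ColabFold or MMseqs2. The commands and flags themselves are the same ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the steps above.
#   $1 = query FASTA, $2 = existing MMseqs2 target database, $3 = output prefix
custom_msa() {
    query_fasta=$1
    target_db=$2
    out=$3
    run mkdir -p tmp
    # Convert the query FASTA into an MMseqs2 database.
    run mmseqs createdb "$query_fasta" "$out.qdb"
    # Search the query against the prebuilt target database.
    run mmseqs search "$out.qdb" "$target_db" "$out.res" tmp
    # Convert the search result to A3M (format mode 5, as in the commands above).
    run mmseqs result2msa "$out.qdb" "$target_db" "$out.res" "$out.a3m" --msa-format-mode 5
    # Predict the structure from the custom MSA.
    run colabfold_batch "$out.a3m" "${out}_out"
}

# Example (prints the commands without running them):
#   DRY_RUN=1 custom_msa query.faa target.db run1
```

Running it once with DRY_RUN=1 shows exactly which commands would be executed before any of the tools are called.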

Probably, to build the target database, you need to use (untested):

mmseqs createdb [Target].fasta [Target].db
mmseqs createindex [Target].db tmp --remove-tmp-files 1
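
Since the target database here would be built from only a few thousand family sequences, it may be worth confirming that the FASTA contains the expected number of records before building. A minimal sketch, equally untested against real data: the function names (count_fasta_records, build_target_db) are hypothetical, and the two mmseqs commands are the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper: count FASTA records (header lines starting with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical wrapper: report the record count, then build and index the
# target database. $1 = target FASTA, $2 = database name.
build_target_db() {
    fasta=$1
    db=$2
    n=$(count_fasta_records "$fasta")
    echo "building $db from $n sequences" >&2
    mmseqs createdb "$fasta" "$db" &&
        mmseqs createindex "$db" tmp --remove-tmp-files 1
}

# Example:
#   build_target_db family.fasta family.db
```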

Hope it helps!