soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Segmentation fault (Error: Alignment died) on ColabFold envdb when sequence length is less than 12 aa long #538

Open knuser opened 2 years ago

knuser commented 2 years ago

Expected Behavior

MMseqs2 should not crash on the envdb when the query sequence is shorter than 12 aa (for example SEGGQDFWL or GSSGLISMPRV).

Current Behavior

The MMseqs2 process crashes every time while aligning against the ColabFold envdb if the input .fasta file contains a short sequence (this also happens when the file contains more than one sequence). The UniRef database is processed without issue every time; the crash happens only during envdb processing.

align results_700_only_456_fasta_700_5/prof_res ../db_sources/colabfold_envdb_202108_db.idx results_700_only_456_fasta_700_5/tmp/17071544472219224293/pref_0 results_700_only_456_fasta_700_5/tmp/17071544472219224293/aln_0 --sub-mat aa:blosum62.out,nucl:nucleotide.out -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 1 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 64 --compressed 0 -v 3

Index version: 16
Generated by:  fcf52600801a73e95fd74068e1bb1afb437d719d
ScoreMatrix:  VTML80.out
Compute score only
Query database size: 1 type: Profile
Target database size: 209335862 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 1 eta -
Segmentation fault (core dumped)
Error: Alignment died
Traceback (most recent call last):
  File "/home/x/genomic/alphafold2/venv38alphafold2/bin/colabfold_search", line 8, in <module>
    sys.exit(main())
  File "/home/x/genomic/alphafold2/venv38alphafold2/lib/python3.8/site-packages/colabfold/mmseqs/search.py", line 180, in main
    mmseqs_search(
  File "/home/x/genomic/alphafold2/venv38alphafold2/lib/python3.8/site-packages/colabfold/mmseqs/search.py", line 100, in mmseqs_search
    run_mmseqs(mmseqs, ["search", base.joinpath("prof_res"), dbbase.joinpath(metagenomic_db), base.joinpath("res_env"), base.joinpath("tmp"), "--threads", str(threads)] + search_param)
  File "/home/x/genomic/alphafold2/venv38alphafold2/lib/python3.8/site-packages/colabfold/mmseqs/search.py", line 21, in run_mmseqs
    subprocess.check_call([mmseqs] + params)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '[PosixPath('/home/x/genomic/mmseqs2/MMseqs2/build/bin/mmseqs'), 'search', PosixPath('results_700_only_456_fasta_700_5/prof_res'), PosixPath('../db_sources/colabfold_envdb_202108_db'), PosixPath('results_700_only_456_fasta_700_5/res_env'), PosixPath('results_700_only_456_fasta_700_5/tmp'), '--threads', '64', '--num-iterations', '3', '--db-load-mode', '2', '-a', '-s', '8', '-e', '0.1', '--max-seqs', '10000']' returned non-zero exit status 1.

Steps to Reproduce (for bugs)

Put one of the example sequences from above (SEGGQDFWL or GSSGLISMPRV) anywhere in input_sequences.fasta. Both single-entry and multi-entry FASTA files are affected.

Set up the ColabFold databases with https://github.com/sokrypton/ColabFold/blob/main/setup_databases.sh, then run colabfold_search input_sequences.fasta /path/to/db_folder search_results. You will see the crash shown above.
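For convenience, here is a small Python sketch of the local reproduction steps. It only writes the FASTA file and builds the colabfold_search argument list in the form given above; it does not run the search. The record names (short_9aa, short_11aa) and the helper functions write_fasta and build_search_command are made up for illustration, and /path/to/db_folder is a placeholder for wherever the databases were set up.

```python
from pathlib import Path
import tempfile

# Short peptides reported in this issue as triggering the crash.
# The record names are arbitrary labels chosen for this sketch.
CRASHING_EXAMPLES = {
    "short_9aa": "SEGGQDFWL",
    "short_11aa": "GSSGLISMPRV",
}


def write_fasta(records, path):
    """Write {name: sequence} records to a FASTA file and return its Path."""
    path = Path(path)
    with path.open("w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n{seq}\n")
    return path


def build_search_command(fasta_path, db_folder, out_dir):
    """Return the colabfold_search argument list (nothing is executed here)."""
    return ["colabfold_search", str(fasta_path), str(db_folder), str(out_dir)]


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    fasta = write_fasta(CRASHING_EXAMPLES, workdir / "input_sequences.fasta")
    # "/path/to/db_folder" is a placeholder, as in the steps above.
    cmd = build_search_command(fasta, "/path/to/db_folder", workdir / "search_results")
    print(" ".join(cmd))
    # To actually run the search, pass cmd to subprocess.run(cmd, check=True).
```

Keeping the command as a list mirrors how colabfold's own search.py hands arguments to subprocess in the traceback above, and avoids shell quoting problems.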

OR

Go to https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb and try to fold one of the examples. You will see:

Exception: MMseqs2 API is giving errors. Please confirm your input is a valid protein sequence. If error persists, please try again an hour later.

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.

Segmentation fault (core dumped)
Error: Alignment died

Context

If you extend the crashing examples to 12 aa, mmseqs works correctly. It seems that 12 is some kind of magic threshold in the examples I found.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

knuser commented 2 years ago

Update: I have just found a 13 aa example that also causes a segfault: TDPPIHIASLXRS

Observation: after changing the X to another residue, for example G (TDPPIHIASLGRS), MMseqs2 processes the example correctly.

EDIT: another segfault example, this time much longer: DPLVFFKXXFXXGGGGGAGCGGGGMKRT. Observation: an extended version, DPLVFFKXXFXXGGGGGAGCGGGGMKRTRRALPAN, is processed correctly.
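Until the root cause is known, a rough pre-screen of the input may help others work out which queries to test. The Python sketch below flags sequences that match the two patterns seen so far in this thread: shorter than 12 aa, or containing X. This is purely a heuristic built from the examples here, not a description of what MMseqs2 does internally, and it over-flags: the extended 35 aa sequence contains X but was reported to work. The function name flag_risky and the min_len parameter are made up for this sketch.

```python
def flag_risky(seq, min_len=12):
    """Return the reasons a sequence matches the crash patterns seen in this thread.

    Heuristic only: based on the reported examples, not on MMseqs2 internals.
    An empty list means neither pattern applies.
    """
    reasons = []
    if len(seq) < min_len:
        reasons.append(f"shorter than {min_len} aa")
    if "X" in seq.upper():
        reasons.append("contains X")
    return reasons


if __name__ == "__main__":
    examples = [
        "SEGGQDFWL",                            # reported crash
        "GSSGLISMPRV",                          # reported crash
        "TDPPIHIASLXRS",                        # reported crash
        "TDPPIHIASLGRS",                        # reported to work
        "DPLVFFKXXFXXGGGGGAGCGGGGMKRT",         # reported crash
        "DPLVFFKXXFXXGGGGGAGCGGGGMKRTRRALPAN",  # reported to work, still flagged
    ]
    for seq in examples:
        print(seq, flag_risky(seq) or "not flagged")
```

The last example shows the limit of the heuristic: the presence of X alone does not predict the crash, so flagged sequences are candidates to test rather than known failures.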