steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Segmentation fault when using --allow-deletion with result2msa #284

Open danny305 opened 3 weeks ago

danny305 commented 3 weeks ago

Expected Behavior

I expected result2msa to write the MSA files. Instead, I get a segmentation fault.

Current Behavior

I get a segmentation fault when I add the --allow-deletion flag. The same command works without the flag.

Steps to Reproduce (for bugs)

Here is the command I ran on a database of 5 PDB files with 8 chains in total:

foldseek result2msa foldseek_DBs/db foldseek_DBs/db interm/aln msa/msa --msa-format-mode 6 --allow-deletion

Foldseek Output (for bugs)

result2msa foldseek_DBs/db foldseek_DBs/db interm/aln msa/msa --msa-format-mode 6 --allow-deletion 

MMseqs Version:                 bb090174ab59557ff9ffc874598f4c3904f55bc6
Substitution matrix             aa:3di.out,nucl:3di.out
Gap open cost                   aa:10,nucl:10
Gap extension cost              aa:1,nucl:1
Allow deletions                 true
Compositional bias              1
Compositional bias              1
MSA format mode                 6
Summary prefix                  cl
Skip query                      false
Filter MSA                      0
Use filter only at N seqs       0
Maximum seq. id. threshold      0.9
Minimum seq. id.                0.0
Minimum score per column        -20
Minimum coverage                0
Select N most diverse seqs      1000
Preload mode                    0
Threads                         128
Compressed                      0
Verbosity                       3

Query database size: 8 type: Aminoacid
Target database size: 8 type: Aminoacid
./run_hada_msa.sh: line 3: 2311279 Segmentation fault      (core dumped) foldseek result2msa foldseek_DBs/db foldseek_DBs/db interm/aln msa/msa --msa-format-mode 6 --allow-deletion

Context


Your Environment


Also, if you could provide an in-depth explanation of what --allow-deletion does, I would very much appreciate it! In the past, I have gotten mixed results with it in MMseqs2, and I am not sure exactly when and how to use it there.

From my understanding, it allows deletions in the query sequence, i.e. it adds gaps ("-") to the query sequence. I am trying to use it to get better column/residue alignments while preserving insertions in the other sequences in the MSA. However, this does not work. Currently, my solution/hack is to post-process the output and delete the insertions, so that I have a consistent alignment down each column/residue.

It would be great if, instead of deleting the insertions, I could allow deletions in the query sequence, so that I do not have to post-process the a3m file. Ideally, I would keep every insertion while still having every column/residue aligned.

Let me know if I could help in any way!

Danny

danny305 commented 3 weeks ago

Essentially, it would be amazing if you could help me figure out how to output an a3m file where the query sequence is in the qaln format and the aligned sequences are in the taln format (referencing foldseek easy-search --format-output).

milot-mirdita commented 3 weeks ago

I would recommend against using --allow-deletion. It was never fully implemented and can easily overflow memory and crash. I think we allocate 2x the memory if allow-deletion is activated, but the MSA can grow well beyond 2x the length. However, figuring this out correctly is a bit finicky and we never really needed this internally.
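To illustrate why a fixed 2x bound can fall short, here is a minimal sketch. It is not Foldseek's allocation logic; the function `full_msa_width` and its input layout are hypothetical. It assumes that, once the query may contain gaps, the MSA needs one column per query residue plus, at each query position, enough extra columns for the longest insertion any target has there.

```python
from typing import Dict, List


def full_msa_width(query_len: int, insertions: List[Dict[int, int]]) -> int:
    """Number of columns needed when insertions are kept as real columns.

    insertions holds one dict per target sequence, mapping a query position
    to the length of that target's insertion after the position.
    (Hypothetical input layout, chosen only for this illustration.)
    """
    widest: Dict[int, int] = {}
    for per_target in insertions:
        for pos, length in per_target.items():
            # Each position needs room for the longest insertion seen there.
            widest[pos] = max(widest.get(pos, 0), length)
    return query_len + sum(widest.values())


# A 100-residue query with three targets whose insertions fall at
# different positions: the required width already exceeds 2 * 100.
width = full_msa_width(100, [{10: 50}, {40: 120}, {10: 30, 70: 90}])
print(width, width > 2 * 100)
```

Because insertions at different positions add up across targets, the width has no fixed multiple of the query length as an upper bound.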

I don't think we have a good solution, except to continue post-processing.

However, isn't your post-processing step essentially just removing all lowercase letters? They indicate gaps in all other sequences in the A3M format.
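The post-processing described here can be sketched in a few lines. This is a minimal example, assuming the MSA is available as plain A3M text with '>' header lines; the function name `strip_insertions` is made up for illustration and is not part of Foldseek or MMseqs2.

```python
import string

# Translation table that deletes every lowercase ASCII letter.
_DROP_LOWER = str.maketrans("", "", string.ascii_lowercase)


def strip_insertions(a3m_text: str) -> str:
    """Remove lowercase (insertion) characters from every sequence line."""
    cleaned = []
    for line in a3m_text.splitlines():
        # Header ('>') and comment ('#') lines are kept unchanged.
        if line.startswith((">", "#")):
            cleaned.append(line)
        else:
            cleaned.append(line.translate(_DROP_LOWER))
    return "\n".join(cleaned)


example = ">query\nACDE\n>target\nAC-gkE"
print(strip_insertions(example))
```

After stripping, every sequence line has the same length as the query, so each column corresponds to exactly one query residue.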

danny305 commented 3 weeks ago

Is this also true for --allow-deletion in MMseqs2? Or just Foldseek?

Yeah, we just remove the lowercase letters, or do 1-1 alignments to have indels in both the query and target sequences.
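For reference, the relationship between a 1-1 alignment and an A3M row can be sketched as follows. This assumes two equal-length gapped strings such as the qaln and taln fields, with "-" as the gap character; the helper `pair_to_a3m_row` is hypothetical. It covers only the aligned region, so padding for unaligned query residues before and after the alignment is left out.

```python
def pair_to_a3m_row(qaln: str, taln: str) -> str:
    """Convert one gapped query/target pair into an A3M-style target row."""
    if len(qaln) != len(taln):
        raise ValueError("qaln and taln must have the same length")
    row = []
    for q, t in zip(qaln, taln):
        if q == "-" and t == "-":
            continue  # column empty in both sequences: nothing to emit
        if q == "-":
            row.append(t.lower())  # target residue inserted relative to the query
        elif t == "-":
            row.append("-")  # query residue deleted in the target
        else:
            row.append(t.upper())  # aligned (match) column
    return "".join(row)


print(pair_to_a3m_row("AC--DE", "ACGK-E"))
```

Query gaps become lowercase letters in the target row, which is why removing the lowercase letters brings the row back to one character per query residue.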


milot-mirdita commented 3 weeks ago

The same code is run for both.