Open danny305 opened 3 weeks ago
Essentially, if you could help me figure out how to output an a3m file where the query sequence is in the qaln format and the aligned sequences are in the taln format (referencing foldseek easy-search --format-output), that would be amazing.
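In case it helps, here is a rough sketch of how a single qaln/taln pair could be turned into one A3M row by hand. It assumes both strings are equal-length gapped alignments of the same hit, and that qstart (1-based) and qlen come from the same --format-output line. The function name pair_to_a3m_row is made up for illustration and is not part of foldseek or MMseqs2.

```python
from typing import Optional


def pair_to_a3m_row(qaln: str, taln: str, qstart: int = 1,
                    qlen: Optional[int] = None) -> str:
    """Convert one pairwise alignment (qaln, taln) into an A3M row.

    Hypothetical helper, not a foldseek API. Columns where the query has a
    residue become match columns (uppercase residue or '-'); columns where
    the query has a gap become lowercase insertions in the target.
    """
    if len(qaln) != len(taln):
        raise ValueError("qaln and taln must have the same length")

    row = []
    for q, t in zip(qaln, taln):
        if q == "-":
            # Target residue with no query counterpart: A3M insertion.
            if t != "-":
                row.append(t.lower())
        else:
            # Query residue present: match column (residue or deletion).
            row.append("-" if t == "-" else t.upper())

    # Pad the unaligned query flanks with gaps so every row has exactly
    # qlen match columns.
    n_match = len(qaln) - qaln.count("-")
    left = "-" * (qstart - 1)
    right = "" if qlen is None else "-" * (qlen - (qstart - 1) - n_match)
    return left + "".join(row) + right


if __name__ == "__main__":
    print(pair_to_a3m_row("AC-DE", "ACGD-", qstart=2, qlen=6))
```

The query row itself would just be the ungapped query sequence, so the lowercase letters in the other rows carry the insertions without changing the column count.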
I would recommend against using --allow-deletion; it was never fully implemented and can easily overflow memory and crash. I think we allocate 2x the memory if allow-deletion is activated, but the MSA can grow much beyond 2x length. However, figuring this out correctly is a bit finicky and we never really needed this internally.
I don't think we have a good solution, except to continue post-processing.
However, isn't your post-processing step essentially just removing all lowercase letters? They indicate gaps in all other sequences in the A3M format.
Is this also true for --allow-deletion in MMSeqs2? Or just foldseek?
Yeah, we just remove the lowercase letters, or do 1-1 alignments to have indels in both the query and target sequences.
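For reference, the lowercase-stripping step can be sketched in a few lines. This assumes a plain-text A3M where header lines start with ">" and insertions are lowercase letters; the function name strip_insertions is just illustrative.

```python
import string

# Translation table that deletes every ASCII lowercase letter.
_DROP_LOWER = str.maketrans("", "", string.ascii_lowercase)


def strip_insertions(a3m_text: str) -> str:
    """Remove lowercase insertion columns from every sequence line.

    Header lines (starting with '>') are kept unchanged. After this step
    every sequence has the same number of columns as the query.
    """
    out = []
    for line in a3m_text.splitlines():
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(line.translate(_DROP_LOWER))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = ">query\nACDE\n>hit1\nACgD-\n"
    print(strip_insertions(example), end="")
```

Note that this throws the inserted residues away, which is exactly the information loss being discussed in this issue.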
The same code is run for both.
Expected Behavior
Rather than outputting the MSA files I get a segmentation fault.
Current Behavior
I get a segmentation fault when I add the --allow-deletion flag. It works when I don't use the flag.
Steps to Reproduce (for bugs)
Please make sure to execute the reproduction steps with newly recreated and empty tmp folders. Here is the command I was running using a database of 5 pdb files with 8 total chains:
foldseek result2msa foldseek_DBs/db foldseek_DBs/db interm/aln msa/msa --msa-format-mode 6 --allow-deletion
Foldseek Output (for bugs)
Context
Providing context helps us come up with a solution and improve our documentation for the future.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
conda install bioconda::foldseek bioconda::mmseqs2
Also, if you could provide an in-depth explanation of what --allow-deletion does, I would very much appreciate it! In the past, I got mixed results when I used it with MMSeqs2, and I am not sure exactly when and how to use it there.
From my understanding, it allows deletions in the query sequence, i.e. it adds gaps ("-") to the query sequence. I am trying to use it for better column/residue alignments while preserving insertions in other sequences in the MSA. However, this does not work. Currently, my solution/hack is to post-process and delete the insertions to make sure I have a consistent alignment down a column/residue.
It would be great if I could allow deletions in the query sequence instead, so that I do not have to post-process the a3m file. Ideally, I would not have to delete any insertions while still having every column/residue aligned.
Let me know if I could help in any way!
Danny