steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0
780 stars 99 forks

Multiple Structure Alignment with Foldseek #168

Open lavibig opened 1 year ago

lavibig commented 1 year ago

Dear team, thanks for this amazing tool. I'm trying to generate a multiple structure alignment from a set of PDB files. I understand that Foldseek runs a pairwise structural alignment during the search, combining 3Di and sequence information. My question is how to generate a multiple structure alignment. I tried the following procedure, suggested in git:

foldseek createdb example/ targetDB
foldseek createdb example/ queryDB
foldseek search queryDB targetDB aln tmpFolder -a
foldseek result2msa queryDB targetDB aln msa --msa-format-mode 6
foldseek unpackdb msa msa_output --unpack-suffix a3m --unpack-name-mode 0

The problem is that the sequences in the resulting .a3m files do not look aligned. Am I missing something? Is there a more straightforward way to generate a multiple structure alignment using Foldseek?

Lavi

milot-mirdita commented 1 year ago

This should work. However, the output format here is a3m, which writes insertions relative to the query as lower-case letters instead of adding gap columns to all other sequences. This reduces the file size of MSAs tremendously, but it can be confusing if you have never seen it before.

You can either drop the lower-case letters with something like the following:

awk '/^>/ { print; next; } { gsub(/[a-z]/, "", $0); print; }' asd.a3m

Or use --msa-format-mode 2 to generate aligned FASTA.
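As a small illustration of the awk filter above, here is a sketch on a made-up a3m file. The file names toy.a3m and toy.aligned.fasta and the sequences are hypothetical and not Foldseek output; the lower-case "gs" in hit1 stands for an insertion relative to the query.

```shell
# Hypothetical toy a3m file (not real Foldseek output).
# 'gs' in hit1 is an insertion relative to the query, so it is lower-case.
cat > toy.a3m <<'EOF'
>query
MKVLA
>hit1
MKVgsLA
>hit2
M-VLA
EOF

# Header lines (starting with '>') are printed unchanged;
# on sequence lines, every lower-case letter is removed.
awk '/^>/ { print; next; } { gsub(/[a-z]/, "", $0); print; }' toy.a3m > toy.aligned.fasta

cat toy.aligned.fasta
```

After filtering, all three sequence rows have five columns (MKVLA, MKVLA, M-VLA). Note that the inserted residues are discarded, so the result only covers the query's columns.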

You can also use the reformat.pl script from HHsuite to convert from a3m to fasta.
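Whichever conversion you use, a quick sanity check is that every sequence in the resulting FASTA has the same length. The following is a minimal sketch of such a check; check.fasta and check.result are hypothetical file names and the sequences are made up for illustration.

```shell
# Hypothetical aligned FASTA used only for illustration.
cat > check.fasta <<'EOF'
>query
MKVLA
>hit1
MKVLA
>hit2
M-VLA
EOF

# Sum the sequence characters of each record (records may span several lines)
# and count how many distinct record lengths occur.
awk '
  /^>/ { if (name != "") lens[len]++; name = $0; len = 0; next }
  { len += length($0) }
  END {
    if (name != "") lens[len]++
    n = 0
    for (l in lens) n++
    if (n == 1) print "aligned"; else print "not aligned"
  }
' check.fasta > check.result

cat check.result
```

If the file still contains a3m-style lower-case insertions, the record lengths will usually differ and the check reports "not aligned".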

lavibig commented 1 year ago

Thanks for your answer. I tried the reformat.pl script and it worked nicely. A related question remains: using this approach, will the result be a multiple sequence alignment or a multiple structure alignment?

milot-mirdita commented 1 year ago

The result is a multiple amino acid sequence alignment. However, the alignment was done with Foldseek, so 3Di (structural) and AA information were used (and TMalign information in TM mode).

The result is also a query-centric MSA. We are developing a different tool for MSAs of full-length aligned structures. We hope to release a preprint for that tool soon.

lavibig commented 1 year ago

Got it. Thanks!