minmarg / gtalign_alpha

GTalign, HPC protein structure alignment, superposition and search (alpha release)
Apache License 2.0

GTalign (alpha release)

GTalign is a novel high-performance (HPC) method for protein structure alignment, superposition, and search, with flexible structure clustering ability.

Features

A note on the CPU/multiprocessing version

GTalign is optimized to run on GPUs. Its CPU/multiprocessing version is based on the algorithms developed for GPUs, but the implementation details differ. The CPU/multiprocessing version produces similar results, but superpositions and alignments produced for some structure pairs may be different. TM-scores and RMSDs are correct within numerical error.

The CPU/multiprocessing version using 20 threads is 10-20x slower than the GPU version running on a V100. The slowdown also depends on the options used: it grows as the --speed value decreases.

Available Platforms

The GTalign source code should compile and run on Linux, MS Windows, and macOS. GTalign has been tested on, and binaries are provided for, the following platforms:

System requirements (GPU version)

System requirements (CPU/multiprocessing version)

Installation of pre-compiled binaries

Download or clone the repository:

git clone https://github.com/minmarg/gtalign_alpha.git

On Linux, run the installer script for the GPU or the CPU version, respectively, and follow the instructions:

Linux_installer_GPU/GTalign-linux64-installer-GPU.sh

Linux_installer_mp/GTalign-linux64-installer-mp.sh

On MS Windows 10/11, run the GPU-version installer:

MS_Windows10_installer_GPU/GTalign-win64-installer.msi

Installation from source code

Installation on Linux and macOS

Software requirements

To build and install the GTalign software from the source code on Linux or macOS (CPU version), these tools are required to be installed:

Installation

Run the shell script for the GPU (Linux) and CPU versions, respectively, using GCC or LLVM/Clang compilers (takes several minutes to compile):

BUILD_and_INSTALL__GPU__unix.sh

BUILD_and_INSTALL__GPU__unix__clang.sh

BUILD_and_INSTALL__mp__unix.sh

BUILD_and_INSTALL__mp__unix__clang.sh

Installation on MS Windows

Software requirements

To build and install the GTalign software from the source code on MS Windows, these tools are required to be installed:

Installation

Run the command (batch) file for the GPU and CPU versions, respectively:

BUILD_and_INSTALL__GPU__win64.cmd

BUILD_and_INSTALL__mp__win64.cmd

Getting started

Type gtalign for a description of the options.

Query structures and/or directories with queries are specified with the option --qrs. Reference structures (to align queries with) and/or their directories to be searched are specified with the option --rfs.

Note that GTalign reads .tar archives of compressed and uncompressed structures, meaning that big structure databases such as AlphaFold2 and ESM archived structural models are ready for use once downloaded.

Here are some examples:

gtalign -v --qrs=str1.cif.gz --rfs=my_huge_structure_database.tar -o my_output_directory

gtalign -v --qrs=struct1.pdb --rfs=struct2.pdb,struct3.pdb,struct4.pdb -o my_output_directory

gtalign -v --qrs=struct1.pdb,my_struct_directory --rfs=my_ref_directory -o my_output_directory

gtalign -v --qrs=str1.pdb.gz,str2.cif.gz --rfs=str3.cif.gz,str4.ent,my_ref_dir -s 0 -o mydir

Queries and references are processed in chunks. The maximum total length of queries in one chunk is controlled with the option --dev-queries-total-length-per-chunk. The maximum length of a reference structure can be specified with the option --dev-max-length; larger structures are skipped during a search. A good practice is to keep --dev-max-length reasonably large but bounded (e.g., <10000, unless all of your reference structures are larger) so that many structure pairs are processed in parallel.
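As a rough sketch of how the two chunking options described above might be combined in one search, the snippet below assembles a command, prints it as a dry run, and executes it only when gtalign is found on the PATH. The file names and both numeric values are placeholders chosen for illustration, not recommended or default settings, and the --option=value spelling is assumed to follow the same pattern as --qrs and --rfs.

```shell
# Sketch: a search with explicit chunking limits.
# File names and numeric values are placeholders, not recommendations.
QUERIES=struct1.pdb
REFS=my_ref_directory
MAX_REF_LEN=10000      # references longer than this are skipped
QUERY_CHUNK_LEN=4000   # maximum total query length per chunk

CMD="gtalign -v --qrs=$QUERIES --rfs=$REFS \
--dev-max-length=$MAX_REF_LEN \
--dev-queries-total-length-per-chunk=$QUERY_CHUNK_LEN \
-o my_output_directory"

echo "$CMD"   # dry run: show the assembled command

# Run it only if gtalign is installed. The arguments contain no spaces,
# so plain word splitting of $CMD is safe here.
if command -v gtalign >/dev/null 2>&1; then
  $CMD
fi
```

Lowering MAX_REF_LEN skips more of the largest references but leaves room for more structure pairs per chunk, which is the trade-off the paragraph above describes.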

For comparing protein complexes, it usually suffices to set --ter=0. The options --ter=0 --split=2 are used to consider all chains present in structure files when executing the program.
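For illustration, a comparison of two complexes with all chains considered might look like the sketch below. The helper complex_cmd and the file names are hypothetical; only the flags --ter=0 and --split=2 come from the description above. The helper prints the command instead of running it, so it can be inspected first.

```shell
# Sketch: print a command that compares two complexes using all chains.
# complex_cmd is a hypothetical helper; file names are placeholders.
complex_cmd() {
  # $1 = query complex, $2 = reference complex, $3 = output directory
  echo "gtalign -v --qrs=$1 --rfs=$2 --ter=0 --split=2 -o $3"
}

complex_cmd complexA.cif.gz complexB.cif.gz complex_output
```

When the printed command looks right, it can be pasted into the shell to run the comparison.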

Alignment sorting

GTalign offers the --sort option to arrange alignments based on various criteria. Users can choose to sort alignments by TM-score, RMSD (root-mean-squared deviation), or the secondary TM-score, 2TM-score, which is calculated over the alignment while excluding unmatched helices. Consequently, the 2TM-score penalizes topological inconsistencies more than the TM-score.

Additionally, the --sort option allows for sorting by the harmonic mean of the TM-scores or 2TM-scores. The harmonic mean is particularly effective in reducing the significance of structural alignments for pairs with large length differences. Therefore, sorting by the harmonic mean may prove beneficial when seeking and analyzing evolutionarily related or structurally similar proteins whose lengths differ by no more than a few fold.
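To see why the harmonic mean downweights pairs of very different lengths, here is a small worked example. It assumes that the two scores being averaged are the TM-score normalized by the query length and the TM-score normalized by the reference length; the helper harmonic_mean and the input values are made up for illustration and are not part of GTalign.

```shell
# Sketch: harmonic mean of two TM-scores, H = 2*a*b / (a + b).
# harmonic_mean is a hypothetical helper, not a GTalign command.
harmonic_mean() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (a + b == 0) { print "0.000"; exit }   # avoid division by zero
    printf "%.3f\n", 2 * a * b / (a + b)
  }'
}

# A short query fully covered by a much longer reference:
# high score by query length, low score by reference length.
harmonic_mean 0.9 0.3   # prints 0.450 (the arithmetic mean would be 0.600)

# Two structures of similar length with equal scores.
harmonic_mean 0.5 0.5   # prints 0.500
```

The lopsided pair scores below the balanced one under the harmonic mean, even though its arithmetic mean is higher.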

Clustering

The GPU version of GTalign allows for clustering (by complete or single linkage) of large protein structure datasets. This option is as highly configurable as the search. The simplest command-line example is:

gtalign -v --cls=my_huge_structure_database.tar -o my_output_directory

which instructs GTalign to cluster structures archived in my_huge_structure_database.tar with default parameters. The superimposed members of a cluster can then be obtained by running gtalign with the first member as query and all others as references and using options --pre-score=0 -s 0 --referenced, which produces transformation matrices for the reference structures to be superimposed on the query.
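The superposition step described above can be sketched as a small helper that takes the members of one cluster, uses the first as the query, and joins the rest into a comma-separated --rfs list. The helper name superpose_cmd, the member file names, and the output directory are hypothetical; the flags are the ones named in the text. The helper only prints the command.

```shell
# Sketch: build the command that superimposes cluster members on the first one.
# superpose_cmd is a hypothetical helper; file names are placeholders.
superpose_cmd() {
  query=$1
  shift
  # Join the remaining members with commas for --rfs.
  refs=$(IFS=,; echo "$*")
  echo "gtalign -v --qrs=$query --rfs=$refs --pre-score=0 -s 0 --referenced -o cluster_superposition"
}

superpose_cmd member1.pdb member2.pdb member3.pdb
```

Running the printed command should yield transformation matrices for member2.pdb and member3.pdb relative to member1.pdb, as described above.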

The clustering options, which can be used in combination with other options to make clustering flexible, can be found in the complete list of options.

GTalign demo notebooks on Google Colab

Two GTalign demo notebooks for Google Colab are available: GTalign_demo and GTalign_demo_search. The first showcases structure alignment for two large protein complexes -- virus nucleocapsid variants 7a4i and 7a4j -- and runs on Google Colab with a Tesla T4 GPU (finishes in a minute). The second demonstrates all-against-all alignment of the queries of the PDB20 dataset, completing in half a minute.

Citation

If you use the GTalign software or data, please cite:

Margelevicius, M. GTalign: High-performance protein structure alignment, superposition, and search. bioRxiv 2023.12.18.572167; (2023). doi: https://doi.org/10.1101/2023.12.18.572167

@article{Margelevicius2023.12.18.572167,
  author = {Mindaugas Margelevicius},
  title = {GTalign: High-performance protein structure alignment, superposition, and search},
  elocation-id = {2023.12.18.572167},
  year = {2023},
  doi = {10.1101/2023.12.18.572167},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2023/12/18/2023.12.18.572167},
  eprint = {https://www.biorxiv.org/content/early/2023/12/18/2023.12.18.572167.full.pdf},
  journal = {bioRxiv}
}

Contacts

Bug reports, comments, and suggestions are welcome. If you have other questions, please contact Mindaugas Margelevicius at mindaugas.margelevicius@bti.vu.lt.

License

Copyright 2023 Mindaugas Margelevicius, Institute of Biotechnology, Vilnius University

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Funding

This project received funding from the Research Council of Lithuania (LMTLT; grant S-MIP-23-104).