FMAlign2: A novel fast multiple nucleotide sequence alignment method for ultra-long datasets
https://github.com/metaphysicser/FMAlign2
Apache License 2.0

FMAlign2 is a novel multiple sequence alignment algorithm based on FMAlign. It is designed to align ultra-long nucleotide sequences quickly and accurately.

Installation

The program is supported on both Linux and Windows (Linux is strongly recommended for its convenience and better performance). Please make sure your computer meets the following requirements:


Once your system meets the requirements mentioned above, you can follow the steps below to compile the executable file. Alternatively, you can use the pre-compiled executable file available in the Release.

  1. Download

    git clone https://github.com/metaphysicser/FMAlign2.git
    cd FMAlign2
    # for Linux
    chmod 777 ./ext/mafft/linux/usr/libexec/mafft/disttbfast
  2. Build

    cd FMAlign2 && make [M64=1]

    Run the above command from the FMAlign2 directory in your terminal to build the project. Two compilation modes are available: 32-bit and 64-bit. The 32-bit mode is sufficient for most data. However, if the concatenated length of all sequences exceeds the range of uint32_t (4294967295), add the M64 parameter when compiling the program to generate a 64-bit executable.

    • If you don't need the 64-bit mode, simply execute the make command.
    • If you need the 64-bit mode, execute the make M64=1 command.

    Compilation may take a while, so please be patient.

    Once the compilation is complete, you will find the generated executable file in the project directory.

    Note: If you want to remove all the generated .o files, you can execute the following command:

    make clean

    This command will clean up the intermediate object files and leave only the source code and executable file in the project directory. Use this command when you want to start a fresh build or clean up unnecessary files to save disk space.
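To choose between the two build modes in step 2, it can help to measure the total sequence length of your input first. The following is a minimal Python sketch and is not part of FMAlign2: the function names `total_sequence_length` and `needs_m64` are hypothetical, and the code assumes a plain FASTA file in which header lines start with `>`. It does not account for any separator characters FMAlign2 may add when concatenating sequences, so treat totals close to the limit as requiring the 64-bit build.

```python
UINT32_MAX = 4294967295  # upper bound of uint32_t, as quoted in step 2


def total_sequence_length(fasta_lines):
    """Count sequence characters in FASTA text, skipping headers and blank lines."""
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def needs_m64(total_length, limit=UINT32_MAX):
    """Return True when the total length exceeds the uint32_t range."""
    return total_length > limit
```

For example, passing an open file handle for a decompressed dataset to `total_sequence_length` and the result to `needs_m64` indicates whether `make M64=1` is the safer choice.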


Please note that if you choose halign2 or halign3 as your multiple sequence alignment method, make sure you have a Java environment installed. To check the version of Java installed on your system, you can open a command prompt or terminal and execute the following command:

java -version

This will display the installed Java version information.

If you don't have Java installed or if the installed version is not compatible, you can follow these steps to install Java:

To install Java on Windows:

  1. Visit the official Java website at java.com or the OpenJDK website at openjdk.java.net.
  2. Download the appropriate Java Development Kit (JDK) for Windows.
  3. Run the downloaded installer and follow the on-screen instructions to complete the installation.
  4. After the installation is complete, open a new Command Prompt and run java -version to verify that Java is installed and the correct version is displayed.

To install Java on Linux:

  1. Update Package Lists: On Debian- or Ubuntu-based distributions, run the command sudo apt update to update the package lists on your system.
  2. Install OpenJDK: Run the command sudo apt install default-jdk to install the default version of OpenJDK.
  3. Verify Installation: After the installation is complete, run java -version to verify that Java is installed and the correct version is displayed.

Once you have Java installed and verified the version, you should be able to use halign2 and halign3 for multiple sequence alignment.

Usage

Reminder: Please ensure that all external files (such as MAFFT, HALIGN, etc.) are properly copied to their corresponding directories. Pay close attention to the relative paths between FMAlign2 and the ext folder to avoid issues during execution.

If you are a Linux user:

   ./FMAlign2 -i /path/to/data [other options]

If you are a Windows user:

   ./FMAlign2.exe -i /path/to/data [other options]

If you want to show the parameter details:

 ./FMAlign2 -h

Parameter Details:


We will demonstrate with the example data mt1x.fasta, assuming you are running on a Linux system.

./FMAlign2 -i ./data/mt1x.fasta -l 20 -c 1 -p mafft -f global -o output.fmaligned2.fasta

This command specifies the following options:

After running this command, you will obtain the aligned output in the output.fmaligned2.fasta file.
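If you want to launch the same command from a script, for instance to align several files in turn, a small wrapper can assemble the argument list. This is only a sketch under stated assumptions: `build_command` and `run_alignment` are hypothetical helper names, the only flags used are those that appear in the example command above, and any further options are passed through unchanged via `extra_args` without being interpreted.

```python
import subprocess


def build_command(input_fasta, output_fasta, method="mafft",
                  extra_args=(), executable="./FMAlign2"):
    """Assemble an argument list from the -i, -p and -o flags of the example."""
    return [executable, "-i", str(input_fasta), "-p", method,
            "-o", str(output_fasta), *map(str, extra_args)]


def run_alignment(*args, **kwargs):
    """Build the command and run it; raises CalledProcessError on a non-zero exit."""
    cmd = build_command(*args, **kwargs)
    subprocess.run(cmd, check=True)
    return cmd
```

Calling `build_command("./data/mt1x.fasta", "output.fmaligned2.fasta", extra_args=["-l", 20, "-c", 1])` yields the example invocation without the `-f` option, which can be appended through `extra_args` in the same way.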


If you want to evaluate the generated alignment results, you can run the sp.py script (requires a Python environment) with the following parameters:

python sp.py --input output.fmaligned2.fasta --match 0 --mismatch 1 --gap1 2 --gap2 2

This command will calculate and print the SP (Sum-of-Pairs) score for the multiple sequence alignment results. The --input parameter specifies the input alignment file (output.fmaligned2.fasta in this case), and the --match, --mismatch, --gap1, and --gap2 parameters define the scoring scheme for matches, mismatches, and gap penalties.

By running this command, you will obtain the SP score, which provides an evaluation of the alignment quality.
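To make the SP score concrete, here is a minimal Python sketch of a sum-of-pairs computation. It is an illustration and not the code of sp.py. The function name `sp_score` is hypothetical, and the meaning given to the two gap parameters is an assumption: `gap1` is applied to a residue paired with a gap and `gap2` to a gap paired with a gap. The sketch counts characters per column instead of looping over every pair of sequences, which gives the same total.

```python
from collections import Counter


def sp_score(rows, match=0, mismatch=1, gap1=2, gap2=2):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    Assumption (not taken from sp.py): gap1 scores a residue-vs-gap pair
    and gap2 scores a gap-vs-gap pair.
    """
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*rows):
        counts = Counter(ch.upper() for ch in column)
        gaps = counts.pop("-", 0)
        residues = sum(counts.values())
        # Pairs of identical residues, then all residue pairs in this column.
        same = sum(n * (n - 1) // 2 for n in counts.values())
        residue_pairs = residues * (residues - 1) // 2
        total += match * same
        total += mismatch * (residue_pairs - same)
        total += gap1 * residues * gaps
        total += gap2 * (gaps * (gaps - 1) // 2)
    return total
```

With the scoring scheme from the command above, the three-row toy alignment `["AC-T", "ACGT", "A-GA"]` scores 0 + 4 + 4 + 2 = 10 under these assumptions.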

Data

The data can be accessed in the data folder. All data files are compressed with xz; please decompress them before use.

Here are the methods to decompress the files on different operating systems:

Decompressing on Linux:

  1. Open the terminal.

  2. Navigate to the directory where the compressed file is located.

  3. Run the following command to decompress the file:

    xz -d filename.xz

    Replace filename.xz with the name of the file you want to decompress.

Decompressing on Windows:

  1. Download and install an xz compression tool for Windows, such as 7-Zip or WinRAR.
  2. Right-click on the compressed file.
  3. Select "Extract to" or a similar option to decompress the file.

Please note that the decompressed files will occupy more disk space. Make sure you have enough disk space to store the uncompressed files.
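If a Python environment is already available (for example, for sp.py), the files can also be decompressed in the same way on both operating systems using the standard-library `lzma` module. This is a minimal sketch; the function name `decompress_xz` is hypothetical. Unlike `xz -d`, it leaves the compressed file in place.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst=None):
    """Decompress a .xz file; by default write next to it without the .xz suffix."""
    src = Path(src)
    if dst is None:
        if src.suffix != ".xz":
            raise ValueError("expected a file name ending in .xz")
        dst = src.with_suffix("")
    dst = Path(dst)
    # Stream in chunks so that large files are never held in memory at once.
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Calling `decompress_xz("filename.xz")` writes the decompressed content to `filename` in the same directory.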

If you need more data, you can visit http://lab.malab.cn/~cjt/MSA/datasets.html for more datasets.

Issue

FMAlign2 is supported by ZOU's Lab. If you have any suggestions or feedback, we encourage you to provide them through the issue page on the project's repository. You can also reach out via email to zpl010720@gmail.com.

We value your input and appreciate your contribution to improving the project. Thank you for taking the time to provide feedback, and we will address your concerns as soon as possible.

Related

Citation

Pinglu Zhang, Huan Liu, Yanming Wei, Yixiao Zhai, Qinzhong Tian, Quan Zou, FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets, Bioinformatics, 2024, btae014, https://doi.org/10.1093/bioinformatics/btae014

License

Apache 2.0 © MALABZ_UESTC Pinglu Zhang