mfruzan / CoreDetector

MIT License
CoreDetector Multiple Genome Aligner

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

CoreDetector is a new, fast and flexible program that identifies the core-genome sequence of larger and more evolutionarily diverse genomes.

Quick start

Installation and configuration of CoreDetector on Linux-based operating systems proceeds as follows.

Step 1. Configure your $PATH for CoreDetector binary dependencies

CoreDetector depends on the Minimap2 versatile pairwise aligner (and its companion paftools.js utility), as well as the K8 JavaScript shell. The easiest approach is to install them into a dedicated folder on your $PATH, so that they are always available when CoreDetector runs:

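mkdir -p "$HOME/bin"
# Single quotes keep $HOME and $PATH unexpanded, so they are evaluated each time .bashrc is read
echo 'export PATH="$HOME/bin:$PATH"' >> "$HOME/.bashrc" && source "$HOME/.bashrc"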
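To confirm that the folder really is on the search path in the current shell, a check along the following lines can help. This is only a minimal sketch: the helper name path_contains is hypothetical and is not part of CoreDetector.

```shell
# Hypothetical helper (not part of CoreDetector): succeeds if the given
# directory is one of the colon-separated entries of $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if path_contains "$HOME/bin"; then
    echo "$HOME/bin is on PATH"
else
    echo "$HOME/bin is NOT on PATH; open a new shell or re-source ~/.bashrc"
fi
```

Wrapping $PATH in extra colons lets the pattern match whole entries only, so a directory such as $HOME/bin2 is not mistaken for $HOME/bin.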

Step 2. Download and install Minimap2 (v2.26)

Download the v2.26 release of Minimap2 from its GitHub repository (https://github.com/lh3/minimap2). Alternatively, copy and paste the commands below to download, compile and install Minimap2 automatically. Note: compilation requires compiler tools and the zlib development headers. On Ubuntu 22.04, these can be installed with sudo apt-get -y install build-essential zlib1g-dev; you may need to run sudo apt-get update first.

wget "https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26.tar.bz2"
tar -xjf minimap2-2.26.tar.bz2
cd minimap2-2.26 && make
cp minimap2 misc/paftools.js $HOME/bin/
cd ..

Step 3. Download and install K8 (v1.0)

Get the v1.0 release of the K8 JavaScript shell from its GitHub repository (https://github.com/attractivechaos/k8). Alternatively, run the following commands to download and install the precompiled K8 binary:

wget "https://github.com/attractivechaos/k8/releases/download/v1.0/k8-1.0.tar.bz2"
tar -xjf k8-1.0.tar.bz2
cp k8-1.0/k8-x86_64-Linux $HOME/bin/k8

Step 4. Install a Java runtime/development kit

OpenJDK 11 (or later) has been confirmed to work well with CoreDetector. On most Linux systems it can be installed via the package manager. For example, to install OpenJDK 11 (the default JDK) on Ubuntu 22.04:

sudo apt-get -y install openjdk-11-jdk  # or default-jdk
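After Steps 1 to 4, it can be useful to verify that every dependency is reachable before running the pipeline. The sketch below assumes the tools were installed under the names used above (minimap2, paftools.js, k8, java) and that paftools.js is executable. The helper missing_tools is hypothetical and not part of CoreDetector.

```shell
# Hypothetical helper: print the name of each listed command that cannot
# be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools minimap2 paftools.js k8 java)
if [ -z "$missing" ]; then
    echo "All CoreDetector dependencies were found."
else
    # $missing is left unquoted on purpose so the names print on one line
    echo "Missing from PATH:" $missing
fi
```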

Step 5. Download CoreDetector and run an example pipeline

Finally, clone this GitHub repository to download CoreDetector, and run a test case on the provided example set of genomes to confirm that the tool is working correctly.

git clone https://github.com/mfruzan/CoreDetector.git
cd CoreDetector
chmod +x pipeline_Minimap.sh

./pipeline_Minimap.sh -g example/quick_genomes.txt -o example_out -d 20 -n 16

Quick start (using Docker)

Alternatively, set up CoreDetector in a Docker container using the provided Dockerfile, which automates the installation. For information about setting up Docker on Windows/Mac/Linux and using containers, see docs.docker.com.

git clone https://github.com/mfruzan/CoreDetector.git
cd CoreDetector
sudo docker build -t coredetector .
sudo docker run -it -v $(pwd)/example:/example coredetector

In the interactive shell for the container, you can immediately run the Multiple Genome Aligner tool:

./pipeline_Minimap.sh -g example/quick_genomes.txt -o example/output -d 20 -n 16

Usage

Use the CoreDetector multiple alignment tool (with the Minimap2 pipeline) as follows:

./pipeline_Minimap.sh -g <genome_list> -o <out_dir> -d <divergence> -n <ncpus> -m <minlength> -c <chromosome>

The main input file for CoreDetector is the <genome_list> text file, consisting of lines of genomes:

Alg130  example/Alg130.fna
DW5 example/DW5.fna
M4  example/M4.fna

Each line contains an alias (e.g., Alg130, DW5), followed by a space or tab, and then the path to the FASTA file for that genome. In this example, Alg130 is the query genome, and the remaining genomes are the subjects. This text file is passed to ./pipeline_Minimap.sh with the -g flag.
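Because a malformed genome list is an easy mistake to make, a quick pre-flight check of the file may save a failed run. The following is a sketch under stated assumptions, not CoreDetector's own validation: the helper check_genome_list is hypothetical, it skips blank lines, and it treats any path containing whitespace as an error.

```shell
# Hypothetical helper: check a genome list file. Each non-empty line must
# have exactly two whitespace-separated fields (alias, FASTA path), and
# the FASTA path must point to a readable file.
# Prints one message per problem and returns non-zero if any were found.
check_genome_list() {
    list=$1
    [ -r "$list" ] || { echo "cannot read genome list: $list"; return 1; }
    problems=0
    lineno=0
    # The "|| [ -n ... ]" part handles a last line without a trailing newline
    while read -r name fasta extra || [ -n "$name" ]; do
        lineno=$((lineno + 1))
        [ -z "$name" ] && continue            # skip blank lines
        if [ -z "$fasta" ] || [ -n "$extra" ]; then
            echo "line $lineno: expected '<alias> <fasta_path>'"
            problems=$((problems + 1))
        elif [ ! -r "$fasta" ]; then
            echo "line $lineno: cannot read $fasta"
            problems=$((problems + 1))
        fi
    done < "$list"
    [ "$problems" -eq 0 ]
}

# Example: check_genome_list example/quick_genomes.txt && echo "genome list looks fine"
```

Relative FASTA paths are resolved against the current directory here, so the check should be run from the same directory as the pipeline.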

The -o argument specifies the output directory. CoreDetector generates two output files in the specified output folder: msa.maf and concatinated_msa.fa. Note that the directory will be created if it does not already exist.
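After a run, the presence of both output files can be confirmed with a short check such as the one below. The helper outputs_present is hypothetical, and requiring the files to be non-empty is an assumption of this sketch rather than documented behaviour. The file name concatinated_msa.fa is spelled exactly as given above.

```shell
# Hypothetical helper: succeed if both expected output files exist and are
# non-empty in the given output directory.
outputs_present() {
    for f in msa.maf concatinated_msa.fa; do
        [ -s "$1/$f" ] || { echo "missing or empty: $1/$f"; return 1; }
    done
}

# Example: outputs_present example_out && echo "both output files were written"
```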

The -d argument is the expected divergence level, and can be any integer between 1 and 40.
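A wrapper script that calls the pipeline may want to reject out-of-range values early. The sketch below assumes the range 1 to 40 is inclusive. The helper valid_divergence is hypothetical and not part of CoreDetector.

```shell
# Hypothetical helper: succeed only if the argument is an integer in 1..40
# (inclusive bounds are an assumption of this sketch).
valid_divergence() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;      # empty or contains a non-digit
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 40 ]
}

d=20
if valid_divergence "$d"; then
    echo "-d $d is in range"
else
    echo "-d must be an integer between 1 and 40"
fi
```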

The remaining arguments shown in the usage line above (-n <ncpus>, -m <minlength> and -c <chromosome>) are optional, and allow fine-tuning of the program configuration.

The CoreDetector Manual explains program usage in detail, and lists further analysis examples.