Developers: Dr. Hsin-Nan Lin, Dr. Yaw-Ling Lin and Dr. Wen-Lian Hsu Institute of Information Science, Academia Sinica, Taiwan.
Characterizing the taxonomic diversity of a microbial community is very important to understand the roles of microorganisms. Next generation sequencing (NGS) provides great potential for investigation of a microbial community and leads to Metagenomic studies. NGS generates DNA sequences directly from microorganism samples, and it requires analysis tools to identify microbial species (or taxonomic composition) and estimate their relative abundance in the studied community. Here we developed a novel metagenomic analysis tool, called StrainPro, which is highly accurate both at characterizing microorganisms at strain-level and estimating their relative abundances. A unique feature of StrainPro is it identifies representative sequence segments from reference genomes. We generate three simulated datasets using known strain sequences and another three simulated datasets using unknown strain sequences.
Please use the command
$ git clone https://github.com/hsinnan75/StrainPro.git
to download the package of StrainPro.
To compile StrainPro, it requires libboost-all-dev, libbz2-dev, and liblzma-dev installed in your system.
Please change to StrainPro's folder and type 'make' to compile all programs of StrainPro.
setenv StrainPro_DIR StrainPro_download_path/bin
Please run "run_test.sh" to test StrainPro.
To download taxonomy information, please use the script "download_taxonomy.sh" to download taxonomy files (requisite for StrainPro).
$./download_taxonomy.sh
If you have downloaded or customized taxonomy dump files, you may specify the file path and run StrainPro with '-dump TaxonomyPath'. StrainPro will find "nodes.dmp" and "merged.dmp" in the designated directory.
To download reference genomes, please use the script "download_genomic_library.sh" to download reference genomes.
$./download_genomic_library.sh library (library: archaea | bacteria | viral | fungi | human)
The script will download all complete chromosomes/genomes of the designated library.
If you would like to use a customized reference library, please make sure your reference genomes follow the below requirements:
taxid|562|NZ_CP027599.1 Escherichia coli strain 97-3250 ATCCCGGCCCCGGCAGAACCGACCTATCGTTCTAACGTAAACGT....
To build a database, please use the StrainPro-build command
$bin/StrainPro-build -r reference-fna -o ref_idx [ref_idx is the output folder for BWT indexes].
It will take a while to process the genome sequences and build BWT indexes.
To characterize the taxonomic composition of the input metagenomic data, please use the StrainPro-map command
$bin/StrainPro-map -i idx -f read -o profile.txt [idx=BWT_index prefix / folder of BWT_indexes, read=NGS_dataset]
You may specify a specific index or a folder of multiple indexes. A specific index is designated with its prefix. The input NGS data is in FASTQ or FASTA format. Multiple files are separated with a space character. The output is a text file.
StrainPro ouputs the taxonomic composition of the input metagenomic data directly. The output format is
TaxID Read_count Est_depth Est_relative_abundance Confidence_score
where TaxID is the NCBI taxon identifier; Read_count is the number of reads that are classified into that taxon id; Est_depth is the estimated read depth of that taxon id; Est_relative_abundance is the estimated relative abundance (percentage); Confidence_score is the confidence score of that prediction.