=======================================================================
HyPo--a Hybrid Polisher-- utilises short as well as long reads within a single run to polish a long reads assembly of small and large genomes. It exploits unique genomic kmers to selectively polish segments of contigs using partial order alignment of selective read-segments. As demonstrated on human genome assemblies, Hypo generates significantly more accurate polished assembly in about one-third time with about half the memory requirements in comparison to contemporary widely used polishers like Racon.
Please note that "short reads" doesn't necessarily have to be NGS short reads; HiFi genomic reads (e.g. CCS) like those generated from PacBio SequelII could also be used instead. The requirement is that those reads should be highly accurate (>98% accuracy).
Hypo requires the following as input:
In what follows, short reads can be replaced with HiFi reads.
Broadly, we (conceptually) divide a draft (uncorrected) contig into two types of regions (segments): Strong and Weak. Strong regions are those which have strong evidence (support) of their correctness and thus do not need polishing. Weak regions, on the other hand, will be polished using POA. Each weak region will be polished using either short reads or long reads; short reads taking precedence over long reads. To identify strong regions, we make use of solid kmers (expected unique genomic kmers). Strong regions also play a role in selecting the read-segments to polish their neighbouring weak regions. Furthermore, our approach takes into account that the long reads and thus the assemblies generated from them are prone to homopolymer errors as mentioned in the beginning.
Hypo is only available for Unix-like platforms (Linux and MAC OS). We recommend using Option_1 for a convenient installation and in the case where target machine is different from the one on which hypo is installed. On the other hand, Option_2 is more suitable if the binary of hypo is to be run on the same machine on which it is compiled because then a machine-specific optimised (and thus slightly faster) binary can be produced using the flag -Doptimise_for_native=ON
.
The convenient way of installation is using the conda package as follows:
conda install -c bioconda hypo
Htslib 1.10 may cause conflicts with some of the already installed packages. In such a case, hypo can be installed in a new environment as follows:
conda create --name hypo_env
conda activate hypo_env
conda install -c bioconda hypo
CmakeLists is provided in the project root folder.
For installing from the source, the following requirements are assumed to be installed already (with path to their binaries available in $PATH).
Zlib
OpenMP
GCC (>=7.3)
sudo apt-get update; sudo apt-get install build-essential software-properties-common -y;
sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y; sudo apt update;
sudo apt install gcc-snapshot -y; sudo apt update
sudo apt install gcc-8 g++-8 -y;
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 60 --slave /usr/bin/g++ g++ /usr/bin/g++-8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 60 --slave /usr/bin/g++ g++ /usr/bin/g++-5;
HTSLIB (version >=1.10)
install_deps.sh
in the project folder to install it locally.conda install -c bioconda kmc
Run the following commands to build a binary (executable) hypo
in build/bin
:
If htslib version 1.10 or higher is installed:
git clone --recursive https://github.com/kensung-lab/hypo hypo
cd hypo
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -Doptimise_for_native=ON ..
make -j 8
If htslib is not installed or the version is smaller than 1.10:
git clone --recursive https://github.com/kensung-lab/hypo hypo
cd hypo
chmod +x install_deps.sh
./install_deps.sh
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -Doptimise_for_native=ON ..
make -j 8
Notes:
--recursive
was omitted from git clone
, please run git submodule init
and git submodule update
before compiling.-Doptimise_for_native=ON
. Usage: hypo <args>
** Mandatory args:
-r, --reads-short <str>
Input file name containing reads (in fasta/fastq format; can be compressed). A list of files containing file names in each line can be passed with @ prefix.
-d, --draft <str>
Input file name containing the draft contigs (in fasta/fastq format; can be compressed).
-b, --bam-sr <str>
Input file name containing the alignments of short reads against the draft (in bam/sam format; must have CIGAR information).
-c, --coverage-short <int>
Approximate mean coverage of the short reads.
-s, --size-ref <str>
Approximate size of the genome (a number; could be followed by units k/m/g; e.g. 10m, 2.3g).
** Optional args:
-B, --bam-lr <str>
Input file name containing the alignments of long reads against the draft (in bam/sam format; must have CIGAR information).
[Only Short reads polishing will be performed if this argument is not given]
-o, --output <str>
Output file name.
[Default] hypo_<draft_file_name>.fasta in the working directory.
-t, --threads <int>
Number of threads.
[Default] 1.
-p, --processing-size <int>
Number of contigs to be processed in one batch. Lower value means less memory usage but slower speed.
[Default] All the contigs in the draft.
-k, --kind-sr <str>
Kind of the short reads.
[Valid Values]
sr (Corresponding to NGS reads like Illumina reads)
ccs (Corresponding to HiFi reads like PacBio CCS reads)
[Default] sr.
-m, --match-sr <int>
Score for matching bases for short reads.
[Default] 5.
-x, --mismatch-sr <int>
Score for mismatching bases for short reads.
[Default] -4.
-g, --gap-sr <int>
Gap penalty for short reads (must be negative).
[Default] -8.
-M, --match-lr <int>
Score for matching bases for long reads.
[Default] 3.
-X, --mismatch-lr <int>
Score for mismatching bases for long reads.
[Default] -5.
-G, --gap-lr <int>
Gap penalty for long reads (must be negative).
[Default] -4.
-n, --ned-th <int>
Threshold for Normalised Edit Distance of long arms allowed in a window (in %). Higher number means more arms allowed which may slow down the execution.
[Default] 20.
-q, --qual-map-th <int>
Threshold for mapping quality of reads. The reads with mapping quality below this threshold will not be taken into consideration.
[Default] 2.
-i, --intermed
Store or use (if already exist) the intermediate files.
[Currently, only Solid kmers are stored as an intermediate file.].
-h, --help
Print the usage.
If no --output
(or -o
) is provided, hypo_X.fasta
will be created in the working directory where
If --intermed
(or -i
) is used, hypo will store the intermediate files corresponding to the solid kmers in a folder named aux
in the first run. Another file indicating the progress of the run will also be created in that folder. In the next run (with -i
), those intermediate files will be used instead of running suk
module. Remove -i
or delete aux
folder to start Hypo from the beginning. It is recommended to use this flag while going for multiple rounds of polishing using Hypo so as to make the subsequent rounds faster.
The option --processing-size
(or -p
) controls the number of contigs processed in one batch. By default, all contigs will be processed in a single batch. More the number of contigs processed in a batch higher will be the memory used. Lower number, on the other hand, may not utilise the number of threads efficiently. As a reference, for the whole human genome we used -p 96
on a 48 core machine which used about 380G RAM and finished its run in about 3 hours (only Illumina polishing). We recommend using -p
to be a multiple of -t
. If the genome size is not too large for the available RAM, we recommend processing all the contigs in a single batch (i.e. avoid specifying -p
).
The option --ned-th
(or -n
) is relevant only when long reads polishing is also being done. It sets the threshold for the Normalised Edit Distance which measures similarity between a long read and the corresponding draft segment to which it is aligned. Formally, Normalised edit distance = (1bp edit distance between aligned read_segment and draft /aligned_length of draft)x100
Higher the threshold, more read-segments will be taken into consideration. We recommend using the default value for the whole human genome polishing.
Assuming $R1
, $R2
, $Draft
contain the names of the files containing short reads (paired-end) and draft contigs, respectively. Let $NUMTH
represents the number of threads to be used.
minimap2 --secondary=no --MD -ax sr -t $NUMTH $DRAFT $R1 $R2 | samtools view -Sb - > mapped-sr.bam
samtools sort -@$NUMTH -o mapped-sr.sorted.bam mapped-sr.bam
samtools index mapped-sr.sorted.bam
rm mapped-sr.bam
Assuming $LONGR
and $Draft
contain the names of the files containing the long reads (PacBio or ONT) and draft contigs, respectively. Let $NUMTH
represents the number of threads to be used. $RTYPE
will be map-pb
for PacBio and map-ont
for ONT reads.
minimap2 --secondary=no --MD -ax $RTYPE -t $NUMTH $DRAFT $LONGR | samtools view -Sb - > mapped-lg.bam
samtools sort -@$NUMTH -o mapped-lg.sorted.bam mapped-lg.bam
samtools index mapped-lg.sorted.bam
rm mapped-lg.bam
Create a text file containing the names of the short reads files.
echo -e "$R1\n$R2" > il_names.txt
Let genome size be around 3Gbp and average coverage of Illumina reads is around 55. Say, we would like to use 48 threads and process 96 contigs in a single batch.
A sample run of Hypo (for only short reads polishing) can be:
./bin/hypo -d $DRAFT -r @il_names.txt -s 3g -c 55 -b mapped-sr.sorted.bam -p 96 -t 48 -o whole_genome.h.fa
A sample run of Hypo (for short reads as well as long reads polishing) can be:
./bin/hypo -d $DRAFT -r @il_names.txt -s 3g -c 55 -b mapped-sr.sorted.bam -B mapped-lg.sorted.bam -p 96 -t 48 -o whole_genome.h2.fa
Assuming $READS
and $Draft
contain the names of the files containing CCS reads and draft contigs, respectively. Let $NUMTH
represents the number of threads to be used.
minimap2 --secondary=no --MD -ax asm20 -t $NUMTH $DRAFT $READS | samtools view -Sb - > mapped-ccs.bam
samtools sort -@$NUMTH -o mapped-ccs.sorted.bam mapped-ccs.bam
samtools index mapped-ccs.sorted.bam
rm mapped-ccs.bam
Let genome size be around 3Gbp and average coverage of CCS reads is around 30. Say, we would like to use 48 threads and process all the contigs in a single batch. A sample run of Hypo (for only CCS polishing) can be:
./bin/hypo -d $DRAFT -r $READS -s 3g -c 30 -b mapped-ccs.sorted.bam -t 48 -o whole_genome.h.fa
For the whole human genome (HG002) assembly, Hypo took about 3 hours and about 380G RAM to polish (on a 48 cores machine with 48 threads) using only Illumina reads. For polishing using Illumina as well as PacBio reads, time taken was about 4 hours and 15 minutes using about 410G RAM. The method and partial results can be found at BioRxiv.
HyPo: Super Fast & Accurate Polisher for Long Read Genome Assemblies
Ritu Kundu, Joshua Casey, Wing-Kin Sung
bioRxiv 2019.12.19.882506; doi: https://doi.org/10.1101/2019.12.19.882506
Other than raising issues on Github, you can contact Ritu Kundu (dcsritu@nus.edu.sg) or Joshua Casey (joshuac@comp.nus.edu.sg) for getting help in installation/usage or any other related query.