LRScaf: improving draft genomes using long noisy reads
A hybrid assembly strategy is a reasonable and promising way to combine the strengths and offset the weaknesses of Next-Generation Sequencing (NGS) and Third-Generation Sequencing (TGS) technologies. Following this principle, we present a new toolkit named LRScaf (Long Reads Scaffolder), which uses TGS data to improve draft genome assemblies. Its main features are short running time, accuracy, and contiguity. Scaffolding the rice genome can be done in 20 minutes with the minimap mapper. In human, LRScaf improved the draft assembly NG50 from 127.5 Kb to 10.4 Mb on the 20x PacBio CHM1 dataset and from 115.7 Kb to 17.4 Mb on the ~35x Nanopore NA12878 dataset.
################################################################################
Requirements
################################################################################
Java version: 1.8+.
################################################################################
Building LRScaf project
################################################################################
There are two ways to build and run this project: run the jar package directly, or unzip the source package and build it with Maven.
>java -jar LRScaf-<version>.jar -x <configure.xml>
>unzip lrscaf-<version>.zip
>cd lrscaf-<version>
>mvn package
################################################################################
Quick start
################################################################################
>java -jar LRScaf-<version>.jar -x <configure.xml>
>java -jar LRScaf-<version>.jar -c <draft_assembly.fasta> -a <alignment.m4> -t <m4> -o <output_folder> [options]
>java -jar LRScaf-<version>.jar --contig <draft_assembly.fasta> --alignedFile <alignment.m4> -t <m4> --output <output_folder> [options]
################################################################################
An Oryza sativa L. Tutorial
################################################################################
>SOAPdenovo127mer pregraph -s ./assembly.config -d 1 -K 83 -R -p 48 -o ./83/83
>SOAPdenovo127mer contig -R -g ./83/83
>SOAPdenovo127mer map -p 48 -s ./assembly.config -g ./83/83
>SOAPdenovo127mer scaff -p 48 -L 150 -F -g ./83/83
>minimap2 -t 8 ./draft.fa ./tgs20x.fa >./aln.mm
>java -Xms100g -Xmx100g -jar LRScaf.jar -x ./scafconf.xml
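The identity threshold described under Parameters of LRScaf depends on the mapper, so it can help to look at the match ratios in the alignment file before choosing a value. The following is a minimal Python sketch and is not part of LRScaf. It assumes the standard PAF layout, in which column 10 is the number of matching bases and column 11 is the alignment block length, and it treats their quotient as the identity. Whether LRScaf computes identity in exactly this way is an assumption. The function name paf_identities and the two records are made up for illustration.

```python
import io
from statistics import median


def paf_identities(handle):
    """Yield matches / alignment-block-length for each PAF record.

    Assumes the 12 mandatory tab-separated PAF columns; this ratio is
    an illustrative stand-in, not necessarily LRScaf's own definition.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        n_match = int(fields[9])     # column 10: number of matching bases
        block_len = int(fields[10])  # column 11: alignment block length
        if block_len > 0:
            yield n_match / block_len


# Two made-up PAF records, used only to exercise the function.
example = io.StringIO(
    "read1\t10000\t100\t9000\t+\tctg1\t50000\t200\t9100\t2000\t8900\t60\n"
    "read2\t8000\t0\t4000\t-\tctg2\t30000\t1000\t5000\t1000\t4000\t60\n"
)
ratios = list(paf_identities(example))
print("per-alignment ratios:", [round(r, 3) for r in ratios])
print("median ratio:", round(median(ratios), 3))
```

For a real run, the io.StringIO object would be replaced by an open file handle on ./aln.mm. The spread of the ratios then gives a rough idea of where an identity cut-off would start discarding alignments.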
################################################################################
Parameters of LRScaf
################################################################################
LRScaf accepts parameters from an XML configuration file or from the command line. The XML configuration file is recommended. A template configuration file in XML format, named "scafconf.xml", is included in the project. On the command line, LRScaf supports GNU-like long (dash-dash) and short (dash) options. The following table shows the meaning of each parameter and its default value, if available.
The first and second columns are the command-line parameters in long style and the corresponding short style.
The third column is the code in the XML configuration file. NA means the parameter is not available in the XML configuration file.
The fourth column gives the details and the default value of the option, if available.
Parameter | Abbreviation | XML Code | Details |
---|---|---|---|
xml | x | NA | The XML configuration file. All other command-line parameters are ignored if this is set. |
contig | c | contig | The contigs file of draft assembly in fasta format. |
m5 | m5 | m5 | The alignment file in -m 5 format of BLASR. |
m4 | m4 | m4 | The alignment file in -m 4 format of BLASR. |
sam | sam | sam | The alignment file in sam format of BLASR. |
mm | mm | mm | The alignment file in PAF format of Minimap. |
output | o | output | The output folder. |
miniCntLen | micl | min_contig_length | The minimum contig length to be included for scaffolding. Default: <200> bp. |
identity | i | identity | The identity threshold for filtering invalid alignments. Default: <0.8>. This value must be modified according to the mapper. For a BLASR alignment file, a higher value means a higher identity. For a Minimap alignment file, the value should not be larger than 0.3 and could be set to 0.1. |
miniOLLen | mioll | min_overlap_length | The minimum overlap length of contig. Default: <160> bp. |
miniOLRatio | miolr | min_overlap_ratio | The minimum overlap length ratio of contig. Default: <0.8>. If the overlap length is larger than miniOLLen, the ratio of overlap length is computed as overlap_length/contig_length. |
maOHLen | maohl | max_overhang_length | The maximum overhang length of contig. Default: <300> bp. |
maOHRatio | maohr | max_overhang_ratio | The maximum overhang ratio of contig. Default: <0.1>. If the overhang length is less than maohl, the ratio of overhang length is computed as overhang_length/contig_length. |
maELen | mael | max_end_length | The maximum ending length of long read. Default: <300> bp. |
maERatio | maer | max_end_ratio | The maximum ending ratio of long read. Default: <0.1>. The ending length (ending_len) is computed as long_read_length * maer, then def_ending_len = (mael >= ending_len ? ending_len : mael). |
miSLN | misl | min_supported_links | The minimum number of supporting links. Default: <1>. If the depth of long reads is less than 10x, misl could be set to 1. |
ratio | r | ratio | The ratio for deleting error prone edges in divergence nodes. Default: <0.2>. |
mr | mr | repeat_mask | The indicator for masking repeats. Default: <true>. Masking repeats will reduce the divergent nodes in the scaffolding graph and improve the contiguity of assemblies. Setting it to true is recommended. |
tiplength | tl | tip_length | The maximum tip length. Default: <1500> bp. |
iqrtime | iqrt | iqr_time | The IQR multiplier for marking contigs as repeats by their coverage. Default: <1.5>. |
mmcm | mmcm | mmcm | The parameter to filter invalid Minimap alignments. Default: <8>. Only for Minimap alignment. |
process | p | process | The multi-threads settings. Default: <4>. |
help | h | NA | Print this help information. |
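The length and ratio rules in the table can be written out as small functions. This is a minimal Python sketch of the formulas exactly as the table states them, with the table defaults as default arguments; it is not taken from the LRScaf source code. The function names are hypothetical, and how LRScaf finally compares the computed ratios against miolr and maohr is not shown here because the table does not spell it out.

```python
def effective_end_length(read_length, mael=300, maer=0.1):
    """def_ending_len = (mael >= ending_len ? ending_len : mael)."""
    ending_len = read_length * maer
    return ending_len if mael >= ending_len else mael


def overlap_ratio(overlap_length, contig_length, mioll=160):
    """Return overlap_length/contig_length, computed only when the
    overlap is larger than mioll; otherwise return None."""
    if overlap_length > mioll:
        return overlap_length / contig_length
    return None


def overhang_ratio(overhang_length, contig_length, maohl=300):
    """Return overhang_length/contig_length, computed only when the
    overhang is less than maohl; otherwise return None."""
    if overhang_length < maohl:
        return overhang_length / contig_length
    return None


# Made-up lengths in bp, chosen to hit both branches of each rule.
short_read_end = effective_end_length(2000)    # ratio-based value is below mael
long_read_end = effective_end_length(12000)    # capped at mael
large_overlap = overlap_ratio(1800, 2000)      # overlap above mioll
small_overlap = overlap_ratio(100, 2000)       # overlap not above mioll
small_overhang = overhang_ratio(50, 2000)      # overhang below maohl
large_overhang = overhang_ratio(400, 2000)     # overhang not below maohl

print(short_read_end, long_read_end)
print(large_overlap, small_overlap)
print(small_overhang, large_overhang)
```

The ending length is the smaller of the fixed cap (mael) and the read-length-dependent value (long_read_length * maer), so short reads get a proportionally shorter ending region while long reads are capped.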
################################################################################
XML Configuration File Content
################################################################################
<?xml version="1.0" encoding="UTF-8"?>
<scaffold>
<!--The input file for scaffolding, including contigs and aligned files (i.e. m5, m4 or mm file) -->
<input>
<contig>Draft assembly in fasta format.</contig>
<m4>The aligned file in BLASR -m 4 format.</m4>
</input>
<!-- The output folder for scaffolding -->
<output>The output folder.</output>
<!-- The parameters for scaffolding-->
<paras>
<!--More details are shown in README.md-->
<min_contig_length>500</min_contig_length>
<identity>0.8</identity>
<min_overlap_length>400</min_overlap_length>
<min_overlap_ratio>0.8</min_overlap_ratio>
<max_overhang_length>500</max_overhang_length>
<max_overhang_ratio>0.1</max_overhang_ratio>
<max_end_length>500</max_end_length>
<max_end_ratio>0.1</max_end_ratio>
<min_supported_links>2</min_supported_links>
<tip_length>1500</tip_length>
<ratio>0.2</ratio>
<repeat_mask>true</repeat_mask>
<iqr_time>3</iqr_time>
<mmcm>8</mmcm> <!--only for Minimap Alignment.-->
<process>4</process>
</paras>
</scaffold>
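Before starting a long run, the configuration file can be sanity-checked with a short script. The following is a minimal Python sketch and is not part of LRScaf. It reads the input, output and paras elements of the layout shown above and applies one rule from the parameter table: for a Minimap alignment file, the identity should not be larger than 0.3. The function names, file paths and values in the example are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_scaffold_config(xml_text):
    """Return (inputs, output, paras) from a scaffold configuration string."""
    root = ET.fromstring(xml_text)

    def children(tag):
        node = root.find(tag)
        if node is None:
            return {}
        # Keep real elements only; comments are not element tags.
        return {c.tag: (c.text or "").strip()
                for c in node if isinstance(c.tag, str)}

    output = (root.findtext("output") or "").strip()
    return children("input"), output, children("paras")


def check_identity(inputs, paras):
    """Return a warning string if identity looks too high for Minimap input."""
    identity = float(paras.get("identity", "0.8"))  # 0.8 is the table default
    if "mm" in inputs and identity > 0.3:
        return "identity %.2f is above 0.3 for a Minimap alignment" % identity
    return None


# Hypothetical configuration for a Minimap (PAF) alignment.
EXAMPLE = """<scaffold>
  <input>
    <contig>./draft.fa</contig>
    <mm>./aln.mm</mm>
  </input>
  <output>./scaffolds</output>
  <!-- The parameters for scaffolding -->
  <paras>
    <identity>0.1</identity>
    <min_supported_links>2</min_supported_links>
    <process>4</process>
  </paras>
</scaffold>"""

inputs, output, paras = load_scaffold_config(EXAMPLE)
warning = check_identity(inputs, paras)
print(inputs, output, paras)
print("warning:", warning)
```

For a file on disk, the text can be read first and then passed to load_scaffold_config. Parameters missing from paras are simply absent from the returned dictionary, so any further checks need to supply their own defaults.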
################################################################################
Licence
################################################################################
LRScaf is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
If you have any questions, please feel free to contact me <qinmao@caas.cn>.