NucMerge improves genome assembly accuracy by incorporating information derived from an alternative assembly and paired-end Illumina reads from the same genome. It corrects insertion, deletion, substitution, and inversion errors and locates inter- and intra-chromosomal rearrangement errors. The tool is described in the manuscript mentioned in Section 6.
NucMerge can be run on Linux and Mac OS.
Tools that should be preinstalled and added to the PATH before running NucMerge:
NucBreak (https://github.com/uio-bmi/NucBreak) is provided together with NucMerge.
NucMerge was tested using Python 2.7, Pilon v1.22, NucDiff v2.0.2, NucBreak v1.0, bwa v0.7.5, samtools v1.3.1, bowtie2 v2.2.9, and MUMmer v3.23.
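Before a run, it can help to confirm that the command-line tools are visible on the PATH. The following is a minimal sketch and not part of NucMerge: it is written for Python 3 (it relies on `shutil.which`), and the executable names it checks (`bwa`, `samtools`, `bowtie2`, and `nucmer` from MUMmer) are assumptions about a typical installation. Adjust the list to your own setup.

```python
import shutil

# Assumed executable names for a typical installation; edit as needed.
ASSUMED_EXECUTABLES = ["bwa", "samtools", "bowtie2", "nucmer"]


def find_missing(executables, which=shutil.which):
    """Return the names in `executables` that cannot be located on the PATH."""
    return [name for name in executables if which(name) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All assumed executables were found on PATH.")
```

The `which` argument exists only so that the lookup can be replaced in tests; in normal use it is left at its default.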
Clone the NucMerge GitHub repository using the following command:
git clone --recursive https://github.com/uio-bmi/NucMerge.git
To run NucMerge, run nucmerge.py with valid input arguments:
python nucmerge.py [-h] [--proc [int]] [--version]
Target_assembly.fasta Query_assembly.fasta PE_reads_1.fastq PE_reads_2.fastq Output_dir Prefix
Positional arguments:
Optional arguments:
An example run with the default NucMerge parameter values:
python nucmerge.py my_target_asmb.fasta my_query_asmb.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix
An example run with the --proc parameter set explicitly:
python nucmerge.py --proc 1 my_target_asmb.fasta my_query_asmb.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix
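When NucMerge is called from another script, the command line above can be assembled programmatically. This is a minimal sketch that follows the usage line shown earlier. The helper names `build_nucmerge_command` and `run_nucmerge` are hypothetical, and the interpreter name `python` and script path `nucmerge.py` are assumptions that may need adjusting.

```python
import subprocess


def build_nucmerge_command(target, query, reads_1, reads_2, output_dir, prefix,
                           proc=None, script="nucmerge.py", python="python"):
    """Build the argument list for a NucMerge run (hypothetical helper)."""
    cmd = [python, script]
    if proc is not None:
        # --proc is the only optional argument with a value in the usage line.
        cmd += ["--proc", str(proc)]
    # Positional arguments, in the order given in the usage line.
    cmd += [target, query, reads_1, reads_2, output_dir, prefix]
    return cmd


def run_nucmerge(*args, **kwargs):
    """Run NucMerge and raise CalledProcessError if it exits with an error."""
    subprocess.check_call(build_nucmerge_command(*args, **kwargs))
```

For example, `build_nucmerge_command("my_target_asmb.fasta", "my_query_asmb.fasta", "my_pe_reads_1.fastq", "my_pe_reads_2.fastq", "my_output_dir", "my_prefix", proc=1)` reproduces the second example command as a list of arguments.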
NucMerge stores the output results produced by NucDiff, NucBreak, and Pilon in the following directories:
<output_dir>/NucDiff
<output_dir>/NucBreak_1
<output_dir>/NucBreak_2
<output_dir>/Pilon_1
<output_dir>/Pilon_2
NucMerge produces the following files, stored in <output_dir>:
The ‹Prefix›_local_differences.gff file contains information about the different types of insertion, deletion, and substitution errors detected in the target assembly.
The following information is contained in the file:
The description of the query_snps.gff and query_struct.gff files produced by NucDiff and all possible error types can be found at https://github.com/uio-cels/NucDiff/wiki.
The ‹Prefix›_local_differences.gff file example:
##gff-version 3
##sequence-region NODE_1 1 273095
NODE_1 NucMerge_v1.0 SO:1000002 27951 27951 . . . ID=LD_1;ID_nucdiff=SNP_4;Name=substitution;old_len=1;new_len=1;old_seq=C;new_seq=G;color=#42C042
NODE_1 NucMerge_v1.0 SO:0000667 129759 129759 . . . ID=LD_2;ID_nucdiff=SNP_11;Name=insertion;old_len=1;new_len=0;old_seq=G;new_seq=.;color=#EE0000
NODE_1 NucMerge_v1.0 SO:0000667 233592 233601 . . . ID=LD_3;ID_nucdiff=SNP_27;Name=inserted_gap;old_len=10;new_len=0;old_seq=NNNNNNNNNN;new_seq=.;color=#EE0000
##sequence-region NODE_2 1 211125
NODE_2 NucMerge_v1.0 SO:1000035 139350 139382 . . . ID=LD_4;ID_nucdiff=SV_21;Name=duplication;old_len=33;new_len=0;old_seq=CCCGGGAGCATAGATAACTATGTGACCGGGGTG;new_seq=.;color=#EE0000
NODE_2 NucMerge_v1.0 SO:0000159 173435 173435 . . . ID=LD_5;ID_nucdiff=SV_33;Name=collapsed_tandem_repeat;old_len=0;new_len=20;old_seq=.;new_seq=AGCCAGCGGCTGTTTGTCAG;color=#0000EE
...
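Records like those in the example above can be read with a few lines of Python. This is a minimal sketch and not part of NucMerge: the function names are hypothetical, and the parser assumes nine whitespace- or tab-separated columns with `key=value` attributes separated by semicolons, as in the example.

```python
from collections import Counter


def parse_gff_line(line):
    """Parse one GFF record into a dict; attributes become a nested dict."""
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 GFF columns, got %d" % len(fields))
    seqid, _source, so_term, start, end, _score, _strand, _phase, raw = fields
    attributes = dict(item.split("=", 1) for item in raw.split(";") if item)
    return {"seqid": seqid, "type": so_term,
            "start": int(start), "end": int(end),
            "attributes": attributes}


def read_gff(lines):
    """Yield parsed records, skipping blank lines and '#' directive lines."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_gff_line(line)


def count_by_name(lines):
    """Count records per value of the Name attribute."""
    return Counter(rec["attributes"].get("Name") for rec in read_gff(lines))
```

Passing an open file handle to `count_by_name` gives a quick summary of how many differences of each type were reported.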
The ‹Prefix›_structural_differences.gff file contains information about inversion errors and structural breakpoints corresponding to inter- and intra-chromosomal rearrangement errors detected in the target assembly.
The following information is contained in the file:
The description of the query_struct.gff file produced by NucDiff and all possible error types can be found at https://github.com/uio-cels/NucDiff/wiki.
The ‹Prefix›_structural_differences.gff file example:
##gff-version 3
##sequence-region NODE_1 1 617
NODE_1 NucMerge_v1.0 SO:0000699 331 430 . . . ID=SD_1;Name=breakpoint;ID_nucdiff=SV_149;Type_nucdiff=translocation-inserted_gap;color=#0000EE
##sequence-region NODE_2 1 4763
NODE_2 NucMerge_v1.0 SO:0000699 4478 4478 . . . ID=SD_2;Name=breakpoint;ID_nucdiff=SV_174;Type_nucdiff=reshuffling-part_1_gr_0;color=#0000EE
##sequence-region NODE_3 1 208973
NODE_3 NucMerge_v1.0 SO:1000036 418 1022 . . . ID=SD_3;Name=inversion;ID_nucdiff=SV_317;Type_nucdiff=inversion;color=#EE0000
NODE_3 NucMerge_v1.0 SO:0000699 71741 71926 . . . ID=SD_4;Name=breakpoint;ID_nucdiff=SV_2577;Type_nucdiff=translocation-inserted_gap;color=#0000EE
NODE_3 NucMerge_v1.0 SO:0000699 110857 110857 . . . ID=SD_5;Name=breakpoint;ID_nucdiff=SV_2629;Type_nucdiff=reshuffling-part_2_gr_1;color=#0000EE
NODE_3 NucMerge_v1.0 SO:0000699 110857 110857 . . . ID=SD_6;Name=breakpoint;ID_nucdiff=SV_2630;Type_nucdiff=inversion;color=#0000EE
...
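To see where a sequence has breakpoints, the records with `Name=breakpoint` can be grouped by sequence name. This is a minimal sketch and not part of NucMerge: the function name is hypothetical, and it assumes the column layout shown in the example above. Identical regions reported more than once, such as the two records at position 110857 on NODE_3, are collapsed into one.

```python
from collections import defaultdict


def breakpoints_by_sequence(lines):
    """Map each sequence name to its sorted, de-duplicated (start, end) breakpoint regions."""
    regions = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF directives
        fields = line.strip().split(None, 8)
        if len(fields) != 9:
            continue  # skip anything that is not a full 9-column record
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if kv)
        if attrs.get("Name") == "breakpoint":
            regions[fields[0]].add((int(fields[3]), int(fields[4])))
    return {seqid: sorted(found) for seqid, found in regions.items()}
```

Inversion records are ignored here because they carry `Name=inversion` rather than `Name=breakpoint`.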
The file contains the resulting assembly, obtained from the target assembly by (1) correcting inversion errors and the errors listed in ‹Prefix›_local_differences.gff and (2) splitting target assembly sequences in the regions that contain breakpoints listed in ‹Prefix›_structural_differences.gff.
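The splitting step can be illustrated with a toy function. This is only a sketch under stated assumptions and not NucMerge's actual implementation: it assumes that breakpoint regions are given as 1-based inclusive (start, end) pairs, that the bases inside each region are discarded, and that the flanking pieces are kept. How NucMerge treats the bases inside a breakpoint region is not described here, so this behavior is an assumption. The function name is hypothetical.

```python
def split_at_regions(sequence, regions):
    """Split `sequence` around 1-based inclusive regions, dropping the region bases.

    Assumption for illustration only: bases inside a region are discarded.
    """
    pieces = []
    cursor = 0  # 0-based index of the first base not yet handled
    for start, end in sorted(regions):
        if start - 1 > cursor:
            # Keep the stretch between the previous region and this one.
            pieces.append(sequence[cursor:start - 1])
        # Move past this region; max() handles overlapping regions.
        cursor = max(cursor, end)
    if cursor < len(sequence):
        pieces.append(sequence[cursor:])  # trailing piece after the last region
    return pieces
```

For instance, splitting a ten-base sequence around the region (4, 5) yields two pieces: the first three bases and the last five.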
To cite NucMerge in your publication:
Khelik K., et al. NucMerge: Genome assembly quality improvement assisted by alternative assemblies and paired-end Illumina reads. (in preparation)