uio-bmi / NucMerge

Genome assembly quality improvement assisted by alternative assemblies and paired-end Illumina reads
Mozilla Public License 2.0
7 stars 1 forks source link

NucMerge manual



1 Introduction

NucMerge improves genome assembly accuracy by incorporating information derived from an alternative assembly and paired-end Illumina reads from the same genome. It corrects insertion, deletion, substitution, and inversion errors and locates inter- and intra-chromosomal rearrangement errors. The tool is described in the manuscript mentioned in Section 6.



2 Prerequisites

NucMerge can be run on Linux and Mac OS.

Tools that should be preinstalled and added to the PATH before running NucMerge:

NucBreak (https://github.com/uio-bmi/NucBreak) is provided together with NucMerge.

NucMerge was tested using Python 2.7, Pilon v1.22, NucDiff v2.0.2, NucBreak v1.0, bwa v0.7.5, samtools v.1.3.1, bowtie2 2.2.9, and MUMmer 3.23.



3 Installation

Clone the NucMerge github repository using the following command:

git clone --recursive https://github.com/uio-bmi/NucMerge.git



4 Running

4.1 Command line syntax and input arguments

To run NucMerge, run nucmerge.py with valid input arguments:

python nucmerge.py [-h] [--proc [int]] [--version]
                   Target_assembly.fasta Query_assembly.fasta PE_reads_1.fastq PE_reads_2.fastq Output_dir Prefix

Positional arguments:

Optional arguments:

4.2 Running examples

A running example with the NucMerge predefined parameter values:

python nucmerge.py my_target_asmb.fasta my_query_asmb.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix

A running example with the introduced --proc parameter value:

python nucmerge.py --proc 1 my_target_asmb.fasta my_query_asmb.fasta my_pe_reads_1.fastq my_pe_reads_2.fastq my_output_dir my_prefix



5 NucMerge output

NucMerge stores the output results produced by NucDiff, NucBreak, and Pilon in the following directories:

NucMerge produces the following files stored in <output_dir>:

5.1 ‹Prefix›_local_differences.gff

The file contains information about the different types of insertion, deletion, and substitution errors detected in the target assembly.

The following information is contained in the file:

The description of the query_snps.gff and query_struct.gff files produced by NucDiff and all possible error types can be found at https://github.com/uio-cels/NucDiff/wiki.

The ‹Prefix›_local_differences.gff file example:

##gff-version 3
##sequence-region   NODE_1  1   273095
NODE_1  NucMerge_v1.0   SO:1000002  27951   27951   .   .   .   ID=LD_1;ID_nucdiff=SNP_4;Name=substitution;old_len=1;new_len=1;old_seq=C;new_seq=G;color=#42C042
NODE_1  NucMerge_v1.0   SO:0000667  129759  129759  .   .   .   ID=LD_2;ID_nucdiff=SNP_11;Name=insertion;old_len=1;new_len=0;old_seq=G;new_seq=.;color=#EE0000
NODE_1  NucMerge_v1.0   SO:0000667  233592  233601  .   .   .   ID=LD_3;ID_nucdiff=SNP_27;Name=inserted_gap;old_len=10;new_len=0;old_seq=NNNNNNNNNN;new_seq=.;color=#EE0000
##sequence-region   NODE_2  1   211125
NODE_2  NucMerge_v1.0   SO:1000035  139350  139382  .   .   .   ID=LD_4;ID_nucdiff=SV_21;Name=duplication;old_len=33;new_len=0;old_seq=CCCGGGAGCATAGATAACTATGTGACCGGGGTG;new_seq=.;color=#EE0000
NODE_2  NucMerge_v1.0   SO:0000159  173435  173435  .   .   .   ID=LD_5;ID_nucdiff=SV_33;Name=collapsed_tandem_repeat;old_len=0;new_len=20;old_seq=.;new_seq=AGCCAGCGGCTGTTTGTCAG;color=#0000EE
...

5.2 ‹Prefix›_structural_differences.gff

The file contains information about inversion errors and structural breakpoints corresponding to inter- and intra-chromosomal rearrangement errors detected in the target assembly.

The following information is contained in the file:

The description of the query_struct.gff file produced by NucDiff and all possible error types can be found at https://github.com/uio-cels/NucDiff/wiki.

The ‹Prefix›_structural_differences.gff file example:

##gff-version 3
##sequence-region   NODE_1  1   617
NODE_1  NucMerge_v1.0   SO:0000699  331 430 .   .   .   ID=SD_1;Name=breakpoint;ID_nucdiff=SV_149;Type_nucdiff=translocation-inserted_gap;color=#0000EE
##sequence-region   NODE_2  1   4763
NODE_2  NucMerge_v1.0   SO:0000699  4478    4478    .   .   .   ID=SD_2;Name=breakpoint;ID_nucdiff=SV_174;Type_nucdiff=reshuffling-part_1_gr_0;color=#0000EE
##sequence-region   NODE_3  1   208973
NODE_3  NucMerge_v1.0   SO:1000036  418 1022    .   .   .   ID=SD_3;Name=inversion;ID_nucdiff=SV_317;Type_nucdiff=inversion;color=#EE0000
NODE_3  NucMerge_v1.0   SO:0000699  71741   71926   .   .   .   ID=SD_4;Name=breakpoint;ID_nucdiff=SV_2577;Type_nucdiff=translocation-inserted_gap;color=#0000EE
NODE_3  NucMerge_v1.0   SO:0000699  110857  110857  .   .   .   ID=SD_5;Name=breakpoint;ID_nucdiff=SV_2629;Type_nucdiff=reshuffling-part_2_gr_1;color=#0000EE
NODE_3  NucMerge_v1.0   SO:0000699  110857  110857  .   .   .   ID=SD_6;Name=breakpoint;ID_nucdiff=SV_2630;Type_nucdiff=inversion;color=#0000EE
...

5.3 ‹Prefix›_nucmerge_asmb.fasta

The file contains the resulted assembly obtained from the target assembly by (1) correcting inversion errors and errors listed in ‹Prefix›_local_differences.gff and (2) splitting target assembly sequences in the regions contained breakpoints from ‹Prefix›_structural_differences.gff.

6 Citing NucMerge

To cite your use of NucMerge in your publication :

Khelik K., et al. NucMerge: Genome assembly quality improvement assisted by alternative assemblies and paired-end Illumina reads. (in preparation)