rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0
535 stars 132 forks source link

Does unicycler.log contains information about genome coverage? #323

Open FA387 opened 10 months ago

FA387 commented 10 months ago

Hi I am new to bioinfroamtics.

Does the Unicycler log contains information that said about genome coverage. I want to know since it is required by NCBI when submitting the genome. Previously I used flye, the genome coverage was shown in the flye log.

Starting Unicycler (2023-08-09 15:05:21)

Welcome to Unicycler, an assembly pipeline for bacterial genomes. Since you provided both short and long reads, Unicycler will perform a hybrid assembly. It will first use SPAdes to make a short-read assembly graph, and then it will use various methods to scaffold that graph with the long reads.
For more information, please see https://github.com/rrwick/Unicycler

Command: /data/users/avi123/.conda/envs/unicycler/bin/unicycler -1 MNP_B10_Aucticoccus_1.fastq.gz -2 MNP_B10_Aucticoccus__2.fastq.gz -l chop_b10t2.fastq -o MNP_M23-Aucticoccus_unicycler_hybrid

Unicycler version: v0.5.0 Using 8 threads

The output directory already exists: /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid

Dependencies: Program Version Status spades.py 3.15.5 good
racon 1.5.0 good
makeblastdb 2.14.0+ good
tblastn 2.14.0+ good

Choosing k-mer range for assembly (2023-08-09 15:06:25)

Unicycler chooses a k-mer range for SPAdes based on the length of the input reads. It uses a wide range of many k-mer sizes to maximise the chance of finding an ideal assembly.

SPAdes maximum k-mer: 127 Median read length: 151 K-mer range: 27, 53, 71, 87, 99, 111, 119, 127

SPAdes assemblies (2023-08-09 15:07:30)

Unicycler now uses SPAdes to assemble the short reads. It scores the assembly graph for each k-mer using the number of contigs (fewer is better) and the number of dead ends (fewer is better). The score function is 1/(c*(d+2)), where c is the contig count and d is the dead end count.

spades.py -o /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/spades_assembly -k 27 --threads 8 --isolate -1 /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_B10_Aucticoccus_1.fastq.gz -2 /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_B10_Aucticoccus__2.fastq.gz -m 1024

spades.py -o /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/spades_assembly -k 27,53 --threads 8 --restart-from k27 -m 1024

spades.py -o /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/spades_assembly -k 27,53,71 --threads 8 --restart-from k53 -m 1024

spades.py -o /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/spades_assembly -k 27,53,71,87 --threads 8 --restart-from k71 -m 1024

spades.py -o /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/spades_assembly -k 27,53,71,87,99 --threads 8 --restart-from k87 -m 1024

spades.py -o /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/spades_assembly -k 27,53,71,87,99,111 --threads 8 --restart-from k99 -m 1024

spades.py -o /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/spades_assembly -k 27,53,71,87,99,111,119 --threads 8 --restart-from k111 -m 1024

spades.py -o /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/spades_assembly -k 27,53,71,87,99,111,119,127 --threads 8 --restart-from k119 -m 1024

K-mer Contigs Dead ends Score
27 too complex 53 237 0 2.11e-03 71 155 0 3.23e-03 87 119 0 4.20e-03 99 90 0 5.56e-03 111 70 0 7.14e-03 119 65 0 7.69e-03 <-best 127 65 0 7.69e-03

Read depth filter: removed 89 contigs totalling 28272 bp Deleting /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/spades_assembly/

Determining graph multiplicity (2023-08-09 17:17:58)

Multiplicity is the number of times a sequence occurs in the underlying sequence. Single-copy contigs (those with a multiplicity of one, occurring only once in the underlying sequence) are particularly useful.

Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/002_depth_filter.gfa

Cleaning graph (2023-08-09 17:17:58)

Unicycler now performs various cleaning procedures on the graph to remove overlaps and simplify the graph structure. The end result is a graph ready for bridging.

Graph overlaps removed

Removed zero-length segments: 63

Merged small segments: 51, 52, 54, 55, 56, 57, 58, 59, 60, 61, 62, 64

Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/003_overlaps_removed.gfa

Unicycler now selects a set of anchor contigs from the single-copy contigs. These are the contigs which will be connected via bridges to form the final assembly.

24 anchor segments (5,102,105 bp) out of 52 total segments (5,136,915 bp)

Creating SPAdes contig bridges (2023-08-09 17:17:59)

SPAdes uses paired-end information to perform repeat resolution (RR) and produce contigs from the assembly graph. SPAdes saves the graph paths corresponding to these contigs in the contigs.paths file. When one of these paths contains two or more anchor contigs, Unicycler can create a bridge from the path.

                               Bridge

Start Path End quality 3 40 4 63.2 8 44 -> -35 -> 44 12 19.3 9 -40 -20 62.5 21 -43 20 61.3

Creating loop unrolling bridges (2023-08-09 17:17:59)

When a SPAdes contig path connects an anchor contig with the middle contig of a simple loop, Unicycler concludes that the sequences are contiguous (i.e. the loop is not a separate piece of DNA). It then uses the read depth of the middle and repeat contigs to guess the number of times to traverse the loop and makes a bridge.

                              Loop count   Loop count    Loop    Bridge

Start Repeat Middle End by repeat by middle count quality -12 -44 35 -8 0.79 0.98 1 43.5

Loading reads (2023-08-09 17:17:59)

136,900 / 136,900 (100.0%) - 852,900,591 bp

Assembling contigs and long reads with miniasm (2023-08-09 17:18:02)

Unicycler uses miniasm to construct a string graph assembly using both the short read contigs and the long reads. It will then use the resulting string graph to produce bridges between contigs. This method requires decent coverage of long reads and therefore may not be fruitful if long reads are sparse. However, it does not rely on the short read assembly graph having good connectivity and is able to bridge an assembly graph even when it contains many dead ends.
Unicycler uses two types of "reads" as assembly input: anchor contigs from the short-read assembly and actual long reads which overlap two or more of these contigs. It then assembles them with miniasm.

Aligning long reads to graph using minimap

Saving to /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/miniasm_assembly/01_assembly_reads.fastq: 20 short-read contigs 5,344 long reads

Finding overlaps with minimap... success 861,662 overlaps

Assembling reads with miniasm... success 167 segments, 166 links

Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/miniasm_assembly/11_branching_paths_removed.gfa Merging segments into unitigs: 5 circular unitigs 1 linear unitig total size = 5,105,379 bp Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/miniasm_assembly/12_unitig_graph.gfa

Polishing miniasm assembly with Racon (2023-08-09 17:18:55)

Unicycler now uses Racon to polish the miniasm assembly. It does multiple rounds of polishing to get the best consensus. Circular unitigs are rotated between rounds such that all parts (including the ends) are polished well.

Saving to /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/miniasm_assembly/racon_polish/polishing_reads.fastq: 20 short-read contigs 136,900 long reads

Polish Assembly Mapping round size quality begin 5,105,379 125,681.13 1 5,105,190 127,186.15

Best polish: /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/miniasm_assembly/racon_polish/006_rotated.fasta Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/miniasm_assembly/13_racon_polished.gfa Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/004_long_read_assembly.gfa

Contigs in the short-read assembly graph which end in dead ends may contain bogus sequence near the dead end. Unicycler therefore uses the read clipping values from the miniasm assembly to trim these dead ends to only the parts which aligned well to long reads.

No dead ends required trimming.

Unicycler now places the single copy contigs back into the unitig graph. This serves two purposes: a) it replaces long read assembly sequences (which may be error prone) with Illumina assembly sequence (which is probably quite accurate), improving the assembly quality, and b) it defines inter-contig sequences for use in building bridges.

Searching for contigs using 5,000 bp of contig ends.

Contig Result Start pos End pos Strand 1 found in unitig 1 2521370 3222095 + 2 found in unitig 1 1866242 2515677 + 3 found in unitig 1 -265163 341266 + 4 found in unitig 1 341415 925987 + 5 found in unitig 1 3869863 4284079 - 6 found in unitig 1 927184 1338127 - 7 found in unitig 1 3321239 3731692 + 8 found in unitig 1 1464199 1865045 - 9 found in unitig 1 4494427 4680217 - 10 found in unitig 1 4285276 4431772 - 11 found in unitig 1 3732889 3862694 + 12 found in unitig 1 1339324 1463277 - 13 found in unitig 1 3250706 3320042 + 16 found in unitig 3 -223 28517 - 18 found in unitig 1 3227316 3248352 - 19 found in unitig 1 4436994 4456821 + 20 found in unitig 1 4475077 4494278 + 21 found in unitig 1 4458018 4474948 + 22 found in unitig 4 -5237 11544 + 45 not found

Searching for contigs using 2,500 bp of contig ends.

Contig Result Start pos End pos Strand 45 not found

Searching for contigs using 1,000 bp of contig ends.

Contig Result Start pos End pos Strand 45 not found

Searching for contigs using 500 bp of contig ends.

Contig Result Start pos End pos Strand 45 not found

Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/miniasm_assembly/14_contigs_placed.gfa

Creating miniasm/Racon bridges (2023-08-09 17:25:01)

Now that the miniasm/Racon string graph is complete, Unicycler will use it to build bridges between anchor segments.

   Start -> end   Best path                                 Quality

1/19 3 -> 4 40 98.534 2/19 7 -> 11 -30 89.821 3/19 11 -> -5 -30, -37, -49, -41, 48, 28, 46, 36, 50, 64.088 -29
4/19 -5 -> -10 -30 89.804 5/19 -10 -> 19 -38, 33, -26 69.109 6/19 19 -> 21 -30 87.662 7/19 21 -> 20 -43 95.696 8/19 20 -> -9 40 97.329 9/19 -9 -> 3 29, -51, -36, -47, -28, 39, 41 68.238 10/19 -16 -> -16 -30, 25, 42 64.598 11/19 4 -> -6 -30 89.696 12/19 22 -> 22 -32, 27, -32, 24, 42 50.621 13/19 -6 -> -12 30 89.800 14/19 -12 -> -8 -44, 35, -44 91.793 15/19 -8 -> 2 -30 89.799 16/19 2 -> 1 29, -51, -36, -47, -28, -48, 41, 49 68.368 17/19 1 -> -18 -38, 34, -26 69.687 18/19 -18 -> 13 -43, 31, -30 81.682 19/19 13 -> 7 -30 89.784

Creating simple long read bridges (2023-08-09 17:25:03)

Unicycler uses long read alignments (from minimap) to resolve simple repeat structures in the graph. This takes care of some "low-hanging fruit" of the graph simplification.

Aligning long reads to graph using minimap

Two-way junctions are defined as cases where two graph contigs (A and B) join together (C) and then split apart again (D and E). This usually represents a simple 2-copy repeat, and there are two possible options for its resolution: (A->C->D and B->C->E) or (A->C->E and B->C->D). Each read which spans such a junction gets to "vote" for option 1, option 2 or neither. Unicycler creates a bridge at each junction for the most voted for option.

                                          Op. 1   Op. 2   Neither   Final    Bridge

Junction Option 1 Option 2 votes votes votes op. quality 40 3 -> 40 -> -9, 3 -> 40 -> 4, 0 243 1 2 98.4 20 -> 40 -> 4 20 -> 40 -> -9

Simple loops are parts of the graph where two contigs (A and B) are connected via a repeat (C) which loops back to itself (via D). It is possible to traverse the loop zero times (A->C->B), one time (A->C->D->C->B), two times (A->C->D->C->D->C->B), etc. Long reads which span the loop inform which is the correct number of times through. In this step, such reads are found and each is aligned against alternative loop counts. A reads casts its "vote" for the loop count it agrees best with, and Unicycler creates a bridge using the most voted for count.

                               Read                         Loop    Bridge

Start Repeat Middle End count Read votes count quality -12 -44 35 -8 103 1 loop: 102 votes 1 98.9 2 loops: 1 vote

Determining low score threshold (2023-08-09 17:25:45)

Before conducting semi-global alignment of the long reads to the assembly graph, Unicycler must determine a minimum alignment score threshold such that nonsense alignments are excluded. To choose a threshold automatically, it examines alignments between random sequences and selects a score a few standard deviations above the mean.

Automatically choosing a threshold using random alignment scores.

Random alignment mean score: 61.66 standard deviation: 1.31 Low score threshold: 61.66 + (7 x 1.31) = 70.86

Aligning reads (2023-08-09 17:25:51)

136,900 / 136,900 (100.0%)

Read alignment summary (2023-08-09 19:53:51)

Total read count: 136,900 Fully aligned reads: 134,935 Partially aligned reads: 1,344 Unaligned reads: 621 Total bases aligned: 852,220,970 bp Mean alignment identity: 97.6%

Deleting /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/read_alignment/

Building long read bridges (2023-08-09 19:53:52)

Unicycler uses the long read alignments to produce bridges between anchor segments. These bridges can be formed using as few as one long read, giving Unicycler the ability to bridge the graph even when long-read depth is low.

   Start -> end   Best path                                 Quality

1/20 6 -> -4 30 83.750 2/20 10 -> 5 30 83.580 3/20 7 -> 11 -30 83.734 4/20 19 -> 21 -30 80.270 5/20 12 -> 6 -30 83.654 6/20 -10 -> 19 -38, 33, -26 81.771 7/20 18 -> -1 26, -34, 38 82.756 8/20 -2 -> 8 30 83.674 9/20 2 -> 1 29, -51, -36, -47, -28, -48, 41, 49 83.825 10/20 13 -> 7 -30 83.749 11/20 20 -> -9 40 81.730 12/20 21 -> 20 -43 79.720 13/20 -3 -> 9 -41, -39, 28, 47, 36, 51, -29 83.049 14/20 3 -> 4 40 83.627 15/20 5 -> 3 29, -50, -36, -46, -28, -48, 41 18.233 16/20 8 -> 12 44, -35, 44 83.471 17/20 -13 -> 18 30, -31, 43 69.788 18/20 16 -> 16 -42, -25, 30 82.742 19/20 11 -> -5 -30, -37, -49, -41, 48, 28, 46, 36, 50, 83.482 -29
20/20 22 -> 22 -32, 27, -32, 24, 42 81.831

Applying bridges (2023-08-09 19:54:17)

Unicycler now applies to the graph in decreasing order of quality. This ensures that when multiple, contradictory bridges exist, the most supported option is used.

Bridge type Start -> end Path Quality simple long read 3 -> 4 40 99.580 simple long read -12 -> -8 -44, 35, -44 98.923 simple long read 20 -> -9 40 98.363 miniasm 21 -> 20 -43 95.696 miniasm 7 -> 11 -30 89.821 miniasm -5 -> -10 -30 89.804 miniasm -6 -> -12 30 89.800 miniasm -8 -> 2 -30 89.799 miniasm 13 -> 7 -30 89.784 miniasm 4 -> -6 -30 89.696 miniasm 19 -> 21 -30 87.662 long read 2 -> 1 29, -51, -36, -47, -28, -48, 41, 49 83.825 long read 11 -> -5 -30, -37, -49, -41, 48, 28, 46, 36, 50, 83.482 -29
long read -3 -> 9 -41, -39, 28, 47, 36, 51, -29 83.049 long read 18 -> -1 26, -34, 38 82.756 long read 16 -> 16 -42, -25, 30 82.742 long read 22 -> 22 -32, 27, -32, 24, 42 81.831 long read -10 -> 19 -38, 33, -26 81.771 miniasm -18 -> 13 -43, 31, -30 81.682

Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/005_bridges_applied.gfa

Bridged assembly graph (2023-08-09 19:54:18)

The assembly is now mostly finished and no more structural changes will be made. Ideally the assembly graph should now have one contig per replicon and no erroneous contigs (i.e. a complete assembly). If there are more contigs, then the assembly is not complete.

Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/006_final_clean.gfa

Component Segments Links Length N50 Longest segment Status
total 7 7 5,164,601 4,951,092 4,951,092
1 1 1 4,951,092 4,951,092 4,951,092 complete 2 1 1 56,789 56,789 56,789 complete 3 1 1 50,657 50,657 50,657 complete 4 1 1 35,767 35,767 35,767 complete 5 1 1 31,293 31,293 31,293 complete 6 1 1 24,077 24,077 24,077 complete 7 1 1 14,926 14,926 14,926 complete

Rotating completed replicons (2023-08-09 19:54:18)

Any completed circular contigs (i.e. single contigs which have one link connecting end to start) can have their start position changed without altering the sequence. For consistency, Unicycler now searches for a starting gene (dnaA or repA) in each such contig, and if one is found, the contig is rotated to start with that gene on the forward strand.

Segment Length Depth Starting gene Position Strand Identity Coverage 1 4,951,092 1.00x none found
2 56,789 1.46x none found
3 50,657 1.45x none found
4 35,767 1.32x none found
5 31,293 1.63x none found
6 24,077 1.95x none found
7 14,926 2.46x none found

Assembly complete (2023-08-09 19:58:21)

Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/assembly.gfa Saving /data/users/avi123/Genome_assembly/MNP-M23_B10_Aucticoccus/MNP_M23-Aucticoccus_unicycler_hybrid/assembly.fasta