ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0
504 stars 92 forks source link

Wtdbg2 assembly causes duplicated and flipped region in genome #124

Closed aakanshajr closed 4 years ago

aakanshajr commented 5 years ago

Hi, I'm new to Github and wtdbg2. I have assembled a bacterial genome of size 6.3m using the command:

!/bin/bash

/path/to/wtdbg2 -x ont -g 6.3m -t 16 -i /path/to/del_pobA_ICE_nanopore.fastq -fo prefix /path/to/wtpoa-cns -t 16 -i prefix.ctg.lay.gz -fo prefix.ctg.fa

I run this using Slurm with the command sbatch --cpus-per-task=10 --mem=20g --time=1:00:00

I get an assembly of two contigs

ctg1 len=6399902 ctg2 len=44811

I am using sequencing to identify mutations in my current strain for comparison to the parent strain. I am interested in a 130kb region which seems to be present however the first 15kb are flipped and also repeated about 10-15kb downstream of my 130kb sequence of interest. This is the output I get:

-- total memory 131906316.0 kB -- available 125913020.0 kB -- 24 cores -- Starting program: /home/ajr206/wtdbg2/wtdbg2 -x ont -g 6.3m -t 16 -i /scratch/ajr206/del_pobA_ICE_nanopore.fastq -fo prefix -- pid 4726 -- date Sun Jun 16 19:07:43 2019

[Sun Jun 16 19:07:43 2019] loading reads ^M0^M10000^M20000^M30000^M40000^M50000^M60000^M70000^M80000^M90000^M100000^M110000^M120000^M130000^M140000^M150000^M160000^M162819 reads [Sun Jun 16 19:07:52 2019] filtering from 162819 reads (>=5000 bp), 3049779269 bp. Try selecting 315000000 bp [Sun Jun 16 19:07:52 2019] Done, 5131 reads (>=5000 bp), 315000064 bp, 1230270 bins PROC_STAT(0) : real 9.521 sec, user 9.340 sec, sys 1.940 sec, maxrss 842356.0 kB, maxvsize 1003780.0 kB [Sun Jun 16 19:07:52 2019] Set --edge-cov to 3 KEY PARAMETERS: -k 15 -p 0 -K 1000.049988 -A -S 2.000000 -s 0.050000 -g 6300000 -X 50.000000 -e 3 -L 5000 [Sun Jun 16 19:07:52 2019] generating nodes, 16 threads [Sun Jun 16 19:07:52 2019] indexing bins[(0,1230270)/1230270] (314949120/3028952576 bp), 16 threads [Sun Jun 16 19:07:52 2019] - scanning kmers (K15P0S2.00) from 1230270 bins ^M0^M100000^M200000^M300000^M400000^M500000^M600000^M700000^M800000^M900000^M1000000^M1100000^M1200000^M1230270 bins

** 1 - 201 ** Quatiles: 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 1 1 2 3 6 14 20 28 38 48 PROC_STAT(0) : real 17.349 sec, user 57.180 sec, sys 3.860 sec, maxrss 2225480.0 kB, maxvsize 3538424.0 kB [Sun Jun 16 19:08:00 2019] - high frequency kmer depth is set to 1000 [Sun Jun 16 19:08:00 2019] - Total kmers = 51621862 [Sun Jun 16 19:08:00 2019] - average kmer depth = 6 [Sun Jun 16 19:08:00 2019] - 32812768 low frequency kmers (<2) [Sun Jun 16 19:08:00 2019] - 34 high frequency kmers (>1000) [Sun Jun 16 19:08:00 2019] - indexing 18809060 kmers, 114320435 instances (at most) ^M0^M100000^M200000^M300000^M400000^M500000^M600000^M700000^M800000^M900000^M1000000^M1100000^M1200000^M1230270 bins [Sun Jun 16 19:08:08 2019] - indexed 18809060 kmers, 114307007 instances [Sun Jun 16 19:08:08 2019] - masked 357 bins as closed [Sun Jun 16 19:08:08 2019] - sorting PROC_STAT(0) : real 25.266 sec, user 122.560 sec, sys 5.240 sec, maxrss 2225480.0 kB, maxvsize 3538424.0 kB [Sun Jun 16 19:08:08 2019] Done ^M0|0^M2000|274181^M4000|391750^M5130 reads|total hits 408303 PROC_STAT(0) : real 65.049 sec, user 512.730 sec, sys 10.370 sec, maxrss 2330992.0 kB, maxvsize 4858748.0 kB [Sun Jun 16 19:08:48 2019] sorting rdhits ... Done [Sun Jun 16 19:08:48 2019] clipping ... 1.45% bases [Sun Jun 16 19:08:48 2019] generating regs ... 16915346 [Sun Jun 16 19:08:49 2019] sorting regs ... Done [Sun Jun 16 19:08:49 2019] generating intervals ... 301844 intervals [Sun Jun 16 19:08:49 2019] selecting important intervals from 301844 intervals [Sun Jun 16 19:08:52 2019] Intervals: kept 6390, discarded 295454 PROC_STAT(0) : real 69.155 sec, user 522.740 sec, sys 11.190 sec, maxrss 2330992.0 kB, maxvsize 4858748.0 kB [Sun Jun 16 19:08:52 2019] Done, 6390 nodes [Sun Jun 16 19:08:52 2019] output "prefix.1.nodes". Done. [Sun Jun 16 19:08:52 2019] median node depth = 47 [Sun Jun 16 19:08:52 2019] masked 0 high coverage nodes (>200 or <3) [Sun Jun 16 19:08:52 2019] masked 2 repeat-like nodes by local subgraph analysis [Sun Jun 16 19:08:52 2019] generating edges [Sun Jun 16 19:08:53 2019] Done, 469827 edges [Sun Jun 16 19:08:53 2019] output "prefix.1.reads". Done. [Sun Jun 16 19:08:53 2019] output "prefix.1.dot.gz". Done. [Sun Jun 16 19:08:54 2019] graph clean [Sun Jun 16 19:08:54 2019] rescued 0 low cov edges [Sun Jun 16 19:08:54 2019] deleted 0 binary edges [Sun Jun 16 19:08:54 2019] deleted 1 isolated nodes [Sun Jun 16 19:08:54 2019] cut 1521 transitive edges [Sun Jun 16 19:08:54 2019] output "prefix.2.dot.gz". Done. [Sun Jun 16 19:08:54 2019] 78 bubbles; 2 tips; 1 yarns; rescued 77 high edges [Sun Jun 16 19:08:54 2019] deleted 337 isolated nodes [Sun Jun 16 19:08:54 2019] output "prefix.3.dot.gz". Done. [Sun Jun 16 19:08:54 2019] cut 10 branching nodes [Sun Jun 16 19:08:54 2019] deleted 2 isolated nodes [Sun Jun 16 19:08:54 2019] building unitigs [Sun Jun 16 19:08:54 2019] TOT 6318848, CNT 5, AVG 1263770, MAX 6139904, N50 6139904, L50 1, N90 53248, L90 2, Min 30464 [Sun Jun 16 19:08:54 2019] output "prefix.frg.nodes". Done. [Sun Jun 16 19:08:54 2019] generating links [Sun Jun 16 19:08:54 2019] generated 11 links [Sun Jun 16 19:08:54 2019] output "prefix.frg.dot.gz". Done. [Sun Jun 16 19:08:54 2019] rescue 0 weak links [Sun Jun 16 19:08:54 2019] deleted 7 binary links [Sun Jun 16 19:08:54 2019] cut 1 transitive links [Sun Jun 16 19:08:54 2019] remove 0 boomerangs [Sun Jun 16 19:08:54 2019] remove 1 weak branches [Sun Jun 16 19:08:54 2019] cut 0 tips [Sun Jun 16 19:08:54 2019] pop 0 bubbles [Sun Jun 16 19:08:54 2019] detached 1 repeat-associated paths [Sun Jun 16 19:08:54 2019] cut 0 tips [Sun Jun 16 19:08:54 2019] output "prefix.ctg.dot.gz". Done. [Sun Jun 16 19:08:54 2019] building contigs [Sun Jun 16 19:08:54 2019] searched 2 contigs [Sun Jun 16 19:08:54 2019] Estimated: TOT 6367488, CNT 2, AVG 3183744, MAX 6317312, N50 6317312, L50 1, N90 50176, L90 2, Min 50176 [Sun Jun 16 19:09:02 2019] output 2 contigs [Sun Jun 16 19:09:02 2019] Program Done PROC_STAT(TOTAL) : real 79.268 sec, user 592.350 sec, sys 11.810 sec, maxrss 2330992.0 kB, maxvsize 4858748.0 kB

-- -- total memory 131906316.0 kB -- available 125913628.0 kB -- 24 cores -- Starting program: /home/ajr206/wtdbg2/wtpoa-cns -t 16 -i prefix.ctg.lay.gz -fo prefix.ctg.fa -- pid 5092 -- date Sun Jun 16 19:09:02 2019

^M1 contigs 100 edges 0 bases^M1 contigs 200 edges 0 bases^M1 contigs 300 edges 0 bases^M1 contigs 400 edges 0 bases^M1 contigs 500 edges 0 bases^M1 contigs 600 edges 0 bases^M1 contigs 700 edges 0 bases^M1 contigs 800 edges 0 bases^M1 contigs 900 edges 0 bases^M1 contigs 1000 edges 0 bases^M1 contigs 1100 edges 0 bases^M1 contigs 1200 edges 0 bases^M1 contigs 1300 edges 0 bases^M1 contigs 1400 edges 0 bases^M1 contigs 1500 edges 0 bases^M1 contigs 1600 edges 0 bases^M1 contigs 1700 edges 0 bases^M1 contigs 1800 edges 0 bases^M1 contigs 1900 edges 0 bases^M1 contigs 2000 edges 0 bases^M1 contigs 2100 edges 0 bases^M1 contigs 2200 edges 0 bases^M1 contigs 2300 edges 0 bases^M1 contigs 2400 edges 0 bases^M1 contigs 2500 edges 0 bases^M1 contigs 2600 edges 0 bases^M1 contigs 2700 edges 0 bases^M1 contigs 2800 edges 0 bases^M1 contigs 2900 edges 0 bases^M1 contigs 3000 edges 0 bases^M1 contigs 3100 edges 0 bases^M1 contigs 3200 edges 0 bases^M1 contigs 3300 edges 0 bases^M1 contigs 3400 edges 0 bases^M1 contigs 3500 edges 0 bases^M1 contigs 3600 edges 0 bases^M1 contigs 3700 edges 0 bases^M1 contigs 3800 edges 0 bases^M1 contigs 3900 edges 0 bases^M1 contigs 4000 edges 0 bases^M1 contigs 4100 edges 0 bases^M1 contigs 4200 edges 0 bases^M1 contigs 4300 edges 0 bases^M1 contigs 4400 edges 0 bases^M1 contigs 4500 edges 0 bases^M1 contigs 4600 edges 0 bases^M1 contigs 4700 edges 0 bases^M1 contigs 4800 edges 0 bases^M1 contigs 4900 edges 0 bases^M1 contigs 5000 edges 0 bases^M1 contigs 5100 edges 0 bases^M1 contigs 5200 edges 0 bases^M1 contigs 5300 edges 0 bases^M1 contigs 5400 edges 0 bases^M1 contigs 5500 edges 0 bases^M1 contigs 5600 edges 0 bases^M1 contigs 5700 edges 0 bases^M1 contigs 5800 edges 0 bases^M1 contigs 5900 edges 0 bases^M1 contigs 6000 edges 0 bases^M2 contigs 6025 edges PROC_STAT(TOTAL) : real 99.274 sec, user 951.650 sec, sys 7.310 sec, maxrss 1545128.0 kB, maxvsize 2747852.0 kB

Please advise on what changes I can make to improve my alignment. I have reduced S to 1, changed -p 13 -k 9 -S 2 and -p 19 -k 3 and -k 19 -p 3 -S 1; I have also tried using a subset of the reads and Canu corrected reads with the corr preset but nothing gives me to expected alignment. Thank you!

aakanshajr commented 5 years ago

I am sorry I don't know how to use github well. Hope you can help with my above questions @ruanjue Thank you!

ruanjue commented 5 years ago

Polish the contigs with wtdbg2 or other tools. If uses wtdbg2, please have a look at https://github.com/ruanjue/wtdbg2#getting-started

# polish consensus, not necessary if you want to polish the assemblies using other tools
minimap2 -t16 -ax map-pb -r2k dbg.raw.fa reads.fa.gz | samtools sort -@4 >dbg.bam
samtools view -F0x900 dbg.bam | ./wtpoa-cns -t 16 -d dbg.raw.fa -i - -fo dbg.cns.fa
aakanshajr commented 5 years ago

I polished it using medaka after the assembly using wtdbg2. I will try to re-polish it using wtdbg2. Please let me know if you have any other suggestions. Thank you!

aakanshajr commented 5 years ago

Hi @ruanjue I ran wtdbg2 on the assembly as suggested and the issue isn't resolved. I ran the command:

!/bin/bash

minimap2 -t16 -ax map-ont -r2k /path/to/prefix.ctg.fa /path/to/del_pobA_ICE_nanopore.fastq | samtools sort -@4 >dbg.bam samtools view -F0x900 dbg.bam | wtpoa-cns -t 16 -d /path/to/prefix.ctg.fa -i - -fo dbg.cns.fa

and got the output: [M::mm_idx_gen::0.3550.70] collected minimizers [M::mm_idx_gen::0.4371.36] sorted minimizers [M::main::0.4371.36] loaded/built the index for 2 target sequence(s) [M::mm_mapopt_update::0.4591.34] mid_occ = 10 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 2 [M::mm_idx_stat::0.4761.33] distinct minimizers: 1121039 (94.92% are singletons); average occurrences: 1.074; average spacing: 5.354 [M::worker_pipeline::61.0536.56] mapped 40976 sequences [M::worker_pipeline::91.7247.60] mapped 41257 sequences [M::worker_pipeline::143.0877.92] mapped 40728 sequences [M::worker_pipeline::151.8607.88] mapped 41058 sequences [M::worker_pipeline::224.2927.61] mapped 41030 sequences [M::worker_pipeline::233.0247.64] mapped 40893 sequences [M::worker_pipeline::241.4577.60] mapped 19444 sequences [M::main] Version: 2.16-r922 [M::main] CMD: /home/ajr206/minimap2/minimap2-2.16_x64-linux/minimap2 -t16 -ax map-ont -r2k /scratch/ajr206/wtdbg2/wtdbg2_1/prefix.ctg.fa /scratch/ajr206/del_pobA_ICE_nanopore.fastq [M::main] Real time: 241.514 sec; CPU: 1836.332 sec; Peak RSS: 8.396 GB [bam_sort_core] merging from 12 files...

-- total memory 131993804.0 kB -- available 125237908.0 kB -- 16 cores -- Starting program: /home/ajr206/wtdbg2/wtpoa-cns -t 16 -d /scratch/ajr206/wtdbg2/wtdbg2_1/prefix.ctg.fa -i - -fo dbg.cns.fa -- pid 29365 -- date Tue Jun 25 23:06:25 2019

^M1 contigs 100 edges 0 bases^M1 contigs 200 edges 0 bases^M1 contigs 300 edges 0 bases^M^M1 contigs 400 edges 0 bases^M1 contigs 500 edges 0 bases^M1 contigs 600 edges 0 bases^M^M1 contigs 700 edges 0 bases^M1 contigs 800 edges 0 bases^M1 contigs 900 edges 0 bases^M^M1 contigs 1000 edges 0 bases^M1 contigs 1100 edges 0 bases^M1 contigs 1200 edges 0 bases^M1 contigs 1300 edges 0 bases^M1 contigs 1400 edges 0 bases^M1 contigs 1500 edges 0 bases^M1 contigs 1600 edges 0 bases^M1 contigs 1700 edges 0 bases^M1 contigs 1800 edges 0 bases^M1 contigs 1900 edges 0 bases^M1 contigs 2000 edges 0 bases^M1 contigs 2100 edges 0 bases^M1 contigs 2200 edges 0 bases^M1 contigs 2300 edges 0 bases^M1 contigs 2400 edges 0 bases^M1 contigs 2500 edges 0 bases^M1 contigs 2600 edges 0 bases^M1 contigs 2700 edges 0 bases^M1 contigs 2800 edges 0 bases^M1 contigs 2900 edges 0 bases^M1 contigs 3000 edges 0 bases^M1 contigs 3100 edges 0 bases^M1 contigs 3200 edges 0 bases^M1 contigs 3300 edges 0 bases^M1 contigs 3400 edges 0 bases^M1 contigs 3500 edges 0 bases^M1 contigs 3600 edges 0 bases^M1 contigs 3700 edges 0 bases^M1 contigs 3800 edges 0 bases^M1 contigs 3900 edges 0 bases^M1 contigs 4000 edges 0 bases^M1 contigs 4100 edges 0 bases^M1 contigs 4200 edges 0 bases^M2 contigs 4300 edges PROC_STAT(TOTAL) : real 129.582 sec, user 1120.440 sec, sys 11.680 sec, maxrss 799180.0 kB, maxvsize 2329336.0 kB

Since I have a region duplicated and reversed I think this should be a problem with the assembly and not polishing. I'd be grateful for any suggestions on parameters I can change? Thank you for your time and a great program!

ruanjue commented 5 years ago

Since you know the reversed localtion and have mapped raw reads to contigs, please have a look at the reads alignments. BTW, as a circled genome, sequence alignment SHOULD aware of the breakpoint of circle.

aakanshajr commented 5 years ago

Do you mean I should manually look at the vcf and edit the final sequence? what would be the best way to check the read alignments? Do you suggest I try to use something like Circlator (http://sanger-pathogens.github.io/circlator/) Thanks for you help!

ruanjue commented 5 years ago

minimap2 can give PAF, which can be read manually.

aakanshajr commented 5 years ago

Hi @ruanjue I generated a paf file but I am not sure how to visualize the data. I generated vcf files and visualized them with ribbon to give me the following alignments.

It tells me that a part of the read aligns in the forward orientation (blue) and a part of it aligns in the reverse orientation (red). I think this means a misalignment. Do you have any suggestion on how to avoid this.

Or maybe I can pull out all the reads that align to my mobile genetic element and assemble it separately (because I think the issue is because they may be a part of a mobile genetic element)

Thanks! Screen Shot 2019-07-18 at 10 35 05 AM

ruanjue commented 5 years ago

First, please make sure that it is not caused by circle genome which can give a breakpoint when aligning the linear contig sequence.

aakanshajr commented 5 years ago

Do you have any advice on how I can check that? When I annotated the assembly the proteins in this region do not match the proteins at the other end.

ruanjue commented 5 years ago

Please read https://github.com/rrwick/Long-read-assembler-comparison#assessing-chromosome-contiguity, ryan wick described it in detail.