The contiguity of the assembly depends primarily on the read length, repeat content of the genome, and heterozygosity. If these genomes have similar repeat content and you have relatively similar average read lengths, it could be the heterozygosity.
The higher repeat-cont fraction in the second genome indicates there are lots of reads of about 5-7 kbp that have extremely high coverage in overlaps (>1000), which could be either a very abundant repeat in the genome (the normal coverage is 35x) or a contaminant in the sample. The GFA output (asm.unitigging.gfa) should have more information; if the graph looks like it has lots of alternate paths, it is likely a heterozygous sample. I suggest running mash screen (http://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes) to see what is in the sample; it won't discriminate between similar strains but will identify mixtures of bacteria/viruses.
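For reference, a minimal sketch of a mash screen run against the pre-built RefSeq sketch from that tutorial; the sketch and read file names here are assumptions, not from this thread:

# refseq.genomes.k21s1000.msh is the pre-built RefSeq sketch from the mash tutorial (assumed name)
# reads.fastq is a placeholder for the raw read set
mash screen refseq.genomes.k21s1000.msh reads.fastq > screen.tab
sort -gr screen.tab | head   # top hits, highest identity first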
If it is the heterozygosity, you can try varying the unitigging parameters. Try the separation option from the FAQ ('batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50'). You could also try smashing the heterozygous genomes together (corOutCoverage=100 overlapper=mhap utgReAlign=true correctedErrorRate=0.20 'batOptions=-dg 50 -db 50 -dr 1 -ca 500 -cp 50'); this also turns on the faster overlapping algorithm, because the default will be slow at this high error rate. Keep in mind, though, that even if you can smash the assembly into a single contig, the consensus will likely be a mix of all the variation in your sample.
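For illustration, a sketch of how these two option sets might be passed on a full Canu run; the read file name, output directories, and genome size are assumptions:

# separation option: keep haplotypes apart during unitigging
canu -p asm -d asm-separate genomeSize=2.5m \
  'batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50' \
  -nanopore-raw reads.fastq.gz

# smash option: higher error rate plus mhap overlapping to collapse haplotypes
canu -p asm -d asm-smash genomeSize=2.5m \
  corOutCoverage=100 overlapper=mhap utgReAlign=true correctedErrorRate=0.20 \
  'batOptions=-dg 50 -db 50 -dr 1 -ca 500 -cp 50' \
  -nanopore-raw reads.fastq.gz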
Thanks so much for the input!
As you suggested, I tried mash screen. The results are very surprising! We're assembling the genome of Oxalobacter formigenes. For the data that yielded a good assembly, the highest/dominant hit is O. formigenes:
0.998305 965/1000 49 0 GCF_000158495.1_ASM15849v1_genomic.fna.gz [9 seqs] NZ_GG658178.1 Oxalobacter formigenes OXCC13 genomic scaffold supercont1.9, whole genome shotgun sequence [...]
0.886148 79/1000 2066 1.12003e-201 GCF_000902695.1_ViralProj183144_genomic.fna.gz NC_019711.1 Enterobacteria phage HK629, complete genome
0.872481 57/1000 74 1.14783e-137 ref|NZ_CP006375.1| Aureimonas sp. AU20 plasmid pAU20rrn, complete sequence
0.837273 24/1000 34 9.10938e-50 ref|NZ_CP013748.1| Arthrobacter sulfonivorans strain
In contrast, the data that yielded a poor assembly had its highest hits on a phage and plasmids:
0.889737 86/1000 5873 1.03654e-241 GCF_000903575.1_ViralProj183142_genomic.fna.gz NC_019723.1 Enterobacteria phage HK630, complete genome
0.870998 55/1000 40 3.49853e-144 ref|NZ_CP006375.1| Aureimonas sp. AU20 plasmid pAU20rrn, complete sequence
0.859885 42/1000 1 1.76113e-105 ref|NC_003789.1| Klebsiella sp. KCL-2 plasmid pMGD2, complete sequence
0.841982 27/1000 11 3.83478e-63 GCF_000158475.2_Oxal_for_HOxBLS_2_V2_genomic.fna.gz [2 seqs] NZ_KI392030.1 Oxalobacter formigenes HOxBLS genomic scaffold supercont2.1, whole genome shotgun sequence [...]
0.837273 24/1000 19 4.58581e-55 ref|NZ_CP013748.1| Arthrobacter sulfonivorans strain Ar51 plasmid, complete sequence
So I guess in this case, I should try to discard those reads first and then re-run Canu.
Thanks so much!
The other interesting thing is that the top hit shows 99% identity to Oxalobacter, whereas the second sample is only 84%. The median multiplicity is also lower, down to 11 from 49. This multiplicity is measured using perfect k-mers, so it is lower than the true coverage due to sequencing error. Still, this implies there is either lower coverage of your target genome here or it is very diverged from what is in the DB. I would guess the super-high-multiplicity Enterobacteria hits are lambda contamination. Removing the contamination can help; I'd also increase corOutCoverage to 200 just to make sure you don't lose any data from your target genome. If you're able to share the reads (see the FAQ for instructions on sending them to us using FTP), we can take a look at the data here.
Thanks for pointing that out. You're exactly right. We believe we're assembling a novel strain that has not been characterized before, whose genome differs from the currently available ones, even within the same species.
I've put a 0.44 GB seq.fna under the incoming/sergek directory. Those are the Nanopore 1D reads from the genome with the potential lambda contamination.
Many thanks!
Sorry for the delayed reply. After looking at the dataset, there does seem to be some variation in the sample that is preventing a more contiguous assembly. You can probably get a more contiguous assembly by using the heterozygous-smash parameters from the FAQ and increasing the error rate to 0.25 from the default (though it will be slower). Another option is to assemble the Canu-corrected reads with something like smartdenovo, which is more willing to collapse haplotypes.
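For illustration, rough sketches of both options; the read file names, output directories, and thread count are placeholders, and the smartdenovo lines follow its makefile-wrapper usage:

# option 1: re-run Canu at a higher assembly error rate (slower)
canu -p asm -d asm-er25 genomeSize=2.5m correctedErrorRate=0.25 -nanopore-raw reads.fastq.gz

# option 2: assemble the Canu-corrected/trimmed reads with smartdenovo
gunzip -k asm.trimmedReads.fasta.gz
smartdenovo.pl -p sd_asm -t 8 -c 1 asm.trimmedReads.fasta > sd_asm.mak   # -c 1 enables the consensus step
make -f sd_asm.mak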
No problem at all. I confirmed the presence of a phage by aligning my reads to Enterobacteria phage HK630. The 10 kb tail region of the phage has coverage of 40,000X, which suggests the phage in our sample shares a similar sequence in the tail but has a divergent head region.
I am trying to remove those reads and re-assemble. I'll post the results when I have them.
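For illustration, one way the phage-matching reads could be removed before re-assembly, keeping only reads that do not align to the HK630 reference; the file names are assumptions:

# map reads to the phage reference and keep only the unmapped ones (SAM flag 4)
minimap2 -ax map-ont phage_HK630.fasta reads.fastq.gz | samtools fastq -f 4 - > reads.nophage.fastq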
Closing, inactive.
Sorry for the delayed response.
As I said in my last comment, 39.97% of my reads partially mapped to an Enterobacteria phage. I removed those ~40% of reads and reassembled the genome with the default Canu parameters for Nanopore data. Unfortunately, this did not improve the assembly.
--
-- In gatekeeper store 'correction/oxk_reanalyze_filt_phageHK630_cov40.gkpStore':
-- Found 38120 reads.
-- Found 206984294 bases (83.12 times coverage).
--
-- Read length histogram (one '*' equals 328.68 reads):
-- 0 4999 23008 **********************************************************************
-- 5000 9999 10406 *******************************
-- 10000 14999 2918 ********
-- 15000 19999 1096 ***
-- 20000 24999 423 *
-- 25000 29999 157
-- 30000 34999 57
-- 35000 39999 24
-- 40000 44999 11
-- 45000 49999 10
-- 50000 54999 3
-- 55000 59999 2
-- 60000 64999 1
-- 65000 69999 0
-- 70000 74999 0
-- 75000 79999 2
-- 80000 84999 0
-- 85000 89999 0
-- 90000 94999 0
-- 95000 99999 0
-- 100000 104999 0
-- 105000 109999 0
-- 110000 114999 0
-- 115000 119999 0
-- 120000 124999 0
-- 125000 129999 1
-- 130000 134999 0
-- 135000 139999 0
-- 140000 144999 0
-- 145000 149999 0
-- 150000 154999 1
[CORRECTION/MERS]
--
-- 16-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 108390362 *******************************************************************--> 0.8372 0.5251
-- 2- 2 12751429 ********************************************************************** 0.9357 0.6487
-- 3- 4 4380637 ************************ 0.9600 0.6945
-- 5- 7 1246754 ****** 0.9744 0.7334
-- 8- 11 601027 *** 0.9806 0.7599
-- 12- 16 646517 *** 0.9848 0.7872
-- 17- 22 727511 *** 0.9898 0.8346
-- 23- 29 492906 ** 0.9952 0.9031
-- 30- 37 170817 0.9985 0.9585
-- 38- 46 34454 0.9996 0.9813
-- 47- 56 9032 0.9998 0.9871
-- 57- 67 4055 0.9999 0.9891
-- 68- 79 2503 0.9999 0.9902
-- 80- 92 1852 1.0000 0.9911
-- 93- 106 1101 1.0000 0.9919
-- 107- 121 757 1.0000 0.9924
-- 122- 137 490 1.0000 0.9928
-- 138- 154 352 1.0000 0.9931
-- 155- 172 260 1.0000 0.9933
-- 173- 191 201 1.0000 0.9935
-- 192- 211 157 1.0000 0.9937
-- 212- 232 110 1.0000 0.9938
-- 233- 254 104 1.0000 0.9940
-- 255- 277 74 1.0000 0.9941
-- 278- 301 81 1.0000 0.9942
-- 302- 326 80 1.0000 0.9943
-- 327- 352 78 1.0000 0.9944
-- 353- 379 81 1.0000 0.9945
-- 380- 407 65 1.0000 0.9947
-- 408- 436 46 1.0000 0.9948
-- 437- 466 47 1.0000 0.9949
-- 467- 497 40 1.0000 0.9950
-- 498- 529 43 1.0000 0.9951
-- 530- 562 37 1.0000 0.9952
-- 563- 596 42 1.0000 0.9953
-- 597- 631 28 1.0000 0.9954
-- 632- 667 19 1.0000 0.9955
-- 668- 704 16 1.0000 0.9956
-- 705- 742 10 1.0000 0.9956
-- 743- 781 18 1.0000 0.9957
-- 782- 821 12 1.0000 0.9957
--
-- 24559 (max occurrences)
-- 98022132 (total mers, non-unique)
-- 21074094 (distinct mers, non-unique)
-- 108390362 (unique mers)
[CORRECTION/CORRECTIONS]
--
-- Reads to be corrected:
-- 11558 reads longer than 6624 bp
-- 110220812 bp
-- Expected corrected reads:
-- 11558 reads
-- 99601872 bp
-- 4332 bp minimum length
-- 8618 bp mean length
-- 31857 bp n50 length
[TRIMMING/READS]
--
-- In gatekeeper store 'trimming/oxk_reanalyze_filt_phageHK630_cov40.gkpStore':
-- Found 12243 reads.
-- Found 100849920 bases (40.5 times coverage).
--
-- Read length histogram (one '*' equals 32.05 reads):
-- 0 999 0
-- 1000 1999 135 ****
-- 2000 2999 120 ***
-- 3000 3999 140 ****
-- 4000 4999 2015 **************************************************************
-- 5000 5999 2244 **********************************************************************
-- 6000 6999 1709 *****************************************************
-- 7000 7999 1280 ***************************************
-- 8000 8999 995 *******************************
-- 9000 9999 746 ***********************
-- 10000 10999 545 *****************
-- 11000 11999 427 *************
-- 12000 12999 360 ***********
-- 13000 13999 291 *********
-- 14000 14999 220 ******
-- 15000 15999 183 *****
-- 16000 16999 163 *****
-- 17000 17999 135 ****
-- 18000 18999 120 ***
-- 19000 19999 99 ***
-- 20000 20999 57 *
-- 21000 21999 63 *
-- 22000 22999 27
-- 23000 23999 30
-- 24000 24999 17
-- 25000 25999 24
-- 26000 26999 14
-- 27000 27999 14
-- 28000 28999 14
-- 29000 29999 12
-- 30000 30999 12
-- 31000 31999 4
-- 32000 32999 3
-- 33000 33999 1
-- 34000 34999 4
-- 35000 35999 6
-- 36000 36999 1
-- 37000 37999 2
-- 38000 38999 2
-- 39000 39999 4
-- 40000 40999 1
-- 41000 41999 1
-- 42000 42999 1
-- 43000 43999 0
-- 44000 44999 0
-- 45000 45999 1
-- 46000 46999 0
-- 47000 47999 0
-- 48000 48999 0
-- 49000 49999 0
-- 50000 50999 1
[TRIMMING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 6583745 *******************************************************************--> 0.6023 0.0654
-- 2- 2 808815 ********************************************************************** 0.6763 0.0815
-- 3- 4 535948 ********************************************** 0.7073 0.0916
-- 5- 7 310275 ************************** 0.7376 0.1062
-- 8- 11 211404 ****************** 0.7596 0.1225
-- 12- 16 179937 *************** 0.7766 0.1417
-- 17- 22 212949 ****************** 0.7925 0.1674
-- 23- 29 370652 ******************************** 0.8129 0.2133
-- 30- 37 657793 ******************************************************** 0.8491 0.3206
-- 38- 46 724870 ************************************************************** 0.9117 0.5563
-- 47- 56 282545 ************************ 0.9741 0.8450
-- 57- 67 35359 *** 0.9960 0.9665
-- 68- 79 4750 0.9985 0.9830
-- 80- 92 3853 0.9989 0.9864
-- 93- 106 3592 0.9992 0.9895
-- 107- 121 1080 0.9996 0.9930
-- 122- 137 1367 0.9997 0.9941
-- 138- 154 1693 0.9998 0.9959
-- 155- 172 192 0.9999 0.9983
-- 173- 191 120 0.9999 0.9985
-- 192- 211 213 1.0000 0.9988
-- 212- 232 156 1.0000 0.9992
-- 233- 254 19 1.0000 0.9995
-- 255- 277 12 1.0000 0.9996
-- 278- 301 7 1.0000 0.9996
-- 302- 326 5 1.0000 0.9996
-- 327- 352 18 1.0000 0.9996
-- 353- 379 0 0.0000 0.0000
-- 380- 407 0 0.0000 0.0000
-- 408- 436 0 0.0000 0.0000
-- 437- 466 0 0.0000 0.0000
-- 467- 497 2 1.0000 0.9997
-- 498- 529 0 0.0000 0.0000
-- 530- 562 1 1.0000 0.9997
-- 563- 596 2 1.0000 0.9997
-- 597- 631 7 1.0000 0.9997
-- 632- 667 2 1.0000 0.9998
-- 668- 704 0 0.0000 0.0000
-- 705- 742 0 0.0000 0.0000
-- 743- 781 0 0.0000 0.0000
-- 782- 821 0 0.0000 0.0000
--
-- 1753 (max occurrences)
-- 94009072 (total mers, non-unique)
-- 4347659 (distinct mers, non-unique)
-- 6583745 (unique mers)
[TRIMMING/TRIMMING]
-- PARAMETERS:
-- ----------
-- 1000 (reads trimmed below this many bases are deleted)
-- 0.1440 (use overlaps at or below this fraction error)
-- 1 (break region if overlap is less than this long, for 'largest covered' algorithm)
-- 1 (break region if overlap coverage is less than this many read, for 'largest covered' algorithm)
--
-- INPUT READS:
-- -----------
-- 12243 reads 100849920 bases (reads processed)
-- 0 reads 0 bases (reads not processed, previously deleted)
-- 0 reads 0 bases (reads not processed, in a library where trimming isn't allowed)
--
-- OUTPUT READS:
-- ------------
-- 9650 reads 76407005 bases (trimmed reads output)
-- 2579 reads 18653267 bases (reads with no change, kept as is)
-- 11 reads 29427 bases (reads with no overlaps, deleted)
-- 3 reads 11169 bases (reads with short trimmed length, deleted)
--
-- TRIMMING DETAILS:
-- ----------------
-- 4937 reads 2661208 bases (bases trimmed from the 5' end of a read)
-- 7881 reads 3087844 bases (bases trimmed from the 3' end of a read)
[TRIMMING/SPLITTING]
-- PARAMETERS:
-- ----------
-- 1000 (reads trimmed below this many bases are deleted)
-- 0.1440 (use overlaps at or below this fraction error)
-- INPUT READS:
-- -----------
-- 12229 reads 100809324 bases (reads processed)
-- 14 reads 40596 bases (reads not processed, previously deleted)
-- 0 reads 0 bases (reads not processed, in a library where trimming isn't allowed)
--
-- PROCESSED:
-- --------
-- 0 reads 0 bases (no overlaps)
-- 0 reads 0 bases (no coverage after adjusting for trimming done already)
-- 0 reads 0 bases (processed for chimera)
-- 0 reads 0 bases (processed for spur)
-- 12229 reads 100809324 bases (processed for subreads)
--
-- READS WITH SIGNALS:
-- ------------------
-- 0 reads 0 signals (number of 5' spur signal)
-- 0 reads 0 signals (number of 3' spur signal)
-- 0 reads 0 signals (number of chimera signal)
-- 94 reads 94 signals (number of subread signal)
--
-- SIGNALS:
-- -------
-- 0 reads 0 bases (size of 5' spur signal)
-- 0 reads 0 bases (size of 3' spur signal)
-- 0 reads 0 bases (size of chimera signal)
-- 94 reads 28870 bases (size of subread signal)
--
-- TRIMMING:
-- --------
-- 45 reads 253562 bases (trimmed from the 5' end of the read)
-- 49 reads 282368 bases (trimmed from the 3' end of the read)
[UNITIGGING/READS]
--
-- In gatekeeper store 'unitigging/oxk_reanalyze_filt_phageHK630_cov40.gkpStore':
-- Found 12229 reads.
-- Found 94524342 bases (37.96 times coverage).
--
-- Read length histogram (one '*' equals 32.95 reads):
-- 0 999 0
-- 1000 1999 134 ****
-- 2000 2999 147 ****
-- 3000 3999 232 *******
-- 4000 4999 2141 ****************************************************************
-- 5000 5999 2307 **********************************************************************
-- 6000 6999 1776 *****************************************************
-- 7000 7999 1316 ***************************************
-- 8000 8999 1004 ******************************
-- 9000 9999 729 **********************
-- 10000 10999 529 ****************
-- 11000 11999 429 *************
-- 12000 12999 322 *********
-- 13000 13999 259 *******
-- 14000 14999 182 *****
-- 15000 15999 155 ****
-- 16000 16999 137 ****
-- 17000 17999 105 ***
-- 18000 18999 103 ***
-- 19000 19999 73 **
-- 20000 20999 43 *
-- 21000 21999 42 *
-- 22000 22999 9
-- 23000 23999 13
-- 24000 24999 6
-- 25000 25999 8
-- 26000 26999 6
-- 27000 27999 7
-- 28000 28999 5
-- 29000 29999 2
-- 30000 30999 2
-- 31000 31999 3
-- 32000 32999 1
-- 33000 33999 0
-- 34000 34999 0
-- 35000 35999 1
-- 36000 36999 0
-- 37000 37999 1
[UNITIGGING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 5634635 *******************************************************************--> 0.5742 0.0598
-- 2- 2 734273 ********************************************************************* 0.6490 0.0754
-- 3- 4 494517 ********************************************** 0.6807 0.0853
-- 5- 7 292137 *************************** 0.7122 0.0997
-- 8- 11 204997 ******************* 0.7353 0.1162
-- 12- 16 182673 ***************** 0.7540 0.1364
-- 17- 22 235910 ********************** 0.7722 0.1648
-- 23- 29 431399 **************************************** 0.7977 0.2196
-- 30- 37 740960 ********************************************************************** 0.8452 0.3547
-- 38- 46 644939 ************************************************************ 0.9218 0.6305
-- 47- 56 186371 ***************** 0.9815 0.8932
-- 57- 67 15852 * 0.9973 0.9769
-- 68- 79 4920 0.9985 0.9845
-- 80- 92 3868 0.9990 0.9884
-- 93- 106 2015 0.9994 0.9919
-- 107- 121 1221 0.9996 0.9938
-- 122- 137 1886 0.9997 0.9953
-- 138- 154 469 0.9999 0.9979
-- 155- 172 119 0.9999 0.9985
-- 173- 191 169 1.0000 0.9987
-- 192- 211 218 1.0000 0.9991
-- 212- 232 24 1.0000 0.9995
-- 233- 254 6 1.0000 0.9996
-- 255- 277 13 1.0000 0.9996
-- 278- 301 6 1.0000 0.9996
-- 302- 326 18 1.0000 0.9996
-- 327- 352 1 1.0000 0.9997
-- 353- 379 0 0.0000 0.0000
-- 380- 407 0 0.0000 0.0000
-- 408- 436 0 0.0000 0.0000
-- 437- 466 1 1.0000 0.9997
-- 467- 497 1 1.0000 0.9997
-- 498- 529 0 0.0000 0.0000
-- 530- 562 3 1.0000 0.9997
-- 563- 596 6 1.0000 0.9997
-- 597- 631 2 1.0000 0.9998
-- 632- 667 1 1.0000 0.9998
-- 668- 704 0 0.0000 0.0000
-- 705- 742 0 0.0000 0.0000
-- 743- 781 0 0.0000 0.0000
-- 782- 821 0 0.0000 0.0000
--
-- 1147 (max occurrences)
-- 88632898 (total mers, non-unique)
-- 4179016 (distinct mers, non-unique)
-- 5634635 (unique mers)
[UNITIGGING/OVERLAPS]
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 1 0.01 1630.00 +- 0.00 593.00 +- 0.00 (bad trimming)
-- middle-hump 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (bad trimming)
-- no-5-prime 1 0.01 1032.00 +- 0.00 444.00 +- 0.00 (bad trimming)
-- no-3-prime 1 0.01 1740.00 +- 0.00 1116.00 +- 0.00 (bad trimming)
--
-- low-coverage 2 0.02 1107.50 +- 68.59 2.46 +- 0.52 (easy to assemble, potential for lower quality consensus)
-- unique 9071 74.18 7504.22 +- 3648.41 36.19 +- 6.66 (easy to assemble, perfect, yay)
-- repeat-cont 101 0.83 5265.41 +- 1695.23 71.87 +- 21.11 (potential for consensus errors, no impact on assembly)
-- repeat-dove 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (hard to assemble, likely won't assemble correctly or even at all)
--
-- span-repeat 2158 17.65 8714.47 +- 4229.00 2602.94 +- 2587.29 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 725 5.93 7352.06 +- 3047.25 (should be uniquely placed, low potential for consensus errors, no impact on assembly)
-- uniq-repeat-dove 46 0.38 15586.87 +- 5505.69 (will end contigs, potential to misassemble)
-- uniq-anchor 123 1.01 8634.78 +- 3405.03 2348.91 +- 2830.11 (repeat read, with unique section, probable bad read)
[UNITIGGING/ADJUSTMENT]
-- No report available.
[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
-- contigs: 31 sequences, total length 2869593 bp (including 1 repeats of total length 37069 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 1809 sequences, total length 14514783 bp.
--
-- Contig sizes based on genome size --
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1264266 1 1264266
-- 20 1264266 1 1264266
-- 30 1264266 1 1264266
-- 40 1264266 1 1264266
-- 50 1264266 1 1264266
-- 60 526253 2 1790519
-- 70 526253 2 1790519
-- 80 328024 3 2118543
-- 90 59023 5 2289615
-- 100 32956 10 2496209
-- 110 17154 21 2749093
--
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 31 sequences, total length 2871943 bp (including 1 repeats of total length 37031 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 1809 sequences, total length 14515081 bp.
--
-- Contig sizes based on genome size --
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1263438 1 1263438
-- 20 1263438 1 1263438
-- 30 1263438 1 1263438
-- 40 1263438 1 1263438
-- 50 1263438 1 1263438
-- 60 526566 2 1790004
-- 70 526566 2 1790004
-- 80 329130 3 2119134
-- 90 59182 5 2290853
-- 100 32902 10 2498423
-- 110 17189 21 2752071
--
I think the issue is that there is heterozygosity in the sample, and the default error rate, coupled with Canu being conservative when it sees sample variation, is splitting the assembly. Can you share the log files (asm.*) in unitigging/4-unitigger? Have you also tried my suggestion to increase the error rate for assembly?
Definitely! I did not find any asm.* files under unitigging/4-unitigger. These are all the files in the folder. Which ones should I share?
alignGFA.sh oxk_reanalyze_filt_phageHK630_cov40.best.singletons
oxk_reanalyze_filt_phageHK630_cov40.001.filterOverlaps.thr000.num000.log oxk_reanalyze_filt_phageHK630_cov40.best.spurs
oxk_reanalyze_filt_phageHK630_cov40.003.buildGreedy.sizes oxk_reanalyze_filt_phageHK630_cov40.contigs.aligned.gfa
oxk_reanalyze_filt_phageHK630_cov40.004.placeContains.sizes oxk_reanalyze_filt_phageHK630_cov40.contigs.aligned.gfa.err
oxk_reanalyze_filt_phageHK630_cov40.005.mergeOrphans.sizes oxk_reanalyze_filt_phageHK630_cov40.contigs.gfa
oxk_reanalyze_filt_phageHK630_cov40.005.mergeOrphans.thr000.num000.log oxk_reanalyze_filt_phageHK630_cov40.final.assembly.gfa
oxk_reanalyze_filt_phageHK630_cov40.005.mergeOrphans.unassembled oxk_reanalyze_filt_phageHK630_cov40.initial.assembly.gfa
oxk_reanalyze_filt_phageHK630_cov40.007.breakRepeats.sizes oxk_reanalyze_filt_phageHK630_cov40.unitigs.aligned.bed
oxk_reanalyze_filt_phageHK630_cov40.007.breakRepeats.thr000.num000.log oxk_reanalyze_filt_phageHK630_cov40.unitigs.aligned.bed.err
oxk_reanalyze_filt_phageHK630_cov40.008.cleanupMistakes.thr000.num000.log oxk_reanalyze_filt_phageHK630_cov40.unitigs.aligned.gfa
oxk_reanalyze_filt_phageHK630_cov40.009.generateOutputs.overlaps oxk_reanalyze_filt_phageHK630_cov40.unitigs.aligned.gfa.err
oxk_reanalyze_filt_phageHK630_cov40.009.generateOutputs.sizes oxk_reanalyze_filt_phageHK630_cov40.unitigs.bed
oxk_reanalyze_filt_phageHK630_cov40.009.generateOutputs.thr000.num000.log oxk_reanalyze_filt_phageHK630_cov40.unitigs.gfa
oxk_reanalyze_filt_phageHK630_cov40.011.generateUnitigs.thr000.num000.log unitigger.1.out
oxk_reanalyze_filt_phageHK630_cov40.best.contains.histogram unitigger.err
oxk_reanalyze_filt_phageHK630_cov40.best.edges unitigger.jobSubmit-01.out
oxk_reanalyze_filt_phageHK630_cov40.best.edges.gfa unitigger.jobSubmit-01.sh
oxk_reanalyze_filt_phageHK630_cov40.best.edges.histogram unitigger.sh
oxk_reanalyze_filt_phageHK630_cov40.best.edges.suspicious unitigger.success
Hi, I tried the smashing parameters corOutCoverage=100 overlapper=mhap utgReAlign=true correctedErrorRate=0.20 'batOptions=-dg 50 -db 50 -dr 1 -ca 500 -cp 50' and it did not work.
I haven't tried just increasing the error rate to 0.25, but I can definitely try that now.
The files named oxk_reanalyze_filt_phageHK630_cov40.* would work.
Many thanks!
Yes, it definitely looks like some variation is preventing this from being circularized and is causing the splits. Initially, the contig is the single chromosome:
cat oxk_reanalyze_filt_phageHK630_cov40.005.mergeOrphans.sizes
CONTIGS (23 tigs) (2828698 length) (122986 average) (1.14x coverage)
ng010 2153084 lg010 1 sum 2153084 (CONTIGS)
ng020 2153084 lg020 1 sum 2153084 (CONTIGS)
ng030 2153084 lg030 1 sum 2153084 (CONTIGS)
ng040 2153084 lg040 1 sum 2153084 (CONTIGS)
ng050 2153084 lg050 1 sum 2153084 (CONTIGS)
ng060 2153084 lg060 1 sum 2153084 (CONTIGS)
ng070 2153084 lg070 1 sum 2153084 (CONTIGS)
ng080 2153084 lg080 1 sum 2153084 (CONTIGS)
ng090 112049 lg090 2 sum 2265133 (CONTIGS)
ng100 36426 lg100 7 sum 2495417 (CONTIGS)
ng110 21013 lg110 17 sum 2751967 (CONTIGS)
but it is then split because it has poor support and conflicting evidence. If you want to keep that initial contig, run with: canu -assemble -nanopore-corrected <your asm folder>/*.trimmedReads.fastq.gz overlapper=mhap utgReAlign=true correctedErrorRate=0.20 'batOptions=-dg 50 -db 50 -dr 1 -ca 0 -cp 0'
Thanks for the explanation!!
I tried your suggestion
/ifs/home/lium14/tools/canu-1.6/*/bin/canu \
-assemble \
-p ${JOB_NAME}_continous_contig \
-d $output/1_assemble/${JOB_NAME}_continous_contig \
genomeSize=2.49m \
overlapper=mhap utgReAlign=true \
correctedErrorRate=0.20 \
batOptions='-dg 50 -db 50 -dr 1 -ca 0 -cp 0' \
-nanopore-corrected $output/1_assemble/$JOB_NAME/$JOB_NAME.trimmedReads.fasta.gz
But I keep getting this error:
ERROR: File supplied on command line; use -s, -pacbio-raw, -pacbio-corrected, -nanopore-raw, or -nanopore-corrected.
I did provide the correct Nanopore-corrected reads...
I'd guess it's getting confused by the directory being inside the previous run. Try renaming the trimmed reads to be something else and putting the -d folder outside the previous run.
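For illustration, a minimal sketch of that workaround with hypothetical paths; the renamed read file and new -d directory are placeholders, not from this thread:

# copy the trimmed reads out of the previous run under a new name
cp old_run/oxk.trimmedReads.fasta.gz ./oxk.corrected.fasta.gz
# run the assemble-only step with -d pointing outside the previous run
canu -assemble -p oxk_single_contig -d ./oxk_single_contig_run genomeSize=2.49m \
  overlapper=mhap utgReAlign=true correctedErrorRate=0.20 \
  batOptions='-dg 50 -db 50 -dr 1 -ca 0 -cp 0' \
  -nanopore-corrected ./oxk.corrected.fasta.gz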
Thanks. It worked.
Hi, I am using Canu 1.6 to assemble two closely related bacterial genomes of 2.5 Mb from Nanopore whole-genome sequencing. However, I was able to get a closed and circularized assembly for one genome but not the other.
The genome I was able to close had 240X coverage, while the unclosed one had 150X. For the latter, 13% of reads fall in the repeat-cont category in the unitigging step. I tried to increase the contiguity by setting corOutCoverage=100, but it actually worsened the assembly. I wonder why the repeat-cont fraction in the unitigging step differs so much between the two assemblies (13% vs 0.07%). Any suggestions will be highly appreciated! I put the report for the closed genome (first) and the unclosed genome (second) below.
Thanks a lot!