Closed sneupane01 closed 4 years ago
I'm not sure what you're asking, why you get more than 1 contig? Do you know if your genome has plasmids, it's also possible it has control sequence or other non-clonal variants. At least two of your contigs (tig00000045 and tig00000046) have many reads and are circular and in the right size for a plasmid. I would not discard any of the contigs for downstream analysis, I would also suggest blasting them and comparing them to the main chromosome scaffold you have (to decide if they are contained or independent plasmids).
Thank you very much. So I am in right direction then? Should I have to look for other parameters, as this bacteria has highly repetitive regions?
No other parameters necessary, if you've got a 1.5 mb circular contig, you've likely got the chromosome assembled. The other contigs are like I said plasmids or variants, no parameters would incorporate them into the main chromosome. You should figure out what's in the contigs you have rather than trying to change parameters.
Yes. But 1.5Mb contig is not getting circular using either circlator/minimus2. My concern is did it capture whole assembly.
Ah I thought circulator gave you a circular contig.I just noticed, this is CCS data? Run with -pacbio-hifi in the 1.9 release instead.
It is from CCS. I am using Canu branch v1.9 +314 changes (r9524 b6c7bd2afc76b69e88cf781294e5d82246c68959). Is this the latest?
Use the posted release version instead, I don't think this is the latest and I presume you're not giving the data as -pacbio-hifi
but -pacbio-corrected
, that will change quite a few parameters.
Thank you. I will ask HPC to update.
Any updates?
I tried with -pacbio-hifi and it produced too many contigs. As I might need to change many parameters as I did earlier to acheive results from -pacbio-corrected. I want to continue with the results I got from -pacbio-corrected. But I am getting difficulty in acheiving circularity.
I did second denovo after mapping contigs back to reads using ngmlr. Canu Result:
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 5 sequences, total length 1634144 bp (including 0 repeats of total length 0 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 739 sequences, total length 4455931 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1538523 1 1538523
-- 20 1538523 1 1538523
-- 30 1538523 1 1538523
-- 40 1538523 1 1538523
-- 50 1538523 1 1538523
-- 60 1538523 1 1538523
-- 70 1538523 1 1538523
-- 80 1538523 1 1538523
-- 90 1538523 1 1538523
-- 100 1538523 1 1538523
Infoseq:
[sneupane@login3 NGMLR_1000_0.015_500_Analysis_15_good]$ infoseq wDi.contigs.fasta
Display basic information about sequences
USA Database Name Accession Type Length %GC Organism Description
fasta::wDi.contigs.fasta:tig00000001 - tig00000001 - N 1538523 33.92 len=1538523 reads=8370 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000002 - tig00000002 - N 40440 35.64 len=40440 reads=189 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000006 - tig00000006 - N 33139 33.80 len=33139 reads=135 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000007 - tig00000007 - N 6784 33.62 len=6784 reads=3 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000029 - tig00000029 - N 15258 37.68 len=15258 reads=72 class=contig suggestRepeat=no suggestCircular=yes
Still I am trying to circularize the 1.53Mb contig. But it fails.
I don't see why mapping would change anything about your assembly, it's going to be the same reads you're assembling before. Numbers of contigs don't really matter if one of them is circular, did you check that for the -pacbio-hifi results? Post the report from the run.
You should be looking at the gfa outputs, the unitigs.gfa file is what you want. That will show you if you have a circular path in the current contigs and give you an idea of what you'd need to join to circularize and potentially why it's not being resolved. It's possible your reads are too short for the repeats in your genome.
A] hifi results:
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 14 sequences, total length 1829955 bp (including 0 repeats of total length 0 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 1369 sequences, total length 8446160 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1444849 1 1444849
-- 20 1444849 1 1444849
-- 30 1444849 1 1444849
-- 40 1444849 1 1444849
-- 50 1444849 1 1444849
-- 60 1444849 1 1444849
-- 70 1444849 1 1444849
-- 80 1444849 1 1444849
-- 90 1444849 1 1444849
-- 100 152655 2 1597504
-- 110 102010 3 1699514
-- 120 9162 10 1805030
This assembly should be refined using many parameters.
B] Right now I am assembling from different samples (six samples; same bacteria collecting from different passages and extraction methods). So far in one of the sample I have found such .gfa vizualized from bandage: It shows almost circular. My question is how can I make it circular?
If possible I can send you the contigs file. Let me know if possible.
Given that you consistently get a non-circular assembly, I don't think you can circularize this with your data. There is likely a repeat that is too large or a structural variant present in a subset of your population. The report includes some overlap stats which may give a better idea of the size of the repeat (or the size of that extra contig in the almost circular unitig gfa). Most likely you need longer reads to resolve this genome into a circle.
We are still making some improvements to hifi-based assembly so if you're able to share your data (see the FAQ for how to send it), we could try more recent assembler versions locally.
Thank you for the suggestion: Overlap report for above gfa:
[UNITIGGING/OVERLAPS]
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 2 0.02 5525.50 +- 1902.82 1339.50 +- 1311.68 (bad trimming)
-- middle-hump 2 0.02 4298.00 +- 1733.83 0.00 +- 0.00 (bad trimming)
-- no-5-prime 3 0.03 6196.33 +- 229.44 224.33 +- 388.56 (bad trimming)
-- no-3-prime 5 0.05 5868.20 +- 162.32 237.80 +- 374.81 (bad trimming)
--
-- low-coverage 16 0.17 4177.62 +- 2015.85 4.10 +- 1.72 (easy to assemble, potential for lower quality consensus)
-- unique 8578 89.01 6065.91 +- 378.91 35.30 +- 6.89 (easy to assemble, perfect, yay)
-- repeat-cont 134 1.39 5688.00 +- 912.32 76.80 +- 7.40 (potential for consensus errors, no impact on assembly)
-- repeat-dove 37 0.38 6484.08 +- 419.43 77.21 +- 7.84 (hard to assemble, likely won't assemble correctly or even
at all)
--
-- span-repeat 244 2.53 6038.90 +- 1006.24 2877.12 +- 1726.14 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 395 4.10 5847.45 +- 479.29 (should be uniquely placed, low potential for consensus err
ors, no impact on assembly)
-- uniq-repeat-dove 214 2.22 6307.22 +- 490.93 (will end contigs, potential to misassemble)
-- uniq-anchor 6 0.06 5314.17 +- 776.05 2751.00 +- 1438.69 (repeat read, with unique section, probable bad read)
Please suggest if anything you can get.
Another question: This might not be related particular to canu but I achieved one circular 1.41Mb from another set of sample using flye -subassembly in which I sub-assembled contig outputs from CANU, unicycler and HGAP4. My question is is it the false circularization? Generally, the size for this bacteria ranges from 1.2-1.5Mb. This specific bacteria is never been fully resolved before. How can I make sure this is the size I obtained as complete. I used anvio and it says 93.4% complete with 0% duplications. It might have compared to genus of this bacteria. I am posting the gfa image for the circular one:
Another question Is it possible to publish this genome as I have seen many papers publishing the papers with multiple contigs (without circularization)? Here we have achieved at least the sizeable genome?
Another question We are planning to perform illumina sequencing again for this bacteria. What do you suggest? either illumina or pacbio (with long reads)?
I have upload the reads.fastq file in the incoming/sergek. Please check the file (if other file is present, it is the same file). It should be around 488616081 bytes.
I would appreciate with your upcoming suggestions. Thank You
I don't generally trust assemblies of assemblies. If none of the three assemblers you tried independently gave you a circularized genome, then I wouldn't trust the combination of them.
I'd expect you can publish this genome as a genome announcement, that's the most common publication of bacteria I've seen.
Do you mean illumina to improve base quality or for assembly? Illumina is certainly not going to give you a circular genome. I'd suggest using PacBio CLR or nanopore reads targeting ones >5 or 10kb.
Your reads are quite short with an N50 and mean <5kb. Typically, the longest bacterial repeats are about 7kb so you want reads to be >=7kb. You've only got about 5x of reads >7kb right now. From your log you have at least some 6kb+ repeats, these may be not resolved depending on how lucky you got with this 5x of longer reads you have. I'll see what we get for an assembly locally but I expect your reads are too short.
I was able to circularize your data by being conservative with the error rate. These data seem like a mix of older and newer data, or at least their is a large variation in both read length and quality. I also enabled trimming which is normally not needed for HiFi data but since you have so much coverage and there were reads with poor support from other data, it was helpful. The error rate was the key parameter to reduce, I saw the same 1.5mb circular contig with a minimum read length of 1000, 2000, 3000, or 4000.
The command I used with 1.9 was:
canu -p asm -d asm -trim-assemble genomeSize=1.5m correctedErrorRate=0.001 cnsErrorRate=0.050 minReadLength=3000 -pacbio-hifi reads.fastq
Here is the tail of the log:
[UNITIGGING/OVERLAPS]
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 2 0.01 6149.50 +- 169.00 114.00 +- 35.36 (bad trimming)
-- middle-hump 0 0.00 0.00 +- 0.00 0.00 +- 0.00 (bad trimming)
-- no-5-prime 2 0.01 3135.00 +- 111.72 59.00 +- 33.94 (bad trimming)
-- no-3-prime 2 0.01 3395.50 +- 379.72 108.50 +- 153.44 (bad trimming)
--
-- low-coverage 23 0.08 3526.09 +- 351.13 10.55 +- 3.18 (easy to assemble, potential for lower quality consensus)
-- unique 25527 84.00 3654.24 +- 463.30 92.12 +- 11.29 (easy to assemble, perfect, yay)
-- repeat-cont 1799 5.92 3560.43 +- 401.57 198.95 +- 33.95 (potential for consensus errors, no impact on assembly)
-- repeat-dove 142 0.47 4507.04 +- 285.66 198.55 +- 33.23 (hard to assemble, likely won't assemble correctly or even at all)
--
-- span-repeat 1285 4.23 3699.26 +- 465.60 692.58 +- 731.21 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 1309 4.31 3553.17 +- 378.88 (should be uniquely placed, low potential for consensus errors, no impact on assembly)
-- uniq-repeat-dove 280 0.92 4317.03 +- 327.99 (will end contigs, potential to misassemble)
-- uniq-anchor 19 0.06 3893.47 +- 558.89 1893.63 +- 659.25 (repeat read, with unique section, probable bad read)
[UNITIGGING/ADJUSTMENT]
-- No report available.
[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
-- contigs: 10 sequences, total length 1195094 bp (including 0 repeats of total length 0 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 5348 sequences, total length 18924789 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1026753 1 1026753
-- 20 1026753 1 1026753
-- 30 1026753 1 1026753
-- 40 1026753 1 1026753
-- 50 1026753 1 1026753
-- 60 1026753 1 1026753
-- 70 76712 2 1103465
--
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 10 sequences, total length 1792775 bp (including 0 repeats of total length 0 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 5348 sequences, total length 27649376 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1518502 1 1518502
-- 20 1518502 1 1518502
-- 30 1518502 1 1518502
-- 40 1518502 1 1518502
-- 50 1518502 1 1518502
-- 60 1518502 1 1518502
-- 70 1518502 1 1518502
-- 80 1518502 1 1518502
-- 90 1518502 1 1518502
-- 100 1518502 1 1518502
-- 110 53768 3 1703412
tig00000001 len=1518502 reads=21871 class=contig suggestRepeat=no suggestCircular=yes
tig00000002 len=8710 reads=4 class=contig suggestRepeat=no suggestCircular=no
tig00000015 len=131142 reads=1793 class=contig suggestRepeat=no suggestCircular=no
tig00000016 len=10946 reads=63 class=contig suggestRepeat=no suggestCircular=no
tig00000017 len=53768 reads=804 class=contig suggestRepeat=no suggestCircular=no
tig00000018 len=5823 reads=3 class=contig suggestRepeat=no suggestCircular=no
tig00000030 len=19895 reads=459 class=contig suggestRepeat=no suggestCircular=yes
tig00000031 len=15241 reads=15 class=contig suggestRepeat=no suggestCircular=no
tig00000032 len=8555 reads=5 class=contig suggestRepeat=no suggestCircular=no
tig00000036 len=20193 reads=24 class=contig suggestRepeat=no suggestCircular=no
However, I'd still suggest you validate this assembly with orthogonal data (e.g. long pacbio CLR, nanopore, lab methods). Getting longer reads in general would also be helpful here as I mentioned before.
Thank you very much. So, I will try this parameters with other five samples and keep you updated.
@skoren. Thank you for the parameters. I am able to get circular contig for the three samples with same parameters.
However I am not being able to get circular for other three samples.I tried changing correctedErrorRate from 0.001 to 0.005. The value 0.0025 gave the required contig size but it is more than expected size and not circular.
canu -p wDi -d wDi -trim-assemble genomeSize=1.5m correctedErrorRate=0.0025 cnsErrorRate=0.050 minReadLength=3000 useGrid=false -pacbio-hifi reads1.fastq
Example:
[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
-- contigs: 7 sequences, total length 1175950 bp (including 0 repeats of total length 0 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 2750 sequences, total length 9932113 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1039365 1 1039365
-- 20 1039365 1 1039365
-- 30 1039365 1 1039365
-- 40 1039365 1 1039365
-- 50 1039365 1 1039365
-- 60 1039365 1 1039365
-- 70 71087 2 1110452
--
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 7 sequences, total length 1870024 bp (including 0 repeats of total length 0 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 2750 sequences, total length 15090775 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1647707 1 1647707
-- 20 1647707 1 1647707
-- 30 1647707 1 1647707
-- 40 1647707 1 1647707
-- 50 1647707 1 1647707
-- 60 1647707 1 1647707
-- 70 1647707 1 1647707
-- 80 1647707 1 1647707
-- 90 1647707 1 1647707
-- 100 1647707 1 1647707
-- 110 108666 2 1756373
-- 120 51366 3 1807739
What other parameters can be helpful? Please let me know.
There's no guarantee all samples can be circularized, your reads lengths are very short. You can try sweeping the error rate together with the minimum read length and try running circlator or mummer to check for circularity yourself (in case the contig is circular but canu didn't mark it as such).
@skoren Is it possible I send three read files and look upon it? Let me know. I would greatly appreciate it. I did not have faced so difficulty in assembling before. I have sent them. Thank You.
@skoren I have one question regarding the output contig (obtained above) as suggested by CANU as circular. I polished the suggested circular contig using minimap2-racon. Now, should I have to use circlator or minimus2 to get circular? I used circlator and could not get the circular in the log file. I assume the contig is already circular. Please correct me.
Polishing for hifi data is not needed, the canu output is going to be over Q40, likely close to Q50 already. Haven't personally used circlator so not sure what it's requirements are. It's possible racon removed the ends of the contig that overlapped so it is no longer circular.
If you're going to polish I suggest using minipolish which should handle the circularization . You can also check circularity yourself for all your assemblies (including ones canu didn't mark as circular) using mummer, something like:
nucmer -maxmatch -nosimplify asm.fa asm.fa
show-coords -lrcTH out.delta
and look for matches at the start of the large contig to its end. Then just remove the overlapping coordinates yourself.
@skoren These are the gfa files obtained from pacbio-hifi that suggested the required contig as circular. But I don't see the particular suggestive circular path in the bandage. Also, I ran mummer to the suggestive circular contig directly without polishing:
Sample4
1 2899 1515601 1518502 2899 2902 99.90 1518502 1518502 0.19 0.19 tig00000001 tig00000001 1515601 1518502 1 2899 2902 2899 99.90 1518502 1518502 0.19 0.19 tig00000001 tig00000001
So the required length is 1-1515601? Please suggest me.
For other two samples I could not find such overlap. I will install minipolish and update you.
@skoren How to use minipolish just for the suggestive circular contig without using miniasm assembly?
@skoren Thank You very much.
Hi, I used following parameters for canu assembly:
I got an expected sized contig of 1.5Mb. But I could not understand why it is not possible to obtain as '1 sequences, total length ......'. I changed many parameters. And I see lot of small pieces inside contains.fasta file:
Should I have to include other pieces in further analysis? I used circlator and it 1.53Mb got merged with 1.3kb contig. Please suggest how to proceed.
Thank You