marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
659 stars 179 forks source link

Canu output!! #1550

Closed sneupane01 closed 4 years ago

sneupane01 commented 5 years ago

Hi, I used following parameters for canu assembly:

canu -p wDi -d ALL_wDi_canu_seqtk45_CCS_1000_0.015_1000 genomeSize=1.5m correctedErrorRate=0.015 ovlMerThreshold=1000 batOptions="-eg 0.01 -eM 0.01 -dg 6 -db 6 -dr 1 -ca 50 -cp 5" useGrid=false m54115_190802_131940.Q20_seqtk_leftrim45_filtlong.fastq
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      9 sequences, total length 1689405 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  880 sequences, total length 4506091 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1535508             1     1535508
--     20     1535508             1     1535508
--     30     1535508             1     1535508
--     40     1535508             1     1535508
--     50     1535508             1     1535508
--     60     1535508             1     1535508
--     70     1535508             1     1535508
--     80     1535508             1     1535508
--     90     1535508             1     1535508
--    100     1535508             1     1535508
--    110       15841                5     1658888

I got an expected sized contig of 1.5Mb. But I could not understand why it is not possible to obtain as '1 sequences, total length ......'. I changed many parameters. And I see lot of small pieces inside contains.fasta file:

fasta::wDi.contigs.fasta:tig00000001 -              tig00000001    -              N    **1535508** 33.93                      len=1535508 reads=6174 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000003 -              tig00000003    -              N    13720  32.17                      len=13720 reads=17 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000006 -              tig00000006    -              N    44463  35.57                      len=44463 reads=135 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000008 -              tig00000008    -              N    9694   31.77                      len=9694 reads=9 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000010 -              tig00000010    -              N    39159  33.95                      len=39159 reads=149 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000045 -              tig00000045    -              N    23917  18.16                      len=23917 reads=60 class=contig suggestRepeat=no suggestCircular=yes
fasta::wDi.contigs.fasta:tig00000046 -              tig00000046    -              N    15841  37.03                      len=15841 reads=144 class=contig suggestRepeat=no suggestCircular=yes
fasta::wDi.contigs.fasta:tig00000047 -              tig00000047    -              N    4635   35.81                      len=4635 reads=8 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000061 -              tig00000061    -              N    2468   37.24                      len=2468 reads=3 class=contig suggestRepeat=no suggestCircular=no

Should I have to include other pieces in further analysis? I used circlator and it 1.53Mb got merged with 1.3kb contig. Please suggest how to proceed.

Thank You

skoren commented 5 years ago

I'm not sure what you're asking, why you get more than 1 contig? Do you know if your genome has plasmids, it's also possible it has control sequence or other non-clonal variants. At least two of your contigs (tig00000045 and tig00000046) have many reads and are circular and in the right size for a plasmid. I would not discard any of the contigs for downstream analysis, I would also suggest blasting them and comparing them to the main chromosome scaffold you have (to decide if they are contained or independent plasmids).

sneupane01 commented 5 years ago

Thank you very much. So I am in right direction then? Should I have to look for other parameters, as this bacteria has highly repetitive regions?

skoren commented 5 years ago

No other parameters necessary, if you've got a 1.5 mb circular contig, you've likely got the chromosome assembled. The other contigs are like I said plasmids or variants, no parameters would incorporate them into the main chromosome. You should figure out what's in the contigs you have rather than trying to change parameters.

sneupane01 commented 5 years ago

Yes. But 1.5Mb contig is not getting circular using either circlator/minimus2. My concern is did it capture whole assembly.

skoren commented 5 years ago

Ah I thought circulator gave you a circular contig.I just noticed, this is CCS data? Run with -pacbio-hifi in the 1.9 release instead.

sneupane01 commented 5 years ago

It is from CCS. I am using Canu branch v1.9 +314 changes (r9524 b6c7bd2afc76b69e88cf781294e5d82246c68959). Is this the latest?

skoren commented 5 years ago

Use the posted release version instead, I don't think this is the latest and I presume you're not giving the data as -pacbio-hifi but -pacbio-corrected, that will change quite a few parameters.

sneupane01 commented 5 years ago

Thank you. I will ask HPC to update.

skoren commented 4 years ago

Any updates?

sneupane01 commented 4 years ago

I tried with -pacbio-hifi and it produced too many contigs. As I might need to change many parameters as I did earlier to acheive results from -pacbio-corrected. I want to continue with the results I got from -pacbio-corrected. But I am getting difficulty in acheiving circularity.

sneupane01 commented 4 years ago

I did second denovo after mapping contigs back to reads using ngmlr. Canu Result:

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      5 sequences, total length 1634144 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  739 sequences, total length 4455931 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1538523             1     1538523
--     20     1538523             1     1538523
--     30     1538523             1     1538523
--     40     1538523             1     1538523
--     50     1538523             1     1538523
--     60     1538523             1     1538523
--     70     1538523             1     1538523
--     80     1538523             1     1538523
--     90     1538523             1     1538523
--    100     1538523             1     1538523

Infoseq:

[sneupane@login3 NGMLR_1000_0.015_500_Analysis_15_good]$ infoseq wDi.contigs.fasta
Display basic information about sequences
USA                      Database  Name           Accession      Type Length %GC    Organism            Description 
fasta::wDi.contigs.fasta:tig00000001 -              tig00000001    -              N    1538523 33.92                      len=1538523 reads=8370 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000002 -              tig00000002    -              N    40440  35.64                      len=40440 reads=189 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000006 -              tig00000006    -              N    33139  33.80                      len=33139 reads=135 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000007 -              tig00000007    -              N    6784   33.62                      len=6784 reads=3 class=contig suggestRepeat=no suggestCircular=no
fasta::wDi.contigs.fasta:tig00000029 -              tig00000029    -              N    15258  37.68                      len=15258 reads=72 class=contig suggestRepeat=no suggestCircular=yes

Still I am trying to circularize the 1.53Mb contig. But it fails.

skoren commented 4 years ago

I don't see why mapping would change anything about your assembly, it's going to be the same reads you're assembling before. Numbers of contigs don't really matter if one of them is circular, did you check that for the -pacbio-hifi results? Post the report from the run.

You should be looking at the gfa outputs, the unitigs.gfa file is what you want. That will show you if you have a circular path in the current contigs and give you an idea of what you'd need to join to circularize and potentially why it's not being resolved. It's possible your reads are too short for the repeats in your genome.

sneupane01 commented 4 years ago

A] hifi results:

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      14 sequences, total length 1829955 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  1369 sequences, total length 8446160 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1444849             1     1444849
--     20     1444849             1     1444849
--     30     1444849             1     1444849
--     40     1444849             1     1444849
--     50     1444849             1     1444849
--     60     1444849             1     1444849
--     70     1444849             1     1444849
--     80     1444849             1     1444849
--     90     1444849             1     1444849
--    100      152655             2     1597504
--    110      102010             3     1699514
--    120        9162            10     1805030

This assembly should be refined using many parameters.

B] Right now I am assembling from different samples (six samples; same bacteria collecting from different passages and extraction methods). So far in one of the sample I have found such .gfa vizualized from bandage: Screen Shot 2019-12-02 at 3 16 44 PM It shows almost circular. My question is how can I make it circular?

sneupane01 commented 4 years ago

If possible I can send you the contigs file. Let me know if possible.

skoren commented 4 years ago

Given that you consistently get a non-circular assembly, I don't think you can circularize this with your data. There is likely a repeat that is too large or a structural variant present in a subset of your population. The report includes some overlap stats which may give a better idea of the size of the repeat (or the size of that extra contig in the almost circular unitig gfa). Most likely you need longer reads to resolve this genome into a circle.

We are still making some improvements to hifi-based assembly so if you're able to share your data (see the FAQ for how to send it), we could try more recent assembler versions locally.

sneupane01 commented 4 years ago

Thank you for the suggestion: Overlap report for above gfa:

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing          2    0.02     5525.50 +- 1902.82       1339.50 +- 1311.68    (bad trimming)
--   middle-hump             2    0.02     4298.00 +- 1733.83          0.00 +- 0.00       (bad trimming)
--   no-5-prime              3    0.03     6196.33 +- 229.44         224.33 +- 388.56     (bad trimming)
--   no-3-prime              5    0.05     5868.20 +- 162.32         237.80 +- 374.81     (bad trimming)
--   
--   low-coverage           16    0.17     4177.62 +- 2015.85          4.10 +- 1.72       (easy to assemble, potential for lower quality consensus)
--   unique               8578   89.01     6065.91 +- 378.91          35.30 +- 6.89       (easy to assemble, perfect, yay)
--   repeat-cont           134    1.39     5688.00 +- 912.32          76.80 +- 7.40       (potential for consensus errors, no impact on assembly)
--   repeat-dove            37    0.38     6484.08 +- 419.43          77.21 +- 7.84       (hard to assemble, likely won't assemble correctly or even 
at all)
--   
--   span-repeat           244    2.53     6038.90 +- 1006.24       2877.12 +- 1726.14    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont      395    4.10     5847.45 +- 479.29                              (should be uniquely placed, low potential for consensus err
ors, no impact on assembly)
--   uniq-repeat-dove      214    2.22     6307.22 +- 490.93                              (will end contigs, potential to misassemble)
--   uniq-anchor             6    0.06     5314.17 +- 776.05        2751.00 +- 1438.69    (repeat read, with unique section, probable bad read)

Please suggest if anything you can get.

Another question: This might not be related particular to canu but I achieved one circular 1.41Mb from another set of sample using flye -subassembly in which I sub-assembled contig outputs from CANU, unicycler and HGAP4. My question is is it the false circularization? Generally, the size for this bacteria ranges from 1.2-1.5Mb. This specific bacteria is never been fully resolved before. How can I make sure this is the size I obtained as complete. I used anvio and it says 93.4% complete with 0% duplications. It might have compared to genus of this bacteria. I am posting the gfa image for the circular one: Screen Shot 2019-12-02 at 4 04 55 PM

Another question Is it possible to publish this genome as I have seen many papers publishing the papers with multiple contigs (without circularization)? Here we have achieved at least the sizeable genome?

Another question We are planning to perform illumina sequencing again for this bacteria. What do you suggest? either illumina or pacbio (with long reads)?

I have upload the reads.fastq file in the incoming/sergek. Please check the file (if other file is present, it is the same file). It should be around 488616081 bytes.

I would appreciate with your upcoming suggestions. Thank You

skoren commented 4 years ago
  1. I don't generally trust assemblies of assemblies. If none of the three assemblers you tried independently gave you a circularized genome, then I wouldn't trust the combination of them.

  2. I'd expect you can publish this genome as a genome announcement, that's the most common publication of bacteria I've seen.

  3. Do you mean illumina to improve base quality or for assembly? Illumina is certainly not going to give you a circular genome. I'd suggest using PacBio CLR or nanopore reads targeting ones >5 or 10kb.

  4. Your reads are quite short with an N50 and mean <5kb. Typically, the longest bacterial repeats are about 7kb so you want reads to be >=7kb. You've only got about 5x of reads >7kb right now. From your log you have at least some 6kb+ repeats, these may be not resolved depending on how lucky you got with this 5x of longer reads you have. I'll see what we get for an assembly locally but I expect your reads are too short.

skoren commented 4 years ago

I was able to circularize your data by being conservative with the error rate. These data seem like a mix of older and newer data, or at least their is a large variation in both read length and quality. I also enabled trimming which is normally not needed for HiFi data but since you have so much coverage and there were reads with poor support from other data, it was helpful. The error rate was the key parameter to reduce, I saw the same 1.5mb circular contig with a minimum read length of 1000, 2000, 3000, or 4000.

The command I used with 1.9 was: canu -p asm -d asm -trim-assemble genomeSize=1.5m correctedErrorRate=0.001 cnsErrorRate=0.050 minReadLength=3000 -pacbio-hifi reads.fastq

Here is the tail of the log:

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing          2    0.01     6149.50 +- 169.00         114.00 +- 35.36      (bad trimming)
--   middle-hump             0    0.00        0.00 +- 0.00             0.00 +- 0.00       (bad trimming)
--   no-5-prime              2    0.01     3135.00 +- 111.72          59.00 +- 33.94      (bad trimming)
--   no-3-prime              2    0.01     3395.50 +- 379.72         108.50 +- 153.44     (bad trimming)
--   
--   low-coverage           23    0.08     3526.09 +- 351.13          10.55 +- 3.18       (easy to assemble, potential for lower quality consensus)
--   unique              25527   84.00     3654.24 +- 463.30          92.12 +- 11.29      (easy to assemble, perfect, yay)
--   repeat-cont          1799    5.92     3560.43 +- 401.57         198.95 +- 33.95      (potential for consensus errors, no impact on assembly)
--   repeat-dove           142    0.47     4507.04 +- 285.66         198.55 +- 33.23      (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat          1285    4.23     3699.26 +- 465.60         692.58 +- 731.21     (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont     1309    4.31     3553.17 +- 378.88                              (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove      280    0.92     4317.03 +- 327.99                              (will end contigs, potential to misassemble)
--   uniq-anchor            19    0.06     3893.47 +- 558.89        1893.63 +- 659.25     (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      10 sequences, total length 1195094 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  5348 sequences, total length 18924789 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1026753             1     1026753
--     20     1026753             1     1026753
--     30     1026753             1     1026753
--     40     1026753             1     1026753
--     50     1026753             1     1026753
--     60     1026753             1     1026753
--     70       76712             2     1103465
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      10 sequences, total length 1792775 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  5348 sequences, total length 27649376 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1518502             1     1518502
--     20     1518502             1     1518502
--     30     1518502             1     1518502
--     40     1518502             1     1518502
--     50     1518502             1     1518502
--     60     1518502             1     1518502
--     70     1518502             1     1518502
--     80     1518502             1     1518502
--     90     1518502             1     1518502
--    100     1518502             1     1518502
--    110       53768             3     1703412

tig00000001 len=1518502 reads=21871 class=contig suggestRepeat=no suggestCircular=yes
tig00000002 len=8710 reads=4 class=contig suggestRepeat=no suggestCircular=no
tig00000015 len=131142 reads=1793 class=contig suggestRepeat=no suggestCircular=no
tig00000016 len=10946 reads=63 class=contig suggestRepeat=no suggestCircular=no
tig00000017 len=53768 reads=804 class=contig suggestRepeat=no suggestCircular=no
tig00000018 len=5823 reads=3 class=contig suggestRepeat=no suggestCircular=no
tig00000030 len=19895 reads=459 class=contig suggestRepeat=no suggestCircular=yes
tig00000031 len=15241 reads=15 class=contig suggestRepeat=no suggestCircular=no
tig00000032 len=8555 reads=5 class=contig suggestRepeat=no suggestCircular=no
tig00000036 len=20193 reads=24 class=contig suggestRepeat=no suggestCircular=no

However, I'd still suggest you validate this assembly with orthogonal data (e.g. long pacbio CLR, nanopore, lab methods). Getting longer reads in general would also be helpful here as I mentioned before.

sneupane01 commented 4 years ago

Thank you very much. So, I will try this parameters with other five samples and keep you updated.

sneupane01 commented 4 years ago

@skoren. Thank you for the parameters. I am able to get circular contig for the three samples with same parameters.

However I am not being able to get circular for other three samples.I tried changing correctedErrorRate from 0.001 to 0.005. The value 0.0025 gave the required contig size but it is more than expected size and not circular.

canu -p wDi -d wDi -trim-assemble genomeSize=1.5m correctedErrorRate=0.0025 cnsErrorRate=0.050 minReadLength=3000 useGrid=false -pacbio-hifi reads1.fastq

Example:

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      7 sequences, total length 1175950 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  2750 sequences, total length 9932113 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1039365             1     1039365
--     20     1039365             1     1039365
--     30     1039365             1     1039365
--     40     1039365             1     1039365
--     50     1039365             1     1039365
--     60     1039365             1     1039365
--     70       71087             2     1110452
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      7 sequences, total length 1870024 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  2750 sequences, total length 15090775 bp.
--
-- Contig sizes based on genome size 1.5mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1647707             1     1647707
--     20     1647707             1     1647707
--     30     1647707             1     1647707
--     40     1647707             1     1647707
--     50     1647707             1     1647707
--     60     1647707             1     1647707
--     70     1647707             1     1647707
--     80     1647707             1     1647707
--     90     1647707             1     1647707
--    100     1647707             1     1647707
--    110      108666             2     1756373
--    120       51366             3     1807739

What other parameters can be helpful? Please let me know.

skoren commented 4 years ago

There's no guarantee all samples can be circularized, your reads lengths are very short. You can try sweeping the error rate together with the minimum read length and try running circlator or mummer to check for circularity yourself (in case the contig is circular but canu didn't mark it as such).

sneupane01 commented 4 years ago

@skoren Is it possible I send three read files and look upon it? Let me know. I would greatly appreciate it. I did not have faced so difficulty in assembling before. I have sent them. Thank You.

sneupane01 commented 4 years ago

@skoren I have one question regarding the output contig (obtained above) as suggested by CANU as circular. I polished the suggested circular contig using minimap2-racon. Now, should I have to use circlator or minimus2 to get circular? I used circlator and could not get the circular in the log file. I assume the contig is already circular. Please correct me.

skoren commented 4 years ago

Polishing for hifi data is not needed, the canu output is going to be over Q40, likely close to Q50 already. Haven't personally used circlator so not sure what it's requirements are. It's possible racon removed the ends of the contig that overlapped so it is no longer circular.

If you're going to polish I suggest using minipolish which should handle the circularization . You can also check circularity yourself for all your assemblies (including ones canu didn't mark as circular) using mummer, something like:

nucmer -maxmatch -nosimplify asm.fa asm.fa
show-coords -lrcTH out.delta

and look for matches at the start of the large contig to its end. Then just remove the overlapping coordinates yourself.

sneupane04 commented 4 years ago

@skoren These are the gfa files obtained from pacbio-hifi that suggested the required contig as circular. But I don't see the particular suggestive circular path in the bandage. Also, I ran mummer to the suggestive circular contig directly without polishing:

Sample4

1 2899 1515601 1518502 2899 2902 99.90 1518502 1518502 0.19 0.19 tig00000001 tig00000001 1515601 1518502 1 2899 2902 2899 99.90 1518502 1518502 0.19 0.19 tig00000001 tig00000001

So the required length is 1-1515601? Please suggest me.

For other two samples I could not find such overlap. I will install minipolish and update you.

gfafiles.zip

sneupane01 commented 4 years ago

@skoren How to use minipolish just for the suggestive circular contig without using miniasm assembly?

skoren commented 4 years ago
  1. Unfortunately in 1.9 and hifi data you cannot use the gfa output, it is a known issue.
  2. Yes, the trim you listed is correct.
  3. I haven't used minipolish but since you're able to use mummer to find the overlap and trim points you don't need it for your dataset, just trim the contig and don't polish. It's not going to make something circular if it is not.
sneupane01 commented 4 years ago

@skoren Thank You very much.