marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
654 stars 179 forks source link

Genome duplication(2.4x) when assembling using CCS/Hifi reads #1624

Closed mbhall88 closed 4 years ago

mbhall88 commented 4 years ago

Moving this into its own issue. I am attempting to assemble a Mycobacterium tuberculosis sample (which I actually have Illumina, nanopore, and CCS for). My expected genome size is ~4.4Mb, but the assembly I get from Canu ends up with 332 contigs, and a genome size of ~10.7Mb (2.424 duplication). The Canu assembly I get from the nanopore reads is 31 contigs and a genome size ~6Mb (1.36 duplication). In contrast, using Flye with the CCS data I get a single contig that is very close to my expected genome size and for ONT data I get 3 contigs slightly bigger than expected genome size.

Mtb has a fairly GC-rich genome (~65%) and has some fairly repetitive regions. Given this, I also tried the assembly with Canu setting corMaxEvidenceErate=0.15 (as recommended in the docs), but it didn't make any noticeable difference.

I am working off https://github.com/marbl/canu/commit/7ae7a097fab4819ab4d29bdcce0f38aa744f571c

Originally posted by @mbhall88 in https://github.com/marbl/canu/issues/1586#issuecomment-587388733

mbhall88 commented 4 years ago

As requested here by @skoren , this is my log file. If you also want the data, let me know and I am happy to provide that too.

canu.log

I will also add, in response to https://github.com/marbl/canu/issues/1586#issuecomment-587482471

.... Your input is likely not clonal and the larger genome size is due to Canu separating the variants present in the sample. I would guess there is one large contig in the Canu assembly which is close to the genome size with extra contigs representing variants, some of which may be marked as bubbles. We recommend using purge_dups after a HiFi assembly, normally this is only needed for eukaryotes or diploid samples but is also necessary for non-clonal bacterial samples.

I am quite certain that this sample is clonal.

skoren commented 4 years ago

The assembly is 4.7mb, with a single 4.4mb contig which is your main chromosome. The rest is in bubbles which are likely rare variants. Even though the sample is clonal, there are still likely to be rare variants, in hifi mode canu is sensitive down to 1 in 10kbp. In contrast, Flye would only be sensitive down to 2 or 3 in 100 bp. The extra 300kb may be plasmid or difference so diverged they didn't get flagged as bubbles. Since you're getting more than the genome size with ONT data, I expect there may be some variation in the sample. You can remove all the bubbles from the output by ignoring entries marked as "suggestedBubble=yes". You could also randomly downsample the data to 30 or 40x which sufficiently downweight some of the variants.

-- Found, in version 2, after consensus generation:
--   contigs:      21 sequences, total length 4753054 bp (including 3 repeats of total length 24716 bp).
--   bubbles:      311 sequences, total length 5959191 bp.
--   unassembled:  2405 sequences, total length 10571342 bp.
--
-- Contig sizes based on genome size 4.411532mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     4427553             1     4427553
--     20     4427553             1     4427553
--     30     4427553             1     4427553
--     40     4427553             1     4427553
--     50     4427553             1     4427553
--     60     4427553             1     4427553
--     70     4427553             1     4427553
--     80     4427553             1     4427553
--     90     4427553             1     4427553
--    100     4427553             1     4427553
--
mbhall88 commented 4 years ago

Thank you for the thorough support @skoren - I really appreciate it.

Mtb doesn't have any plasmids. However, it has a repertoire of genes call PE/PPE:

Approximately 10 % of the Mycobacterium tuberculosis genome is made up of two families of genes that are poorly characterized due to their high GC content and highly repetitive nature. The PE and PPE families are typified by their highly conserved N-terminal domains that incorporate proline-glutamate (PE) and proline-proline-glutamate (PPE) signature motifs. They are hypothesised to be important virulence factors involved with host-pathogen interactions, but their high genetic variability and complexity of analysis means they are typically disregarded in genome studies.

Some of these genes are very similar so I wonder if maybe they are contributing to this somehow.

I will try rerunning with the suggestedBubbles parameter you suggested. :pray:

skoren commented 4 years ago

The bubbles isn't a parameter, it is already annotated in your contigs.fasta file, just look for it in the header line and skip that contig if it is a bubble so no need to re-run.

If you do re-run, you can try randomly downsampling to 30-40x (just take the first few reads from the fastq file) which I expect might help some of these too, this will be the default coverage used soon.