Open gunjanpandey opened 2 days ago
The larger size is expected; the assembly likely contains both haplotypes of a diploid genome (see https://canu.readthedocs.io/en/latest/faq.html#my-genome-size-and-assembly-size-are-different-help). You can see that about 500 Mbp are already flagged as bubbles (alternate haplotype). The rest is likely too diverged to be flagged automatically, so you'd need to rely on a tool like purge_dups. As for the fragmentation, the coverage looks really low in the k-mer histogram. The primary peak is between 6x and 10x, which is too low for a good assembly. What coverage were you inputting? Is this a clonal sample or a collection of individuals?
Thanks for the prompt reply, Sergey.
This genome has puzzled me quite a bit. Total input HiFi data is ~60x (assuming a ~1.2 Gbp genome size, although it could be around 2 Gbp).
The GenomeScope profile of the same organism from the short-read data is here: https://github.com/schatzlab/genomescope/issues/142
file                                     format  type  num_seqs   sum_len         min_len  avg_len   max_len
../01_Data/hifi_dedup_decontamianted.fq  FASTQ   DNA   4,122,639  73,888,444,677  90       17,922.6  63,566
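For reference, the "~60x" figure can be reproduced from the sum_len column above. This is only a rough sketch: the two genome sizes are the guesses mentioned in this thread (1.2 Gbp and 2 Gbp), not measured values, and the helper name `coverage` is made up for illustration.

```python
# Total HiFi bases, taken from the sum_len column of the stats table above.
TOTAL_BASES = 73_888_444_677


def coverage(total_bases: int, genome_size_bp: float) -> float:
    """Nominal read coverage = total sequenced bases / assumed genome size."""
    return total_bases / genome_size_bp


if __name__ == "__main__":
    # Both genome sizes are assumptions discussed in the thread, not measurements.
    for size_gbp in (1.2, 2.0):
        cov = coverage(TOTAL_BASES, size_gbp * 1e9)
        print(f"assumed genome size {size_gbp} Gbp -> {cov:.1f}x")
```

Under the 1.2 Gbp assumption this comes out a little above 60x; under the 2 Gbp assumption the same data is closer to 37x.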
Note this is a Cladocopium spp. sample, where the ploidy and duplication levels are not clear: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9412976/
Any thoughts on how to proceed would be very useful to me.
The genomescope results imply a larger genome than 1.2 Gbp but also that the haplotypes are extremely similar (if it is diploid) as there are very few single-copy k-mers. You'd probably benefit from a larger k-mer size like k=31 instead of 19 for genomescope.
The HiFi assembly implies an even larger genome size. The k-mer coverage is somewhere around 8x, and canu used 50x of 1.2 Gbp (about 60 Gbp of reads), which points to roughly 7 Gbp of total sequence, or a genome of about 3.5 Gbp if it is diploid. HiFi assembly is going to be very sensitive to variation, though, so it makes me wonder whether the samples used for the Illumina and HiFi data are the same. Is it possible the Illumina sample is more clonal than the HiFi sample? Either way, I'd increase either genomeSize or maxInputCoverage, since right now canu is only using 50x of 1.2 Gbp, so you have more data that was not used in the assembly. After that, your best option is probably to rely on core genes/purge_dups to determine whether there is haplotype duplication in the assembly. You could also try verkko and look at the resulting assembly graphs to see if there is diploid structure (though the result would likely be less continuous, as verkko only produces phased outputs while canu can produce a pseudo-haplotype).
Could you please share your interpretation of this log file for an algal assembly attempt, and suggest how to improve assembly contiguity for this highly heterozygous algal genome?
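The back-of-the-envelope estimate in this reply can be written out as below. It is a sketch under stated assumptions: that canu subsampled the reads to 50x of the 1.2 Gbp genomeSize, that the k-mer peak position approximates per-copy read coverage, and that the peak lies in the 6x to 10x range read off the histogram. The function name `implied_total_size` is hypothetical.

```python
GENOME_SIZE_BP = 1.2e9       # genomeSize passed to canu
INPUT_COVERAGE = 50          # coverage canu is described as using in this thread
TOTAL_BASES = 73_888_444_677 # sum_len of the deduplicated HiFi file in this thread


def implied_total_size(bases_used: float, kmer_peak: float) -> float:
    """Total sequence implied if the reads used sit at `kmer_peak` coverage."""
    return bases_used / kmer_peak


if __name__ == "__main__":
    bases_used = INPUT_COVERAGE * GENOME_SIZE_BP  # about 60 Gbp
    for peak in (6, 8, 10):  # range of the primary k-mer peak in the histogram
        total = implied_total_size(bases_used, peak)
        print(f"peak {peak}x -> {total / 1e9:.1f} Gbp total, "
              f"{total / 2e9:.2f} Gbp if diploid")
    # Reads left out of the assembly under these assumptions.
    print(f"unused: {(TOTAL_BASES - bases_used) / 1e9:.1f} Gbp")
```

At an 8x peak this gives 7.5 Gbp in total, in line with the rounded 7 Gbp and 3.5 Gbp figures quoted above. The unused-read figure is only indicative, because the assembly command reads a differently named input file than the one in the stats table.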
It is canu 2.2.
canu -assemble -p algae -d ./ genomeSize=1.2g -pacbio-hifi ../01_Data/hifi_decontamianted.fq useGrid=true gridOptions="--time=02-00:00:00 "
The assembly stats are below for reference. Note that the assembly size is quite large, as the expected genome size is around 1.2 Gbp.
Thanks a lot in advance.