I think the parameters are reasonable and the report looks OK. The NG50 is also possibly not too bad: the report uses 2.3 Gbp as the genome size for the calculation, so if you used 1.6 Gbp the NG50 would go up, and it would also go up after purge_dups (confirm it picks a reasonable threshold when you run it). I wouldn't expect turning off haplotype separation to help.
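(A minimal sketch, not from the thread, of why the NG50 moves with the assumed genome size: NG50 is the length of the contig at which contigs taken longest-first cover half of that assumed size, so a smaller size gives a larger NG50. The contig lengths below are purely hypothetical.)

# Why the NG50 depends on the assumed genome size: NG50 is the length of the
# contig at which contigs, taken longest-first, cover half of that assumed size.
def ng50(contig_lengths, genome_size):
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= genome_size / 2:
            return length
    return None  # assembly is shorter than half the assumed genome size

# Toy, hypothetical contig lengths (bp):
contigs = [5_000_000, 3_000_000, 2_000_000, 1_000_000, 500_000] * 200
print(ng50(contigs, 2_300_000_000))  # 3,000,000 with the larger assumed size
print(ng50(contigs, 1_600_000_000))  # 5,000,000 with the smaller assumed size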
The histogram has a single peak around 55-60x coverage, which would imply a 1.7 Gbp genome, not a 2.3 Gbp genome (with a low amount of sequence in the homozygous part of the histogram). Given that, plus the GenomeScope result and the very complete BUSCOs, I think the flow cytometry may be off. It is possible there are some very repetitive sequences which got collapsed (and are also being collapsed by the GenomeScope analysis). You could try to look for this by checking the coverage of the contigs output by Canu (the *contigs.layout.tigInfo file has a coverage column) and estimating collapsed bases (e.g. if a contig is at 100x but the median is 50x, it should have had twice the bases had it not been collapsed). I think you would be unlikely to assemble those given your relatively short reads, though, so this may be as good as you can do.
Thank you very much for the fast and detailed reply Sergey!
Good to know that the assembly parameters are ok; I'll try to get new flow cytometry data, which will hopefully resolve the mystery. Does canu have an option to re-calculate the stats with different genome size estimates?
Regarding the estimation of collapsed bases in repeats: this is the head of the *contigs.layout.tigInfo file (sorted by coverage).
There are suggested repeats among the contigs, but not with such deep coverage:
If I understand correctly, the 'coverage' column shows the mean (or median?) base coverage per contig. This may be a stupid question, but where do I find the second value for the estimation (i.e. contig coverage after collapsing)? I assume this could be calculated by length('NumChildren')/'tigLen', but for this I'd need to know the individual length of each read?
Thank you again and all the best, Roman
I wouldn't rely on the repeat annotation. The coverage is the sum of bases in the reads divided by the tig length. Your first list has some deep contigs; for example, assuming the median coverage of all your contigs is 50x, the first tig alone should represent over 1 Mbp of sequence despite being only 100 kbp long. So if you take all the contigs, multiply their lengths by the ratio of their coverage to the median, sum them up, and compare that to their actual lengths, it will give you a sense of how many more bases you could have if all of those repeats were resolved.
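(A minimal sketch of this calculation; the column names 'tigLen' and 'coverage' are taken from the thread, so check the header of the tigInfo file your Canu version writes.)

# Rough estimate of "extra" collapsed bases from Canu's *contigs.layout.tigInfo.
import statistics
import sys

tigs = []
with open(sys.argv[1]) as fh:
    header = fh.readline().lstrip("#").split()
    len_i, cov_i = header.index("tigLen"), header.index("coverage")
    for line in fh:
        fields = line.split()
        tigs.append((int(fields[len_i]), float(fields[cov_i])))

median_cov = statistics.median(cov for _, cov in tigs)

# A tig at twice the median coverage likely holds roughly twice its bases;
# the difference between implied and actual length is the "extra" sequence.
extra = sum(length * (cov / median_cov - 1)
            for length, cov in tigs if cov > median_cov)

print(f"median coverage: {median_cov:.2f}x")
print(f"estimated extra (collapsed) bases: {extra / 1e6:.1f} Mbp")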
Thank you very much for the clarification. The median coverage of all contigs is only 16.61x, so if I sum up the 'extra bases' of, e.g., all the contigs with a three-digit coverage, I get an additional 247 Mbp, which would give an effective genome size of 2n ≈ 1.9 Gbp. I don't really know where to set the threshold, though: if I sum up the 'extra bases' for all contigs above the median I get >4 Gbp, which obviously makes no sense. I assume one could look into the FASTA file and only include deep-coverage contigs that actually contain repeats?
One last thing: I understand that I won't be able to assemble these repeats at full length given the short reads I have. However, I also increased repeat suppression (as in the FAQ) to save disk space during the assembly. Would setting this back to the default improve the resolution here?
Thank you again, Roman
To answer your question about changing the genome size, you can run:
tgStoreDump \
-S *.seqStore \
-T unitigging/*.ctgStore 2 \
-sizes -s 1700000000
for 1.7 Gbp or any other size you want to use as the denominator.
I guess you want to get a median of the reliable contigs (e.g. those over some minimum length like 100 kbp or 1 Mbp). The short, low-coverage contigs are probably skewing the median now. You can look at coverage versus length on a plot and see if there is a threshold that makes sense.
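(A quick plotting sketch for this coverage-versus-length check, under the same tigInfo column assumptions as above; the 100 kbp line is just an example cutoff.)

# Coverage versus length for all tigs, to eyeball a minimum-length cutoff
# for "reliable" contigs.
import sys
import matplotlib.pyplot as plt

lengths, coverages = [], []
with open(sys.argv[1]) as fh:
    header = fh.readline().lstrip("#").split()
    len_i, cov_i = header.index("tigLen"), header.index("coverage")
    for line in fh:
        fields = line.split()
        lengths.append(int(fields[len_i]))
        coverages.append(float(fields[cov_i]))

plt.scatter(lengths, coverages, s=2, alpha=0.5)
plt.xscale("log")
plt.xlabel("tig length (bp)")
plt.ylabel("tig coverage (x)")
plt.axvline(100_000, color="red", linestyle="--", label="example 100 kbp cutoff")
plt.legend()
plt.savefig("coverage_vs_length.png", dpi=150)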
The repeat suppression in correction is probabilistic, unlike typical approaches. It will still find repeat overlaps, it will just very likely not pick repetitive seeds and the parameters decrease this likelihood further.
Dear Sergey,
I've plotted that quickly:
If I take e.g. 35x (the red line) as a threshold, I end up with ~930 Mbp 'extra' and a total of ~2.59 Gbp, which is already relatively close to the expected range. Let's see what the new flow cytometry data will look like...
I'll close the issue now - thank you so much for all your input!
All the best, Roman
You're not trying to pick a cutoff as you would on a histogram; you're looking for the coverage at which most of the bases in the genome sit. It looks like the peak is slightly below 100x, maybe 60-75x. That is consistent with a smaller genome size (107269233773 / 70 ≈ 1.5 Gbp). Then take all the tigs >100x and count the extra bases in them (assuming an expected coverage of 70x); it looks like there isn't much there since they're all pretty short. This all seems consistent with the GenomeScope results and the assembly sizes, so I think the cytometry data is over-estimating the size.
Ah sorry, that cutoff doesn't make any sense indeed. I'm still confused though, so please correct me where I'm wrong:
This is a histogram showing the total number of bases per coverage for 'reliable' tigs ≥100 kbp long (I've binned all tigs ≥100 kbp in 5x steps and summed up tigLen within each bin; the graph is cut at 200x). Here, the coverage of most bases is still at ~45x rather than 70x. I think the first plot is a bit misleading, as the longest tigs are indeed within 60-75x, which seems to mask the true peak? This would give a genome size estimate of 107269233773 / 45 ≈ 2.38 Gbp.
I indeed don't get much more if I sum up the extra bases of all tigs >100x, even if I use 45x as the threshold (ca. 74 Mbp).
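(A sketch of the binning described above: tigs ≥100 kbp grouped into 5x-wide coverage bins, summing tigLen per bin; same tigInfo column assumptions as before.)

# Total bases per 5x-wide coverage bin for tigs >= 100 kbp.
import sys
from collections import defaultdict

BIN = 5            # coverage bin width (x)
MIN_LEN = 100_000  # only "reliable" tigs

bases_per_bin = defaultdict(int)
with open(sys.argv[1]) as fh:
    header = fh.readline().lstrip("#").split()
    len_i, cov_i = header.index("tigLen"), header.index("coverage")
    for line in fh:
        fields = line.split()
        length, cov = int(fields[len_i]), float(fields[cov_i])
        if length >= MIN_LEN:
            bases_per_bin[int(cov // BIN) * BIN] += length

for cov_bin in sorted(bases_per_bin):
    print(f"{cov_bin:>4}-{cov_bin + BIN}x\t{bases_per_bin[cov_bin] / 1e6:.1f} Mbp")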
When you say 2.3 Gbp for the cytometry, is that for both haplotypes (that is, the human genome would be 6 Gbp) or is that the haploid size? I was interpreting it as the haploid size, but re-reading your original message it seems like it is both haplotypes. In that case, it doesn't seem so far off the assembly you have. The full genome isn't going to be separated, since some regions have lower divergence than the overall 2.3% estimate, so you shouldn't expect to get a 2.3 Gbp assembly (just like you don't get a 6 Gbp human assembly from CLR data).
This plots the total bases at a given coverage? It seems like there may be two peaks, around 40-50x and 80-90x, which would make sense as heterozygous and homozygous contigs. The longest contigs would then be homozygous, shared between the haplotypes but represented once. The total assembly size is 1.7 Gbp and, based on BUSCO, it's missing about 25% of the second haplotype (76% duplicated). How many bases do you have between the first peak and the second peak? Those are likely contigs that should be twice their length had they been separated, and I expect they will account for about 300-400 Mbp of sequence. All together, the pseudo-haplotype contigs + the collapsed homozygous contigs + the repeats will come close to the 2.3 Gbp estimate you have, I think.
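(A sketch for the 'bases between the peaks' count; the peak boundaries below are placeholders to be read off the actual histogram, and the tigInfo column assumptions are the same as before.)

# Bases in reliable tigs whose coverage falls between the heterozygous and
# homozygous peaks; these contigs would roughly double in length if the
# haplotypes had been separated.
import sys

HET_PEAK, HOM_PEAK = 50.0, 80.0   # read these off your own coverage histogram
MIN_LEN = 100_000

between = 0
with open(sys.argv[1]) as fh:
    header = fh.readline().lstrip("#").split()
    len_i, cov_i = header.index("tigLen"), header.index("coverage")
    for line in fh:
        fields = line.split()
        length, cov = int(fields[len_i]), float(fields[cov_i])
        if length >= MIN_LEN and HET_PEAK < cov < HOM_PEAK:
            between += length

print(f"bases between the peaks: {between / 1e6:.0f} Mbp")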
Yes, 2.3 Gbp is for 2n / the full diploid genome; I'm very sorry if that didn't come across clearly from the beginning. I didn't think the assembly size was completely off either, I was just wondering where the missing 30% got lost.
Yes, the plot shows total bases in each 5x-wide coverage bin. I count around 170 Mbp between the peaks, but this is probably an underestimate as the bins are relatively broad.
It sounds plausible that the missing 30% is the cumulative result of collapsing both homozygous segments and repeats. I think I'll take this assembly further (despite the lower NG50) and see what purge_dups, Illumina polishing, etc. will do with it.
Thank you again for all your help!
Nice work @rotoke, I feel your pain with such complex genomes. If I can add my two cents here: I am facing a similar problem and see some commonalities with your case.
I am assembling a 2.5 Gb (haploid size) highly heterozygous plant genome with ~70x (haploid genome) ONT coverage, so ~35x per haplotype. I had to ditch Canu due to the large size of the intermediate files and the long computation time on the resources I have, but Flye worked well (in particular, no haplotype switches within contigs, which is what everyone should aim for) and gave me a 3.6 Gb assembly (out of 5 Gb), N50 680 kb, N90 130 kb, which I am quite satisfied with. After long- and short-read polishing, the BUSCO score is 95% complete (29% single, 66% duplicated), so also good.
To get a more granular view of what is collapsed and what is not, I realigned the raw reads, calculated coverage in 10 kb windows, and plotted it. I get a bimodal curve with a smaller peak at 2x the coverage of the first, spanning about 400 Mb (which I would be very happy to get back). So now I am focusing on those (regions of the) contigs that sit around that higher-coverage peak, trying to re-assemble them and recover twice as much sequence. The bad news is that neither Canu (also with the diploid settings), nor NECAT, nor Flye (--keep_haplotypes) can improve the total bp of the contigs I want to re-assemble. I am running out of options for splitting those sequences; they may simply be too homozygous.
@skoren: do you think that using --haplotypes, maybe with settings more stringent than the defaults, might help to get allele-specific corrected reads? How about duplicating (copy-pasting) those collapsed (regions of) contigs, adding them to the assembly, and aligning reads in the hope that they will split into two groups? Would the 10x Chromium linked reads be of any help for building two different haplotypes on the reference so the long reads map to them more specifically?
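(A sketch of the windowed-coverage check described above, assuming a per-base depth table such as the three-column output of samtools depth -a, i.e. contig, position, depth; the window size and file format are assumptions.)

# Mean coverage in 10 kb windows from a per-base depth table.
import sys
from collections import defaultdict

WINDOW = 10_000
depth_sum = defaultdict(int)   # (contig, window index) -> summed depth
positions = defaultdict(int)   # (contig, window index) -> positions seen

with open(sys.argv[1]) as fh:
    for line in fh:
        contig, pos, depth = line.split()
        key = (contig, (int(pos) - 1) // WINDOW)
        depth_sum[key] += int(depth)
        positions[key] += 1

for (contig, idx), total in sorted(depth_sum.items()):
    start = idx * WINDOW
    mean_depth = total / positions[(contig, idx)]
    print(f"{contig}\t{start}\t{start + WINDOW}\t{mean_depth:.1f}")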
--haplotype is for when you have trio information, so it wouldn't help here. You shouldn't expect to be able to assemble just these reads into two bins since they didn't get separated before; the raw reads are likely too noisy to discriminate haplotype differences from random error. You could potentially use some other information to try to bin the reads; there are some tools that work with Strand-seq data. You could also try to make something to split them using the 10x reads; this is the strategy used in the Korean reference genome paper: https://www.nature.com/articles/nature20098.
Dear Sergey,
I just finished a Canu 2.0 assembly of a diploid, 2.3 Gbp PacBio plant genome, which turned out to be quite fragmented and much shorter than expected. There are some issues with genome features and read length, and it's clear that I won't end up with chromosome-level contigs. However, this is my first genome assembly, and I would be very happy to get your input on the Canu parameter set and the results before I start thinking about re-sequencing.
Organism
This is a diploid desert plant grown from wild-collected seed. A GenomeScope run with Illumina data suggests ~2.6% heterozygosity, ~50% repeat content, and a genome size of 2n ~1.4 Gbp. However, flow cytometry analysis suggests a genome size in the range of 2n ~2.3 Gbp.
Data
I have ~60x PacBio Sequel I data. Unfortunately, the raw read N50 is only around 8.9 kbp, with very few reads >10 kbp. The sequencing company was unable to troubleshoot the run, so I had to take what I got.
Canu command
I used the suggested parameters for PacBio Sequel I data and for high heterozygosity (separating haplotypes). The other parameters (suppressing repeats, etc.) had to be added because of disk space issues and our unusual cluster configuration.
Results
This is the complete .report file:
This is the output of an initial BUSCO v4.0.6 run. I'll try to use purge_dups to get rid of the duplication as a next step.
Questions
Did I make any errors in Canu parameter choice, or are there any parameters I could tweak to improve assembly contiguity?
Do you have an explanation for the shorter-than-expected assembly length? Has Canu perhaps collapsed some homozygous regions, or is our flow cytometry data off?
Could re-running the assembly without haplotype separation improve the result?
Thank you very much for your help!
All the best, Roman