Closed: rotoke closed this issue 3 years ago
First, is your assembly based on CLR or HiFi data? If HiFi, I would suggest no polishing at all, as the consensus will already be more accurate than what you'd get with racon. If CLR, I'd still suggest arrow, as it's more accurate than racon, but make sure to provide the entire assembly (not just the non-redundant contigs) for polishing.
As for purging, I recommend purge_dups since it can remove overlaps on the ends of contigs. Have you tried looking at the coverage graph it makes and picking manual thresholds for purging redundant contigs? That is typically what we need to do when it's not picking a good threshold automatically.
Thank you for the fast reply!
Yes, this is CLR data, so I can't get around the polishing steps. If I understand correctly, you would first purge duplications and then somehow feed both files (i.e. the purged assembly and haplotigs) into arrow? Or first arrow, then purge_dups?
Unfortunately, the coverage graph didn't make threshold picking much easier: I manually tried several thresholds and sensitivity levels and ended up with 5, 67, and 321, leaving ca. 16% duplication. Increasing the middle threshold or lowering the sensitivity further reduces the duplication to ca. 10%, but also decreases the number of complete BUSCOs from 97% to 93%. Or is it more desirable to get rid of all duplication, even though some genes may be lost?
If the only steps you're doing are purging and arrow (no scaffolding), you can arrow the full assembly first and then purge. If you plan to do scaffolding, it makes sense to do a round of arrow at the end to fill gaps introduced by scaffolding; in that case you'd want to give arrow both the scaffolded primary and the alts as a combined fasta file and then split them back afterwards (the names don't change).
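Because arrow keeps the input contig names, the combined polished file can be split back by name. A minimal Python sketch, assuming plain uncompressed FASTA and that you saved the list of primary contig names before merging; the file names and helper functions here are hypothetical:

```python
# Sketch: split an arrow-polished combined FASTA back into primary and alt
# files by contig name (arrow leaves the input names unchanged).
# Assumes plain FASTA; headers are reduced to the name token on output.

def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            else:
                seq.append(line)
        if name is not None:
            yield name, "".join(seq)

def split_polished(combined, primary_names, primary_out, alt_out):
    """Write records named in primary_names to primary_out, the rest to alt_out."""
    with open(primary_out, "w") as p, open(alt_out, "w") as a:
        for name, seq in read_fasta(combined):
            out = p if name in primary_names else a
            out.write(f">{name}\n{seq}\n")
```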
What is your expected genome coverage? Is that main peak the haploid or diploid coverage? Generally you want to set the middle threshold at the end of the distribution for the haploid coverage (that is, the separated parts of the assembly). For example, in the plot attached, the diploid coverage is 18 and haploid coverage is 36 so the peak is the haploid coverage and the little shoulder to the right is the diploid peak, which is almost not visible. I think this is similar to your case, in which case I would use a larger cutoff, probably close to 70; your 67 cutoff seems reasonable. The high threshold can probably be around 200-220. It is also possible the extra duplication is there because your genome has true duplication. You could try making KAT/Merqury plots with a short-read dataset, if you have one, to see the copy counts of the k-mers in your assembly. That can be another way to confirm whether the duplication is being purged correctly.
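To make the "end of the haploid distribution" idea concrete, here is a toy Python sketch that finds the two coverage peaks in a read-depth histogram and places the middle cutoff at the valley between them. This is only an illustration of the idea, not purge_dups' actual algorithm (calcuts has its own logic), and the histogram used below is synthetic:

```python
# Toy illustration: put the middle purge_dups-style cutoff at the valley
# between the haploid-coverage peak and the diploid-coverage peak.
# Assumes a roughly bimodal histogram with at least two local maxima.

def pick_mid_cutoff(hist, min_cov=5):
    """hist: dict mapping coverage -> count.
    Returns (low_peak, high_peak, mid_cutoff), where mid_cutoff is the
    least-populated coverage value between the two largest peaks."""
    covs = sorted(c for c in hist if c >= min_cov)

    def is_peak(c):
        return hist.get(c - 1, 0) < hist[c] >= hist.get(c + 1, 0)

    # two largest local maxima, by count
    peaks = sorted((c for c in covs if is_peak(c)),
                   key=lambda c: hist[c], reverse=True)[:2]
    lo, hi = sorted(peaks)
    # valley between the peaks
    mid = min(range(lo, hi + 1), key=lambda c: hist.get(c, 0))
    return lo, hi, mid
```

With peaks at, say, 36x and 72x this would place the middle cutoff in the dip near the mid-50s, matching the "end of the haploid peak" heuristic.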
Thank you for the clarification! I (currently) don't have any data for scaffolding, so I will give Arrow and Pilon a try and then purge. I also realised that I have ~900 short transcripts that blast to insects and other contaminants, so I will remove them before polishing.
> For example, in the plot attached, the diploid coverage is 18 and haploid coverage is 36 so the peak is the haploid coverage and the little shoulder to the right is the diploid peak
I'm sorry but I got a bit confused here - should not haploid be 18 and diploid 36? In my case, #bp raw reads / #bp raw assembly contigs (including bubbles) = ca. 58x, which I assume should then be the diploid peak (or 'almost diploid' as the haplotypes couldn't be disentangled everywhere). That wouldn't match very well with the peaks, but maybe there is a more accurate way of estimating coverage? (In #1755, we saw peaks at ~35-40x and 75-80x)
Considering the evolutionary history of the organism, it is quite likely that there is true duplication in the genome. I will try to produce KAT plots to control for overpurging. Would it make sense to remove all the 'bubble' contigs first to make purging more efficient (as in #1805) or would that also delete tigs that should be retained?
Yeah, I see how that sentence is confusing; I think I switched the terms in the middle. I meant that 18x is the coverage of the 1-copy k-mers, that is, the coverage of the full genome assuming it's 2n-sized, while 36x is the coverage of the 2-copy k-mers, that is, the coverage of the n-sized genome. The main peak is the 1-copy k-mers while the shoulder is the 2-copy k-mers. Hopefully that clarifies it.
I wouldn't rely on the assembly size to estimate coverage, since the assembly is likely a mix of 1-copy and 2-copy regions and you'd need to double the bases in the 2-copy regions to get an accurate genome size. Rather, you can use something like GenomeScope, or look at the k-mer distribution from the Canu run, to estimate coverage and genome size.
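As a rough illustration of the k-mer route (this is the textbook heuristic, not GenomeScope's mixture model): skip the low-occurrence error k-mers, take the remaining histogram peak as the coverage, and divide the total k-mer count by it. The histogram in the usage example is synthetic:

```python
# Textbook k-mer heuristic for coverage and genome size:
# ignore the error-k-mer spike at low occurrence, take the next peak as
# coverage, and estimate genome size as total k-mers / peak coverage.

def estimate_from_kmer_hist(hist, err_cut=None):
    """hist maps k-mer occurrence -> number of distinct k-mers.
    Occurrences <= err_cut are treated as sequencing errors; if err_cut
    is None, use the first local minimum of the histogram."""
    occs = sorted(hist)
    if err_cut is None:
        # first occurrence where the histogram starts rising again
        err_cut = next(c for c in occs[:-1] if hist[c] < hist.get(c + 1, 0))
    peak = max((c for c in occs if c > err_cut), key=lambda c: hist[c])
    total_kmers = sum(c * n for c, n in hist.items() if c > err_cut)
    return peak, total_kmers // peak  # (coverage, approximate genome size)
```

For a histogram with an error spike at 1-3x and a genomic peak at 30x, this returns 30x coverage and a genome size on the order of the distinct genomic k-mer count.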
We don't normally remove the bubbles before purging; purge_dups can handle them, and it isn't going to make much difference in runtime. It's possible purge_dups would have an easier time selecting thresholds if you did remove the bubbles, since that would make the two peaks more pronounced (rather than just having the separated-haplotypes peak like you do now). So you could try that to see what thresholds get selected and compare them to your manually chosen ones.
Ah that makes sense, thank you for clarifying.
Concerning coverage estimation: I didn't manage to get the k-mers from the assembly into GenomeScope (the model did not converge), and looking at the canu k-mer graph (from the unitigging step) isn't very helpful either:
[UNITIGGING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 0 0.0000 0.0000
-- 2- 2 330440903 ********************************************************************** 0.2342 0.0063
-- 3- 4 208840295 ******************************************** 0.3300 0.0102
-- 5- 7 103528234 ********************* 0.4155 0.0152
-- 8- 11 57887643 ************ 0.4689 0.0201
-- 12- 16 43044280 ********* 0.5037 0.0250
-- 17- 22 42727638 ********* 0.5322 0.0307
-- 23- 29 48655704 ********** 0.5623 0.0390
-- 30- 37 56782635 ************ 0.5969 0.0515
-- 38- 46 64687226 ************* 0.6372 0.0703
-- 47- 56 67720327 ************** 0.6829 0.0967
-- 57- 67 66879314 ************** 0.7305 0.1303
-- 68- 79 62789644 ************* 0.7773 0.1699
-- 80- 92 56141627 *********** 0.8212 0.2138
-- 93- 106 47127164 ********* 0.8603 0.2595
-- 107- 121 37186596 ******* 0.8930 0.3037
-- 122- 137 27599712 ***** 0.9188 0.3434
-- 138- 154 19782979 **** 0.9379 0.3768
-- 155- 172 13661939 ** 0.9516 0.4038
-- 173- 191 9604923 ** 0.9610 0.4246
-- 192- 211 6916157 * 0.9677 0.4410
-- 212- 232 5196147 * 0.9725 0.4541
-- 233- 254 4067253 0.9761 0.4649
-- 255- 277 3276660 0.9790 0.4743
-- 278- 301 2696765 0.9813 0.4825
-- 302- 326 2247858 0.9832 0.4899
-- 327- 352 1912707 0.9848 0.4966
-- 353- 379 1627394 0.9861 0.5028
-- 380- 407 1404697 0.9872 0.5084
-- 408- 436 1216785 0.9882 0.5137
-- 437- 466 1063030 0.9891 0.5185
-- 467- 497 944246 0.9898 0.5231
-- 498- 529 834727 0.9905 0.5274
-- 530- 562 741874 0.9911 0.5315
-- 563- 596 670554 0.9916 0.5354
-- 597- 631 604643 0.9921 0.5391
-- 632- 667 548811 0.9925 0.5426
-- 668- 704 497591 0.9929 0.5460
-- 705- 742 456746 0.9933 0.5492
-- 743- 781 418970 0.9936 0.5524
-- 782- 821 387625 0.9939 0.5554
However, I also have Illumina reads from the same plant individual, which I fed into purge_dups: I don't know whether this is technically correct at all, but my manual thresholds seem to match this data quite well:
There's no reason to expect the thresholds for PacBio-mapped data to be consistent with Illumina data; the two datasets likely have different coverage and thus different thresholds, so you'd want to set completely new thresholds for the Illumina-based purging. The red line can definitely be shifted higher. It's again not clear which peak is haploid and which is diploid (e.g. is the peak at 110x the separated part of the assembly and the peak at 220x the collapsed part, or is the 220x peak just repeats?). You can try to see which is which by estimating genome size from those coverages (total bp in Illumina / coverage) to see if it makes sense.
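That sanity check is a one-liner; the total Illumina yield below is a made-up placeholder, not the poster's actual number:

```python
# Implied genome size for each candidate peak coverage:
# genome_size ~ total sequenced bases / coverage.

def implied_genome_size(total_bp, coverage):
    return total_bp / coverage

total_illumina_bp = 60e9  # placeholder yield; substitute your own
for cov in (110, 220):    # the two candidate peaks
    size_mb = implied_genome_size(total_illumina_bp, cov) / 1e6
    print(f"{cov}x -> {size_mb:.0f} Mb")
```

Whichever implied size is closer to an independent genome-size estimate (e.g. from k-mers or flow cytometry) suggests which peak is the separated and which the collapsed one.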
Just offering my observations
> We don't normally remove the bubbles before purging, purging can handle them and it isn't going to make much difference in runtime. It's possible purge_dups would have an easier time selecting thresholds if you did remove the bubbles since that will make the two peaks more pronounced (rather than just having the separated haplotypes peak like you do now). So you could try that to see what thresholds get selected and compare them to your manually chosen ones.
purge_dups was struggling before to identify peaks, and often chose something closer to 10x as the peak. After dropping all suggestedBubble=yes lines, purge_dups then identified the peak at 28x (while true coverage is probably ~27x).
There are some more assemblies underway, so I'll soon know whether this was a happy coincidence or whether dropping bubbles enables more accurate use of purge_dups.
In any case, it seems like removing bubbles before running purge_dups can help it identify appropriate thresholds more easily than setting them manually.
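For anyone wanting to try the same thing, dropping the flagged bubbles before purging takes only a few lines of Python. The `suggestedBubble=yes` token follows the wording used in this thread; check the exact flag your Canu version writes into its FASTA headers:

```python
# Sketch: remove contigs whose FASTA header flags them as bubbles
# before running purge_dups. The header token is an assumption taken
# from this thread; adjust it to match your Canu output.

def drop_bubbles(in_fa, out_fa, token="suggestedBubble=yes"):
    """Copy in_fa to out_fa, skipping records whose header contains token."""
    keep = True
    with open(in_fa) as src, open(out_fa, "w") as dst:
        for line in src:
            if line.startswith(">"):
                keep = token not in line
            if keep:
                dst.write(line)
```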
Hello and sorry for the long silence. We decided to do Hi-C scaffolding on the assembly, and thus had to go back to the drawing board. So just to double-check:
> If you plan to do scaffolding, it makes sense to do a round of arrow at the end to fill gaps in the scaffolding in which case you'd want to give arrow both the scaffolded primary and the alts as a combined fasta file and then split them back afterwards
If I understand correctly, the workflow you proposed would look as follows (in a nutshell): assembly - arrow of unitigs+haplotigs - purging - scaffolding of unitigs - arrow of scaffolds+haplotigs?
Also, at which step would you include illumina polishing? Probably after scaffolding and the second round of arrow?
> purge_dups was struggling before to identify peaks, and often chose something closer to 10x as the peak. After dropping all suggestedBubble=yes lines, purge_dups then identified the peak at 28x (while true coverage is probably ~27x).
I will try both strategies and will let you know what worked better with our data in the end.
Thank you very much for all your help!
Yes, your pipeline is correct; I'd polish with Illumina at the end, after the second round of arrow. This follows the VGP pipeline for CLR data as well. There are some new ways to tweak the polishing, namely Merfin (https://github.com/arangrhie/merfin), which filters the arrow corrections, but you would probably get decent results from arrow alone.
Great, thank you very much! I already did two rounds of arrow on the raw assembly, which slightly improved the BUSCO score from 97.0% to 97.2% while decreasing the illumina mapping error rate from 0.18% to 0.12%. I'll continue with purging and then send it off for scaffolding.
Here are the results from purge_dups as promised. As suggested, I first estimated thresholds with and without bubbles:
Thresholds including bubbles: L5, M60, U180
Thresholds excluding bubbles: L5, M67, U201 - these are very similar to the ones I ended up with my previous manual trials (L5, M67, U321)
I then tried purge_dups with both thresholds (stats and KAT plots attached):
Assembly before purging: 37336 tigs, of which 24600 are suggested bubbles. BUSCO: C:97.2%[S:21.2%,D:76.0%],F:0.5%,M:2.3%,n:2326
Full assembly purged with L5, M60, U180: 6123 tigs, of which 2003 are suggested bubbles. BUSCO: C:95.5%[S:70.9%,D:24.6%],F:0.7%,M:3.8%,n:2326
Full assembly purged with L5, M67, U201: 6096 tigs, of which 2007 are suggested bubbles. BUSCO: C:95.4%[S:78.7%,D:16.7%],F:0.6%,M:4.0%,n:2326
Observations: at least for these data, estimating purging thresholds with bubbles removed seems to improve the purging slightly, but at the expense of losing a few more BUSCOs. However, there are still ca. 2000 suggested bubbles left in both assemblies after purging. Also, both assemblies show signs of residual duplication as well as overpurging (if I read the KAT plots correctly).
To see whether this makes a difference, I manually removed all suggested bubbles and re-did the purging:
Assembly with suggested bubbles removed, then purged with L5, M67, U201: 4271 tigs, of which 0 are suggested bubbles. BUSCO: C:95.2%[S:77.3%,D:17.9%],F:0.7%,M:4.1%,n:2326
The BUSCO stats are slightly worse here than before, but the number of tigs is greatly reduced and there is slightly less overpurging (although still present). Alternatively, one could obviously also purge the full assembly and then remove the remaining bubbles.
I guess I can't do much about the loss of BUSCOs and the residual duplication / overpurging, but would you recommend to manually remove suggested bubbles before (or after) the purging if purge_dups doesn't filter them out completely?
Thanks a lot, and happy new year!
Update: I manually removed the bubbles that remained in the assembly after purging and recalculated all BUSCO scores:
Purged without remaining bubbles removed (6096 tigs):
C:95.4%[S:78.7%,D:16.7%],F:0.6%,M:4.0%,n:2326
Bubbles removed before purging (4271 tigs):
C:95.2%[S:77.3%,D:17.9%],F:0.7%,M:4.1%,n:2326
Bubbles removed after purging (4089 tigs):
C:94.3%[S:78.5%,D:15.8%],F:0.8%,M:4.9%,n:2326
So it seems like removing bubbles before purging results in slightly more conservative purging than purging the full assembly and removing the residual bubbles afterwards. However, both results are pretty similar to the one without bubbles removed.
Considering that remaining bubbles make up about 1/3 of the purged assembly, would you still remove them before scaffolding?
I wouldn't remove bubbles after purging: at that point you have already let purge_dups define the bubbles, so what Canu considered a bubble may have been promoted to primary by purge_dups when it removed some other sequence. Thus, you could now be removing the only representative of some loci (which I would guess is why the completeness is lowest for that case). From the stats, it seems like removing bubbles and then purging is the best strategy.
Thank you for the quick and helpful reply!
Dear Sergey,
I have a quick question about the post-processing of an uncollapsed assembly (haplotypes separated, as described in #1755).
My original plan was to do some iterations of arrow polishing, followed by pilon and then purge haplotigs with purge_dups. However, it seems like polishing polyploid assemblies with arrow would scramble up haplotypes, so I'm looking for alternatives:
1. Purge haplotigs first and then start polishing. I tried both purge_dups and purge_haplotigs on my assembly with various parameter settings, but I can't get below ~16% duplication without losing genes.
2. Replace arrow with racon. It seems like racon works on higher-ploidy assemblies, but is apparently slightly less accurate than arrow, as it only uses base quality scores.
Do you have any recommendation on a sensible post-processing strategy for such assemblies?
Thank you very much! Roman