marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Post-processing of uncollapsed assembly #1814

Closed rotoke closed 3 years ago

rotoke commented 3 years ago

Dear Sergey,

I have a quick question about the post-processing of an uncollapsed assembly (haplotypes separated, described in #1755 )

My original plan was to do some iterations of arrow polishing, followed by pilon, and then to purge haplotigs with purge_dups. However, it seems that polishing polyploid assemblies with arrow can scramble the haplotypes, so I'm looking at alternatives:

  1. Purge haplotigs first and then start with polishing. I tried both purge_dups and purge_haplotigs on my assembly with various parameter settings, but I can't get below ~16% duplication without losing genes.

  2. Replace arrow with racon. It seems like racon works on higher-ploidy assemblies, but it is apparently slightly less accurate than arrow, as it only uses base quality scores.

Do you have any recommendation on a sensible post-processing strategy for such assemblies?

Thank you very much! Roman

skoren commented 3 years ago

First, is your assembly based on CLR or HiFi data? If it's HiFi, I would suggest that no polishing at all is necessary, as the consensus will be more accurate than what you'd get with racon. If CLR, then I'd still suggest arrow, as it's more accurate than racon, but make sure to provide the entire assembly (not just the non-redundant contigs) for polishing.

As for purging, I recommend purge_dups since it can remove overlaps on the ends of contigs. Have you tried looking at the coverage graph it makes and picking manual thresholds for purging redundant contigs? That is typically what we need to do when it's not picking a good threshold automatically.

rotoke commented 3 years ago

Thank you for the fast reply!

Yes, this is CLR data, so I can't get around the polishing steps. If I understand correctly, you would first purge duplications and then somehow feed both files (i.e. the purged assembly and haplotigs) into arrow? Or first arrow, then purge_dups?

Unfortunately, the coverage graph didn't make threshold picking much easier:

[Image: Cutoffs_purgedups_PB]

I manually tried several thresholds and sensitivity levels, and ended up with 5, 67, and 321, leaving ca. 16% duplication. Increasing the middle threshold or lowering the sensitivity rate further reduces the duplication to ca. 10%, but also decreases the number of complete BUSCOs from 97% to 93%. Or is it more desirable to get rid of all duplication, even though some genes may be lost?

skoren commented 3 years ago

If the only steps you're doing are purging and arrow (no scaffolding), you can arrow the full assembly first and then purge. If you plan to do scaffolding, it makes sense to do a round of arrow at the end to fill gaps in the scaffolds, in which case you'd want to give arrow both the scaffolded primary and the alts as a combined fasta file and then split them back afterwards (the names don't change).
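Since arrow keeps contig names unchanged, splitting the polished combined fasta back into primary and alt sets is just a partition by name. A minimal sketch (hypothetical helper functions, assuming plain FASTA input):

```python
# Split a combined polished FASTA back into primary and alt sets using the
# contig names from the original (pre-polishing) primary assembly.
# Hypothetical sketch; assumes plain FASTA records with the name as the
# first whitespace-separated token of the header.

def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line[1:].split()[0], []
        else:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)

def split_assembly(records, primary_names):
    """Partition polished records into (primary, alt) dicts by name."""
    primary, alt = {}, {}
    for name, seq in records:
        (primary if name in primary_names else alt)[name] = seq
    return primary, alt
```

Here `primary_names` would be the set of contig names taken from the pre-polishing primary assembly; everything else falls back into the alt set.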

[Screenshot: example coverage histogram with cutoffs, 2020-10-09]

What is your expected genome coverage? Is that main peak the haploid or the diploid coverage? Generally you want to set the middle threshold at the end of the distribution for the haploid coverage (that is, the separated parts of the assembly). For example, in the plot attached, the diploid coverage is 18 and the haploid coverage is 36, so the peak is the haploid coverage and the little shoulder to the right is the diploid peak, which is almost not visible. I think this is similar to your case, in which case I would use a larger cutoff, probably close to 70; your 67 cutoff seems reasonable. The high threshold can probably be around 200-220. It is also possible that the extra duplication is because your genome has true duplication. You could try making KAT/Merqury plots using a short-read dataset, if you have one, to see the copy counts of the k-mers in your assembly. That can be another way to confirm whether the duplication is being purged correctly.
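The role of the three cutoffs can be illustrated with a toy classifier. This is only a simplified sketch of the idea behind purge_dups-style coverage thresholds (the function and labels are illustrative, not purge_dups' actual categories):

```python
def classify_contig(mean_cov, low, mid, high):
    """Toy illustration of coverage cutoffs: contigs below `low` are
    likely junk; between `low` and `mid` is haploid-depth sequence
    (separated haplotypes, candidates for moving out of the primary);
    between `mid` and `high` is diploid-depth (collapsed) sequence;
    above `high` is likely repeat."""
    if mean_cov < low:
        return "junk"
    if mean_cov < mid:
        return "haploid-depth"
    if mean_cov < high:
        return "diploid-depth"
    return "repeat"
```

With the cutoffs from this thread (5, 67, 201), a contig averaging 30x would land in the haploid-depth bin and a contig averaging 100x in the diploid-depth bin.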

rotoke commented 3 years ago

Thank you for the clarification! I (currently) don't have any data for scaffolding, so I will give Arrow and Pilon a try and then purge. I also realised that I have ~900 short transcripts that blast to insects and other contaminants, so I will remove them before polishing.

For example, in the plot attached, the diploid coverage is 18 and haploid coverage is 36 so the peak is the haploid coverage and the little shoulder to the right is the diploid peak

I'm sorry, but I got a bit confused here - shouldn't haploid be 18 and diploid 36? In my case, #bp raw reads / #bp raw assembly contigs (including bubbles) = ca. 58x, which I assume should then be the diploid peak (or 'almost diploid', as the haplotypes couldn't be disentangled everywhere). That doesn't match the peaks very well, but maybe there is a more accurate way of estimating coverage? (In #1755, we saw peaks at ~35-40x and 75-80x.)

Considering the evolutionary history of the organism, it is quite likely that there is true duplication in the genome. I will try to produce KAT plots to control for overpurging. Would it make sense to remove all the 'bubble' contigs first to make purging more efficient (as in #1805) or would that also delete tigs that should be retained?

skoren commented 3 years ago

Yeah, I see how that sentence is confusing; I think I switched the terms in the middle. I meant that 18x is the coverage of the 1-copy k-mers, that is, the coverage of the full genome assuming it's 2n-sized, while 36 is the coverage of the 2-copy k-mers, that is, the coverage of the n-sized genome. The main peak is the 1-copy k-mers, while the shoulder is the 2-copy k-mers. Hopefully that clarifies it.

I wouldn't rely on the assembly size to estimate coverage, since the assembly is likely a mix of 1-copy and 2-copy regions, and you'd need to double the bases in the 2-copy regions to get an accurate genome size. Rather, you can use something like GenomeScope, or look at the k-mer distribution from the Canu run, to estimate coverage and genome size.
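The coverage-to-genome-size sanity check is simple arithmetic: total sequenced bases divided by the coverage at a candidate peak gives the genome size that peak would imply. A sketch with made-up numbers (not values from this thread):

```python
def implied_genome_size(total_read_bp, peak_coverage):
    """Genome size implied by a coverage peak:
    size = total sequenced bases / coverage at the peak."""
    return total_read_bp / peak_coverage

# Hypothetical example: 60 Gbp of reads and a 75x peak imply an
# ~800 Mbp genome; the same reads at a 37.5x peak imply ~1.6 Gbp.
# Whichever implied size is biologically plausible tells you which
# peak is haploid and which is diploid.
```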

We don't normally remove the bubbles before purging, purging can handle them and it isn't going to make much difference in runtime. It's possible purge_dups would have an easier time selecting thresholds if you did remove the bubbles since that will make the two peaks more pronounced (rather than just having the separated haplotypes peak like you do now). So you could try that to see what thresholds get selected and compare them to your manually chosen ones.

rotoke commented 3 years ago

Ah that makes sense, thank you for clarifying.

Concerning coverage estimation: I didn't manage to get the k-mers from the assembly into GenomeScope (the model did not converge), and looking at the canu k-mer graph (from the unitigging step) isn't very helpful either:

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2 330440903 ********************************************************************** 0.2342 0.0063
--       3-     4 208840295 ********************************************                           0.3300 0.0102
--       5-     7 103528234 *********************                                                  0.4155 0.0152
--       8-    11  57887643 ************                                                           0.4689 0.0201
--      12-    16  43044280 *********                                                              0.5037 0.0250
--      17-    22  42727638 *********                                                              0.5322 0.0307
--      23-    29  48655704 **********                                                             0.5623 0.0390
--      30-    37  56782635 ************                                                           0.5969 0.0515
--      38-    46  64687226 *************                                                          0.6372 0.0703
--      47-    56  67720327 **************                                                         0.6829 0.0967
--      57-    67  66879314 **************                                                         0.7305 0.1303
--      68-    79  62789644 *************                                                          0.7773 0.1699
--      80-    92  56141627 ***********                                                            0.8212 0.2138
--      93-   106  47127164 *********                                                              0.8603 0.2595
--     107-   121  37186596 *******                                                                0.8930 0.3037
--     122-   137  27599712 *****                                                                  0.9188 0.3434
--     138-   154  19782979 ****                                                                   0.9379 0.3768
--     155-   172  13661939 **                                                                     0.9516 0.4038
--     173-   191   9604923 **                                                                     0.9610 0.4246
--     192-   211   6916157 *                                                                      0.9677 0.4410
--     212-   232   5196147 *                                                                      0.9725 0.4541
--     233-   254   4067253                                                                        0.9761 0.4649
--     255-   277   3276660                                                                        0.9790 0.4743
--     278-   301   2696765                                                                        0.9813 0.4825
--     302-   326   2247858                                                                        0.9832 0.4899
--     327-   352   1912707                                                                        0.9848 0.4966
--     353-   379   1627394                                                                        0.9861 0.5028
--     380-   407   1404697                                                                        0.9872 0.5084
--     408-   436   1216785                                                                        0.9882 0.5137
--     437-   466   1063030                                                                        0.9891 0.5185
--     467-   497    944246                                                                        0.9898 0.5231
--     498-   529    834727                                                                        0.9905 0.5274
--     530-   562    741874                                                                        0.9911 0.5315
--     563-   596    670554                                                                        0.9916 0.5354
--     597-   631    604643                                                                        0.9921 0.5391
--     632-   667    548811                                                                        0.9925 0.5426
--     668-   704    497591                                                                        0.9929 0.5460
--     705-   742    456746                                                                        0.9933 0.5492
--     743-   781    418970                                                                        0.9936 0.5524
--     782-   821    387625                                                                        0.9939 0.5554

However, I also have Illumina reads from the same plant individual, which I fed into purge_dups. I don't know whether this is technically correct at all, but my manual thresholds seem to match this data quite well:

[Image: Cutoffs_purgedups_illumina]

skoren commented 3 years ago

There's no reason to expect the thresholds for PacBio-mapped data to be consistent with Illumina data; the two datasets likely have different coverage and thus different thresholds. You'd want to set completely new thresholds for the Illumina-based purging. The red line can definitely be shifted higher. It's again not clear which peak is haploid and which is diploid (e.g. is the peak at 110x the separated part of the assembly and the peak at 220x the collapsed part, or is the 220x peak just repeats?). You can try to see which is which by estimating genome size from those coverages (total bp in Illumina / coverage) to see if it makes sense.

ASLeonard commented 3 years ago

Just offering my observations:

We don't normally remove the bubbles before purging, purging can handle them and it isn't going to make much difference in runtime. It's possible purge_dups would have an easier time selecting thresholds if you did remove the bubbles since that will make the two peaks more pronounced (rather than just having the separated haplotypes peak like you do now). So you could try that to see what thresholds get selected and compare them to your manually chosen ones.

purge_dups was struggling before to identify peaks, and often chose something closer to 10x as the peak. After dropping all suggestedBubble=yes lines, purge_dups then identified the peak at 28x (while true coverage is probably ~27x).
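Dropping the bubble contigs can be scripted by filtering on the header flag. A minimal sketch, assuming the `suggestBubble=yes` tag appears verbatim in the Canu contig headers:

```python
def drop_bubbles(fasta_lines):
    """Yield only the FASTA lines belonging to records whose header
    does not carry suggestBubble=yes (the flag Canu uses to mark
    candidate bubble contigs in its output headers)."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            keep = "suggestBubble=yes" not in line
        if keep:
            yield line
```

The filtered output can then be handed to purge_dups in place of the full assembly.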

There are some more assemblies underway, so I'll learn whether this was a happy coincidence or whether dropping bubbles enables more accurate use of purge_dups.

skoren commented 3 years ago

Idle; it seems like removing bubbles before running purge_dups can help it identify appropriate thresholds more easily than setting them manually.

rotoke commented 3 years ago

Hello and sorry for the long silence. We decided to do Hi-C scaffolding on the assembly, and thus had to go back to the drawing board. So just to double-check:

If you plan to do scaffolding, it makes sense to do a round of arrow at the end to fill gaps in the scaffolding in which case you'd want to give arrow both the scaffolded primary and the alts as a combined fasta file and then split them back afterwards

If I understand correctly, the workflow you proposed would look as follows (in a nutshell): assembly - arrow on unitigs+haplotigs - purging - scaffolding of unitigs - arrow on scaffolds+haplotigs?

Also, at which step would you include illumina polishing? Probably after scaffolding and the second round of arrow?

purge_dups was struggling before to identify peaks, and often chose something closer to 10x as the peak. After dropping all suggestedBubble=yes lines, purge_dups then identified the peak at 28x (while true coverage is probably ~27x).

I will try both strategies and will let you know what worked better with our data in the end.

Thank you very much for all your help!

skoren commented 3 years ago

Yes, your pipeline is correct; I'd polish with Illumina at the end, after the second round of arrow. This follows the VGP pipeline for CLR data as well. There are some new ways to tweak the polishing, namely Merfin (https://github.com/arangrhie/merfin), which filters the arrow corrections, but you would probably get decent results from arrow alone.

rotoke commented 3 years ago

Great, thank you very much! I already did two rounds of arrow on the raw assembly, which slightly improved the BUSCO score from 97.0% to 97.2% while decreasing the Illumina mapping error rate from 0.18% to 0.12%. I'll continue with purging and then send it off for scaffolding.

rotoke commented 3 years ago

Here are the results from purge_dups as promised. As suggested, I first estimated thresholds with and without bubbles:

Thresholds including bubbles: L5, M60, U180. [Image: Cutoffs_purgedups_lp2_withbubbles_PB]

Thresholds excluding bubbles: L5, M67, U201 - these are very similar to the ones I ended up with in my previous manual trials (L5, M67, U321). [Image: Cutoffs_purgedups_lrp2_nobubbles_PB]

I then tried purge_dups with both thresholds (stats and KAT plots attached):

Assembly before purging: 37336 tigs, of which 24600 suggested bubbles. BUSCO: C:97.2%[S:21.2%,D:76.0%],F:0.5%,M:2.3%,n:2326. [Image: lp2_unpurged-main mx spectra-cn_corrected]

Full assembly purged with L5, M60, U180: 6123 tigs, of which 2003 suggested bubbles. BUSCO: C:95.5%[S:70.9%,D:24.6%],F:0.7%,M:3.8%,n:2326. [Image: withbubbles_l5m60u180-main mx spectra-cn_corrected]

Full assembly purged with L5, M67, U201: 6096 tigs, of which 2007 suggested bubbles. BUSCO: C:95.4%[S:78.7%,D:16.7%],F:0.6%,M:4.0%,n:2326. [Image: withbubbles_l5m67u201_nomiddle-main mx spectra-cn_corrected]

Observations: At least for this data, estimating purging thresholds with bubbles removed seems to slightly improve the purging, but at the expense of losing a few more BUSCOs. However, there are still ca. 2000 suggested bubbles left in both assemblies after purging. Also, both assemblies show signs of residual duplication as well as overpurging (if I read the KAT plots correctly).

To see whether this makes a difference, I manually removed all suggested bubbles and re-did the purging:

Assembly with suggested bubbles removed, then purged with L5, M67, U201: 4271 tigs, of which 0 suggested bubbles. BUSCO: C:95.2%[S:77.3%,D:17.9%],F:0.7%,M:4.1%,n:2326. [Image: nobubbles_l5m67u201-main mx spectra-cn_corrected]

The BUSCO stats are slightly worse here than before, but the number of tigs is greatly reduced and there is slightly less overpurging (although still present). Alternatively, one could obviously also purge the full assembly and then remove the remaining bubbles.

I guess I can't do much about the loss of BUSCOs and the residual duplication/overpurging, but would you recommend manually removing suggested bubbles before (or after) purging if purge_dups doesn't filter them out completely?

Thanks a lot, and happy new year!

rotoke commented 3 years ago

Update: I manually removed the bubbles that remained in the assembly after purging and recalculated all BUSCO scores:

Purged without remaining bubbles removed (6096 tigs): C:95.4%[S:78.7%,D:16.7%],F:0.6%,M:4.0%,n:2326

Bubbles removed before purging (4271 tigs): C:95.2%[S:77.3%,D:17.9%],F:0.7%,M:4.1%,n:2326

Bubbles removed after purging (4089 tigs): C:94.3%[S:78.5%,D:15.8%],F:0.8%,M:4.9%,n:2326

So it seems like removing bubbles before purging results in slightly more conservative purging than purging the full assembly and removing the residual bubbles afterwards. However, both results are pretty similar to the one without bubbles removed.
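Comparisons like the one above are easier to script if the BUSCO one-line summaries are first parsed into numbers. A small sketch, assuming the standard short-summary format:

```python
import re

def parse_busco(summary):
    """Parse a BUSCO short-summary string, e.g.
    'C:95.4%[S:78.7%,D:16.7%],F:0.6%,M:4.0%,n:2326',
    into a dict of percentages (C, S, D, F, M) plus the
    total BUSCO group count n."""
    scores = {k: float(v)
              for k, v in re.findall(r"([CSDFM]):([\d.]+)%", summary)}
    scores["n"] = int(re.search(r"n:(\d+)", summary).group(1))
    return scores
```

With the summaries parsed, the duplication (D) and completeness (C) of the three purging strategies can be compared programmatically instead of by eye.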

Considering that remaining bubbles make up about 1/3 of the purged assembly, would you still remove them before scaffolding?

skoren commented 3 years ago

I wouldn't remove bubbles after purging: at that point you've let purge_dups define the bubbles, so what Canu had been considering a bubble may have been promoted to primary by purge_dups after it removed some other sequence. Thus, you could now be removing the only representative of some loci (which I would guess is why completeness is lowest in that case). From the stats, it seems like removing bubbles and then purging is the best strategy.

rotoke commented 3 years ago

Thank you for the quick and helpful reply!