mapleforest / HaploMerger2


N50 of assembly #2

Open danshu opened 7 years ago

danshu commented 7 years ago

Hi,

As mentioned in the manual, "HaploMerger2 is not suitable to work on the assemblies with a scaffold N50 size <50Kb; it barely works when the scaffold 100Kb>N50>50Kb; it works better when the scaffold N50>100Kb; and it works much better when scaffold N50>150Kb.". How did you arrive at this conclusion? As far as I know, N50 is just one aspect of assembly quality, and genome size also matters when comparing N50 values. An assembly with N50=50kb of a larger genome (e.g. 3Gb) will be more fragmented than an assembly with N50=50kb of a smaller genome (e.g. 300Mb). So N50=50kb means a different quality for different genome sizes, and a fixed threshold does not seem like a fair standard.
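To make the genome-size point concrete, here is a minimal Python sketch. It assumes a toy assembly made of equal-sized 50kb scaffolds, which is a simplification and not real data; the function names `n50` and `toy_assembly` are made up for illustration and are not part of HaploMerger2.

```python
def n50(lengths):
    """Return the N50: the length of the scaffold at which the running sum
    of scaffold lengths, sorted from longest to shortest, first reaches
    half of the total assembly size."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def toy_assembly(genome_size, scaffold_size):
    """Hypothetical assembly: the genome cut into equal-sized scaffolds."""
    return [scaffold_size] * (genome_size // scaffold_size)


small = toy_assembly(300_000_000, 50_000)    # 300Mb genome
large = toy_assembly(3_000_000_000, 50_000)  # 3Gb genome

# Same N50, but ten times as many pieces in the larger genome.
print(n50(small), len(small))  # 50000 6000
print(n50(large), len(large))  # 50000 60000
```

Both toy assemblies report the same N50, while the scaffold count differs tenfold, which is why N50 alone may not describe how fragmented an assembly is.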

Best, Danshu

mapleforest commented 7 years ago

Dear Danshu,

HM2 is based on the long-term sensitive alignments created with lastz (at BLAST-level sensitivity). Therefore, if there are many short scaffolds/contigs, there will be two problems:

  1. the number of false positive alignments rises,
  2. the improvement of the continuity is not very impressive.

The 50kb threshold is empirical.

However, unlike HM1, HM2 outputs both haploid assemblies and will not lose information (provided that you do not remove tandem mis-assemblies). So if you have an assembly with an N50 scaffold size >50kb, you can try HM2 anyway; just be careful with the result.

And whatever the N50 size is, you can still use HM2 to evaluate the haploid assembly.

Alternatively, before using HM2, you may want to first use a haplotype-aware de novo assembler to improve the assembly as much as possible.

Best regards, Shengfeng.

danshu commented 7 years ago

Thanks! Do you mean that for lastz_D, soft-mask and hard-mask are both OK?

I noticed that in stage 2 there is also a "breakingMode=2" parameter in hm.batchB3.haplomerger. So will misjoins also be removed in stage 2?

Best, Danshu

mapleforest commented 7 years ago

Dear Danshu,

We treat macro-misjoins in batchA and micro-misjoins in batchB. The reason for doing so is that micro-misjoins are more likely to be structural variations.

However, you may run batchA for several rounds with a decreasing flanking sequence length in order to remove more misjoins, for example 50kb, 40kb and 30kb, or even 10kb.

The broken-up scaffolds will be re-joined in the later scaffolding stage.

best regards, Shengfeng.

danshu commented 7 years ago

Thanks!