mobinasri / flagger

Evaluating genome assemblies
MIT License
54 stars 4 forks source link

A few questions about the WDL pipeline #5

Closed GuillaumeHolley closed 1 year ago

GuillaumeHolley commented 2 years ago

Hi,

I skimmed through the WDL files recently and I have a handful of questions about the pipeline:

Thank you for your help. Guillaume

GuillaumeHolley commented 2 years ago

I have another question if you don't mind :) Could you point me to the CHM13's HSAT BED files you are using in input of the pipeline? The only BED file I could find is the satellites coordinates in the centromeres of CHM13 (which contains some HSAT) but I have no idea if this is what you are using too.

mobinasri commented 2 years ago

Hi @GuillaumeHolley Thanks for looking into it in detail.

mobinasri commented 2 years ago

These are the HSat coordinates I'm using as inputs. (including only the arrays longer than 1mb) https://storage.googleapis.com/masri/hprc/chm13_hsat_beds/t2t_cenAnnotation.v1.1.hsat1_gt_1mb.bed https://storage.googleapis.com/masri/hprc/chm13_hsat_beds/t2t_cenAnnotation.v1.1.hsat2_gt_1mb.bed https://storage.googleapis.com/masri/hprc/chm13_hsat_beds/t2t_cenAnnotation.v1.1.hsat3_gt_1mb.bed They are all extracted from the censat annotation provided by the T2T consortium.
http://public.gi.ucsc.edu/~khmiga/t2t_cenAnnotation.v1.1.bed

GuillaumeHolley commented 1 year ago

Hi @mobinasri ,

Thank you for you very thorough answer. So I think I got almost everything right with running the Flagger pipeline on my side. I picked the same annotations as you for the HSAT but I used CHM13 v2 rather than v1.1 which should be good too. However, when running Flagger on my assemblies, I used all HSAT annotations rather than just the ones which are 1MB long. I imagine this 1MB length is due to the inability to obtain a smooth coverage distribution on shorter regions? Do you think this is going to be a problem for the final output of Flagger (using all HSAT rather than the 1Mbp+ long ones)?

After running Flagger on my assemblies, I get around 50 Mbp of regions which are unreliable (collapsed, duplicated, erroneous and unknown) per assembly and on average which seems to be roughly in line with the number in the HPRC graph paper. However, if I filter my assemblies by keeping only the sub-contigs which are reliable (haploid), it breaks the contiguity of the assemblies quite a bit. Now, Minigraph only uses by default contigs which are at least 100 kb long when adding assemblies to a graph. For any of my assemblies, it seems that only 65% of the contigs are at least 100 kb long which means Minigraph would discard 35% of each assembly on average. In the HPRC Minigraph or Minigraph-Cactus graphs, do you know if the assemblies were added before or after Flagger-filtering?

Thank you! Guillaume

mobinasri commented 1 year ago

No assembly block was filtered using Flagger. It was mainly used for comparing with chm13v1.1. Regarding your point about small HSat blocks. I think it is better to include shorter HSat blocks and make a single coverage file per HSat. I recently observed that including HSat blocks only larger than 1mb may not be enough since shorter blocks may also be susceptible to coverage biases.

GuillaumeHolley commented 1 year ago

Hi @mobinasri,

Thank you for the feedback, it is very much appreciated :) I think I've got everything I need so I'll close the issue.

Guillaume