Closed GuillaumeHolley closed 1 year ago
I have another question if you don't mind :) Could you point me to the CHM13 HSAT BED files you are using as input to the pipeline? The only BED file I could find is the satellite coordinates in the centromeres of CHM13 (which contain some HSAT) but I have no idea if this is what you are using too.
Hi @GuillaumeHolley Thanks for looking into it in detail.
- Regarding the coverage value in `fit_model.wdl`: you are required to pass a value in the outer workflow `flagger.wdl`. As you mentioned correctly, you can use the output of `calc_mean_sd.py` for finding a reasonable value.
- `split_cov_by_window` merges the last two windows of a contig if the last one is shorter than the given size (5 Mb in this case). For example, if we have a contig of length 17 Mb, we would have three windows with sizes 5, 5, and 7 Mb.
- Using `awk '~{minWindowSize} <= $3'` is for filtering short windows, which can happen only for contigs shorter than `~{minWindowSize}`. In other words, it excludes contigs shorter than `~{minWindowSize}`.
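The window-merging behaviour described above can be sketched in Python. The function name and exact handling here are my own illustration, not the pipeline's code; the real logic lives in `split_cov_by_window`:

```python
# Sketch of the windowing described above: a contig is cut into
# fixed-size windows and the trailing remainder is merged into the
# last window, so no window is shorter than the nominal size
# (except for contigs shorter than a single window).
def split_into_windows(contig_len, window=5_000_000):
    if contig_len <= window:
        # single short window; such contigs are later filtered out
        return [(0, contig_len)]
    n = contig_len // window
    windows = [(i * window, (i + 1) * window) for i in range(n)]
    # merge the remainder into the last window instead of emitting a short one
    windows[-1] = (windows[-1][0], contig_len)
    return windows
```

For a 17 Mb contig this yields windows of 5, 5, and 7 Mb, matching the example above; a contig shorter than one window yields a single short window, which the `awk '~{minWindowSize} <= $3'` filter then drops.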
These are the HSat coordinates I'm using as inputs (including only the arrays longer than 1 Mb):
https://storage.googleapis.com/masri/hprc/chm13_hsat_beds/t2t_cenAnnotation.v1.1.hsat1_gt_1mb.bed
https://storage.googleapis.com/masri/hprc/chm13_hsat_beds/t2t_cenAnnotation.v1.1.hsat2_gt_1mb.bed
https://storage.googleapis.com/masri/hprc/chm13_hsat_beds/t2t_cenAnnotation.v1.1.hsat3_gt_1mb.bed
They are all extracted from the censat annotation provided by the T2T consortium.
http://public.gi.ucsc.edu/~khmiga/t2t_cenAnnotation.v1.1.bed
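For reference, a filter producing per-satellite BEDs like the ones above could look roughly like this. It assumes the satellite type sits in BED column 4 (e.g. "hsat1"), which should be verified against the actual `t2t_cenAnnotation.v1.1.bed` before relying on it:

```python
# Illustrative sketch: keep only the arrays of a given satellite type
# that are longer than min_len, assuming the type name is in column 4.
def hsat_arrays(bed_lines, sat="hsat1", min_len=1_000_000):
    out = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        start, end = int(fields[1]), int(fields[2])
        if fields[3].lower().startswith(sat) and end - start > min_len:
            out.append(line)
    return out
```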
Hi @mobinasri ,
Thank you for your very thorough answer. So I think I got almost everything right with running the Flagger pipeline on my side. I picked the same annotations as you for the HSAT, but I used CHM13 v2 rather than v1.1, which should be fine too. However, when running Flagger on my assemblies, I used all HSAT annotations rather than just the ones longer than 1 Mb. I imagine this 1 Mb cutoff is due to the inability to obtain a smooth coverage distribution on shorter regions? Do you think this is going to be a problem for the final output of Flagger (using all HSAT rather than only the 1 Mb+ long ones)?
After running Flagger on my assemblies, I get around 50 Mbp of regions which are unreliable (collapsed, duplicated, erroneous and unknown) per assembly and on average which seems to be roughly in line with the number in the HPRC graph paper. However, if I filter my assemblies by keeping only the sub-contigs which are reliable (haploid), it breaks the contiguity of the assemblies quite a bit. Now, Minigraph only uses by default contigs which are at least 100 kb long when adding assemblies to a graph. For any of my assemblies, it seems that only 65% of the contigs are at least 100 kb long which means Minigraph would discard 35% of each assembly on average. In the HPRC Minigraph or Minigraph-Cactus graphs, do you know if the assemblies were added before or after Flagger-filtering?
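The contiguity loss described above can be quantified from a `samtools faidx` index of the filtered assembly. A rough sketch, with the `.fai` column layout (name in column 1, contig length in column 2) as the only assumption:

```python
# Estimate how much of an assembly minigraph would keep with its
# default 100 kb minimum contig length. fai_lines are lines of a
# samtools faidx index, where column 2 is the contig length.
def contig_retention(fai_lines, min_len=100_000):
    lengths = [int(line.split("\t")[1]) for line in fai_lines if line.strip()]
    kept = [l for l in lengths if l >= min_len]
    # fraction of contigs retained, and fraction of total bases retained
    return len(kept) / len(lengths), sum(kept) / sum(lengths)
```

Usage would be something like `contig_retention(open("assembly.fa.fai"))`, where the file name is a placeholder for the index of the Flagger-filtered assembly.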
Thank you! Guillaume
No assembly block was filtered using Flagger. It was mainly used for comparison with chm13v1.1. Regarding your point about small HSat blocks: I think it is better to include the shorter HSat blocks and to make a single coverage file per HSat. I recently observed that including only HSat blocks larger than 1 Mb may not be enough, since shorter blocks may also be susceptible to coverage biases.
Hi @mobinasri,
Thank you for the feedback, it is very much appreciated :) I think I've got everything I need so I'll close the issue.
Guillaume
Hi,
I skimmed through the WDL files recently and I have a handful of questions about the pipeline:
- The `fit_model` steps seem to have a hard-coded coverage of 20 used as a parameter of `fit_gmm.py`. Shouldn't they use the output of `calc_mean_sd.py` instead?
- The tasks converting `.cov` files to `.bed` use the following `awk` instruction: `if(substr($1,1,1) == ">") {contig=substr($1,2,40);`. Is this assuming that the assembly has contig names no longer than 40 characters?
- `split_cov_by_window`, which creates a coverage file for each 5 Mb window, also outputs a "summary" file `windows.txt`. This file seems to have a BED structure from what I saw: every line is a contig name, a start position and an end position. However, in task `findBlocksByWindow`, this file is parsed and filtered using the expression `cat ~{windowsText} | awk '~{minWindowSize} <= $3'`. Given this, is the 3rd field of each line in `windows.txt` the length of the window rather than the end position in the contig?

Thank you for your help.
Guillaume
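The 40-character concern raised in this comment can be checked with a quick sketch mimicking what `substr($1,2,40)` does to a FASTA header (illustrative only, not the pipeline's code):

```python
# Mimics awk's substr($1,2,40): take at most 40 characters of the
# contig name, starting after the leading ">". Any name longer than
# 40 characters would be silently truncated.
def contig_from_header(header, max_len=40):
    return header[1:1 + max_len]
```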