Closed GuillaumeHolley closed 1 year ago
I have another question if you don't mind :) Could you point me to the CHM13 HSAT BED files you are using as input to the pipeline? The only BED file I could find is the satellite coordinates in the centromeres of CHM13 (which contain some HSAT) but I have no idea if this is what you are using too.
Hi @GuillaumeHolley Thanks for looking into it in detail.
- Regarding the coverage value in `fit_model.wdl`: you are required to pass a value in the outer workflow `flagger.wdl`. As you mentioned correctly, you can use the output of `calc_mean_sd.py` for finding a reasonable value.
- `split_cov_by_window` merges the last two windows of a contig if the last one is shorter than the given size (5 Mb in this case). For example, if we have a contig of length 17 Mb, we would have three windows with sizes 5, 5, and 7 Mb.
- Using `awk '~{minWindowSize} <= $3'` is for filtering short windows, which can happen only for contigs shorter than `~{minWindowSize}`. In other words, it excludes contigs shorter than `~{minWindowSize}`.
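The window-merging behaviour described above can be sketched in Python. The function name and exact handling here are my own illustration, not the pipeline's code; the real logic lives in `split_cov_by_window`:

```python
# Sketch of the windowing described above: a contig is cut into
# fixed-size windows and the trailing remainder is merged into the
# last window, so no window is shorter than the nominal size
# (except for contigs shorter than a single window).
def split_into_windows(contig_len, window=5_000_000):
    if contig_len <= window:
        # single short window; such contigs are later filtered out
        return [(0, contig_len)]
    n = contig_len // window
    windows = [(i * window, (i + 1) * window) for i in range(n)]
    # merge the remainder into the last window instead of emitting a short one
    windows[-1] = (windows[-1][0], contig_len)
    return windows
```

For a 17 Mb contig this yields windows of 5, 5, and 7 Mb, matching the example above; a contig shorter than one window yields a single short window, which the `awk '~{minWindowSize} <= $3'` filter then drops.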
These are the HSat coordinates I'm using as inputs (including only the arrays longer than 1 Mb):
https://storage.googleapis.com/masri/hprc/chm13_hsat_beds/t2t_cenAnnotation.v1.1.hsat1_gt_1mb.bed
https://storage.googleapis.com/masri/hprc/chm13_hsat_beds/t2t_cenAnnotation.v1.1.hsat2_gt_1mb.bed
https://storage.googleapis.com/masri/hprc/chm13_hsat_beds/t2t_cenAnnotation.v1.1.hsat3_gt_1mb.bed
They are all extracted from the censat annotation provided by the T2T consortium.
http://public.gi.ucsc.edu/~khmiga/t2t_cenAnnotation.v1.1.bed
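For reference, a filter producing per-satellite BEDs like the ones above could look roughly like this. It assumes the satellite type sits in BED column 4 (e.g. "hsat1"), which should be verified against the actual `t2t_cenAnnotation.v1.1.bed` before relying on it:

```python
# Illustrative sketch: keep only the arrays of a given satellite type
# that are longer than min_len, assuming the type name is in column 4.
def hsat_arrays(bed_lines, sat="hsat1", min_len=1_000_000):
    out = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        start, end = int(fields[1]), int(fields[2])
        if fields[3].lower().startswith(sat) and end - start > min_len:
            out.append(line)
    return out
```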
Hi @mobinasri ,
Thank you for your very thorough answer. So I think I got almost everything right with running the Flagger pipeline on my side. I picked the same annotations as you for the HSAT, but I used CHM13 v2 rather than v1.1, which should be fine too. However, when running Flagger on my assemblies, I used all HSAT annotations rather than just the ones longer than 1 Mb. I imagine this 1 Mb cutoff is due to the inability to obtain a smooth coverage distribution on shorter regions? Do you think this is going to be a problem for the final output of Flagger (using all HSAT rather than only the 1 Mb+ long ones)?
After running Flagger on my assemblies, I get around 50 Mbp of regions which are unreliable (collapsed, duplicated, erroneous and unknown) per assembly and on average which seems to be roughly in line with the number in the HPRC graph paper. However, if I filter my assemblies by keeping only the sub-contigs which are reliable (haploid), it breaks the contiguity of the assemblies quite a bit. Now, Minigraph only uses by default contigs which are at least 100 kb long when adding assemblies to a graph. For any of my assemblies, it seems that only 65% of the contigs are at least 100 kb long which means Minigraph would discard 35% of each assembly on average. In the HPRC Minigraph or Minigraph-Cactus graphs, do you know if the assemblies were added before or after Flagger-filtering?
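The contiguity loss described above can be quantified from a `samtools faidx` index of the filtered assembly. A rough sketch, with the `.fai` column layout (name in column 1, contig length in column 2) as the only assumption:

```python
# Estimate how much of an assembly minigraph would keep with its
# default 100 kb minimum contig length. fai_lines are lines of a
# samtools faidx index, where column 2 is the contig length.
def contig_retention(fai_lines, min_len=100_000):
    lengths = [int(line.split("\t")[1]) for line in fai_lines if line.strip()]
    kept = [l for l in lengths if l >= min_len]
    # fraction of contigs retained, and fraction of total bases retained
    return len(kept) / len(lengths), sum(kept) / sum(lengths)
```

Usage would be something like `contig_retention(open("assembly.fa.fai"))`, where the file name is a placeholder for the index of the Flagger-filtered assembly.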
Thank you! Guillaume
No assembly block was filtered using Flagger. It was mainly used for comparison with chm13v1.1. Regarding your point about small HSat blocks: I think it is better to include the shorter HSat blocks and to make a single coverage file per HSat. I recently observed that including only HSat blocks larger than 1 Mb may not be enough, since shorter blocks may also be susceptible to coverage biases.
Hi @mobinasri,
Thank you for the feedback, it is very much appreciated :) I think I've got everything I need so I'll close the issue.
Guillaume
Hi,
I skimmed through the WDL files recently and I have a handful of questions about the pipeline:
- The `fit_model` steps seem to have a hard-coded coverage of 20 used as a parameter of `fit_gmm.py`. Shouldn't they use the output of `calc_mean_sd.py` instead?
- The tasks converting `.cov` files to `.bed` use the following `awk` instruction: `if(substr($1,1,1) == ">") {contig=substr($1,2,40);`. Is this assuming that the assembly has contig names no longer than 40 characters?
- `split_cov_by_window`, which creates a coverage file for each 5 Mb window, also outputs a "summary" file `windows.txt`. This file seems to have a BED structure from what I saw: every line is a contig name, a start position and an end position. However, in task `findBlocksByWindow`, this file is parsed and filtered using the expression `cat ~{windowsText} | awk '~{minWindowSize} <= $3'`. Given this, is the 3rd field of each line in `windows.txt` the length of the window rather than the end position in the contig?

Thank you for your help.
Guillaume
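The 40-character concern raised in this comment can be checked with a quick sketch mimicking what `substr($1,2,40)` does to a FASTA header (illustrative only, not the pipeline's code):

```python
# Mimics awk's substr($1,2,40): take at most 40 characters of the
# contig name, starting after the leading ">". Any name longer than
# 40 characters would be silently truncated.
def contig_from_header(header, max_len=40):
    return header[1:1 + max_len]
```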