marschall-lab / panacus

Panacus is a tool for computing statistics for GFA-formatted pangenome graphs
MIT License
73 stars 4 forks source link

Discrepancy between graph length, reference length, and novel base pairs #25

Closed CCLittlefield closed 3 months ago

CCLittlefield commented 3 months ago

Hi,

I have created a pangenome using the minigraph-cactus pipeline in a manner very similar to how the HPRC created their pangenome (using T2T-CHM13 as the primary reference and including GRCh38 as an assembly), I then ran panacus and identified the amount of base pairs added per assembly, and in total. The total amount of basepairs added (coverage 1, quorum 0, last row) was 0.489 Gb. The length of the T2T-CHM13 reference is 3.11 Gb and the total length of the graph (determined using odgi stats) is 3.38 Gb. Am I wrong in thinking that the length of the reference, 3.11 Gb, plus the length of total bp added, 0.489 Gb, should equal the total length of the graph (3.38 Gb)? Or do I misunderstand? I've tried a variety of parameters and still get the same results. The code:

grep -e '^W' $gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' > corrected.paths.txt

grep -e 'GRCh38' corrected.paths.txt > corrected.paths.grch38.txt

grep -ive 'grch38|chm13' corrected.paths.txt > corrected.paths.haplotypes.txt

panacus ordered-histgrowth --count bp --order $order --exclude corrected.paths.grch38.txt --subset corrected.paths.haplotypes.txt --output-format table --coverage 1,2,4,67 --groupby-sample --threads $SLURM_CPUS_ON_NODE $gfa > $gfa.ordered-histgrowth.bp.tsv

danydoerr commented 3 months ago

From your code, it looks like you exclude GRCh38, and compute growth for the subset, omitting GRCh38 and CHM13. Then you reason about the size of the total graph, but taking CHM13 as baseline. Even if your subset omits CHM13, it still counts nodes/bps covered by CHM13 if they are covered also by haplotypes of the subset. So this is the difference between exclude and subset: The exclude option makes sure that no bp of the specified paths is ever counted, while the subset option does count bps covered by the subset.

danydoerr commented 3 months ago

In other words, if you exclude CHM13 rather than GRCh38, this should check out.

CCLittlefield commented 3 months ago

Thank you for your quick response, It seems I misunderstood the function of the subset and exclude options. I reran the analysis and all the numbers check out. Thanks again for your help.