I am working with PGGB and am having trouble choosing good parameters for my dataset (plant genomes from three different species). My main focus is on -s (--segment-length) and -k (--min-match-len). I understand that these parameters can significantly affect both the performance and the accuracy of the analysis, but I am unsure how to optimize them.

As suggested in the user guide, I estimated the divergence with mash triangle to select -p (--map-pct-id; p=88 in my case), and I set -G (--poa-length) above the length of the transposon repeats in the pangenome.

I then tried different combinations of -s (50000, 80000, 100000) and -k (30, 47) while keeping -p and -G fixed, and used odgi degree to obtain statistics on the resulting graphs, as reported below:

| Parameters | node.count | edge.count | avg.degree | min.degree | max.degree |
|---|---|---|---|---|---|
| s50000_k30 | 14,518,207 | 19,984,480 | 2.75302 | 1 | 588 |
| s80000_k30 | 15,079,091 | 20,758,683 | 2.75331 | 1 | 526 |
| s100000_k30 | 14,986,719 | 20,620,628 | 2.75185 | 1 | 494 |
| s50000_k47 | 14,508,667 | 19,949,829 | 2.75006 | 1 | 455 |
| s80000_k47 | 15,083,722 | 20,740,873 | 2.7501 | 1 | 471 |
| s100000_k47 | 14,993,146 | 20,619,435 | 2.75051 | 1 | 439 |

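
As a quick sanity check on the numbers above, here is a minimal Python sketch that recomputes the average degree as 2 × edge.count / node.count and measures how much node.count varies between runs. The dictionary simply restates the table; the names `stats`, `mean_degree` and `node_spread` are made up for illustration. It only summarizes the table and does not by itself say which combination is best.

```python
# Values copied from the table above:
# (node.count, edge.count, avg.degree, max.degree)
stats = {
    "s50000_k30":  (14_518_207, 19_984_480, 2.75302, 588),
    "s80000_k30":  (15_079_091, 20_758_683, 2.75331, 526),
    "s100000_k30": (14_986_719, 20_620_628, 2.75185, 494),
    "s50000_k47":  (14_508_667, 19_949_829, 2.75006, 455),
    "s80000_k47":  (15_083_722, 20_740_873, 2.7501,  471),
    "s100000_k47": (14_993_146, 20_619_435, 2.75051, 439),
}


def mean_degree(nodes, edges):
    """Average node degree, counting both ends of every edge."""
    return 2 * edges / nodes


for name, (nodes, edges, avg, max_deg) in stats.items():
    print(f"{name:12s} 2E/N={mean_degree(nodes, edges):.5f} "
          f"reported={avg} max.degree={max_deg}")

# Relative difference between the largest and smallest node.count.
node_counts = [row[0] for row in stats.values()]
node_spread = (max(node_counts) - min(node_counts)) / min(node_counts)
print(f"relative spread in node.count: {node_spread:.2%}")
```

The recomputed values agree with the reported avg.degree column, so that column adds nothing beyond the two counts. Node counts differ by only about 4% across the six runs, and max.degree (439 to 588) is the statistic that moves the most.
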
How can I use these statistics to select the best combination of -s and -k? More generally, how should -s and -k be chosen in PGGB, and are there specific dataset characteristics I should consider when selecting these values? I would greatly appreciate any guidance or suggestions.
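
For reproducibility, the sweep described above could be scripted roughly as in the sketch below, which only builds and prints the pggb command lines without running them. The input file name, haplotype count (-n), thread count (-t) and the -G value are placeholders, not values from this post. Only -p 88 and the -s / -k grids come from the description above.

```python
import itertools
import shlex

# Placeholders (assumptions, NOT values from this post).
INPUT_FASTA = "pangenome.fa.gz"  # hypothetical input file name
N_HAPLOTYPES = 3                 # hypothetical value for -n
THREADS = 16                     # hypothetical value for -t
POA_LENGTH = "G_VALUE"           # the -G value is not stated in the post

# Values taken from the post.
MAP_PCT_ID = 88
SEGMENT_LENGTHS = [50000, 80000, 100000]
MIN_MATCH_LENS = [30, 47]


def build_commands():
    """Return one pggb command line per (-s, -k) combination."""
    commands = []
    for s, k in itertools.product(SEGMENT_LENGTHS, MIN_MATCH_LENS):
        out_dir = f"s{s}_k{k}"  # one output directory per combination
        cmd = [
            "pggb",
            "-i", INPUT_FASTA,
            "-o", out_dir,
            "-n", str(N_HAPLOTYPES),
            "-t", str(THREADS),
            "-p", str(MAP_PCT_ID),
            "-s", str(s),
            "-k", str(k),
            "-G", POA_LENGTH,
        ]
        commands.append(shlex.join(cmd))
    return commands


for line in build_commands():
    print(line)
```

Keeping -p and -G fixed in one place makes it easier to confirm that only -s and -k differ between the graphs being compared.
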
Thank you in advance.