pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License

Parameters optimization #360

Open omar-almolla209 opened 7 months ago

omar-almolla209 commented 7 months ago

I am working with PGGB and am having trouble determining the best parameters for my dataset (plant genomes from three different species). My main focus is on the -s (--segment-length) and -k (--min-match-len) parameters. I understand that these parameters can significantly affect the performance and accuracy of the analysis, but I am unsure how to optimize them for my data. As suggested in the user guide, I estimated the divergence using mash triangle to select the best -p (--map-pct-id; p=88 in my case), and I set -G (--poa-length) above the length of the transposon repeats in the pangenome. I still need help setting -s and -k. I tried different combinations of -s (50000, 80000, 100000) and -k (30, 47) while keeping -p and -G fixed, and then ran odgi degree on each resulting graph. The statistics are reported below:

| Parameters | #node.count | edge.count | avg.degree | min.degree | max.degree |
|---|---|---|---|---|---|
| s50000_k30 | 14,518,207 | 19,984,480 | 2.75302 | 1 | 588 |
| s80000_k30 | 15,079,091 | 20,758,683 | 2.75331 | 1 | 526 |
| s100000_k30 | 14,986,719 | 20,620,628 | 2.75185 | 1 | 494 |
| s50000_k47 | 14,508,667 | 19,949,829 | 2.75006 | 1 | 455 |
| s80000_k47 | 15,083,722 | 20,740,873 | 2.7501 | 1 | 471 |
| s100000_k47 | 14,993,146 | 20,619,435 | 2.75051 | 1 | 439 |
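
As a first look at the odgi degree statistics reported above, here is a minimal Python sketch that measures how much each statistic varies across the six runs. The relative-spread measure, the `runs` dictionary, and the `relative_spread` helper are illustrative choices of mine, not something pggb or odgi prescribes, and a small spread does not by itself say which graph is better.

```python
# Statistics copied from the odgi degree results above.
# Tuple order: (node.count, edge.count, avg.degree, min.degree, max.degree)
runs = {
    "s50000_k30":  (14_518_207, 19_984_480, 2.75302, 1, 588),
    "s80000_k30":  (15_079_091, 20_758_683, 2.75331, 1, 526),
    "s100000_k30": (14_986_719, 20_620_628, 2.75185, 1, 494),
    "s50000_k47":  (14_508_667, 19_949_829, 2.75006, 1, 455),
    "s80000_k47":  (15_083_722, 20_740_873, 2.7501,  1, 471),
    "s100000_k47": (14_993_146, 20_619_435, 2.75051, 1, 439),
}

FIELDS = ("node.count", "edge.count", "avg.degree", "min.degree", "max.degree")


def relative_spread(values):
    """Return (max - min) / mean for a sequence of numbers (hypothetical helper)."""
    values = list(values)
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


# Relative spread of every statistic across all parameter combinations.
spreads = {
    field: relative_spread(stats[i] for stats in runs.values())
    for i, field in enumerate(FIELDS)
}

for field, spread in spreads.items():
    print(f"{field:<12} relative spread: {spread:.2%}")
```

In this sketch the node and edge counts differ by only a few percent across runs, while max.degree varies far more, so these summary statistics on their own may not be enough to separate the parameter combinations.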

How can I leverage these statistics to select the best s and k parameter combination? Could anyone provide insights on how to choose the s and k parameters in pggb? Are there specific dataset characteristics I should consider when selecting these values? I would greatly appreciate any guidance or suggestions.

Thank you in advance.