Setting -n to number of genomes, or number of genomes minus one?

pangenome / pggb

the pangenome graph builder

https://doi.org/10.1038/s41592-024-02430-3

MIT License


Setting -n to number of genomes, or number of genomes minus one? #227

Closed: henrivkgt closed this issue 2 years ago

henrivkgt commented 2 years ago

Hello,

I am trying to run the pggb tool on a set of six cucumber genomes. One thing that is not completely clear to me is how to set the -n parameter. On one doc page (https://pggb.readthedocs.io/en/latest/rst/tutorials/sequence_partitioning.html), a set of 7 yeast genomes is used, with -n representing the number of mappings per locus (which is 6, i.e. the number of genomes minus one). On others, such as the quick start (https://pggb.readthedocs.io/en/latest/rst/quick_start.html), -n seems to be set to the number of genomes (so not minus one).

Would you be able to clear this up for me?

Thanks in advance, Henri

AndreaGuarracino commented 2 years ago

Hi @henrivkgt,

Those two -n options are not the same parameter. In the first example (https://pggb.readthedocs.io/en/latest/rst/tutorials/sequence_partitioning.html), -n is a parameter of wfmash (the sequence aligner we use in pggb). In the second example (https://pggb.readthedocs.io/en/latest/rst/quick_start.html), -n is a parameter of pggb itself.

Since they have the same name, I wonder whether we should make the two -n parameters behave the same way from the user's side (hiding the minus-one adjustment) to avoid similar confusion in the future.

henrivkgt commented 2 years ago

Thank you, that makes sense.

ekg commented 2 years ago

This is a bad documentation bug. The tutorial isn't in sync with the code. pggb's help text also doesn't explain that -n should be set equal to the number of expected homologous haplotypes within the pangenome.

Set -n equal to the number of haplotypes that you expect in your sample. For instance, if you had N=10 diploid genomes as input, you'd (typically) expect to see 2N=20 homologous copies of each locus. In this case, we should run pggb -n 20. If we just have 10 sequences, or 10 haploid genomes, we'd run pggb -n 10.
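
As a rough illustration of the rule above, the arithmetic can be sketched in a few lines of Python. This is only a sketch, not part of pggb: the helper names `pggb_n` and `pggb_command` are hypothetical, the file and directory names are placeholders, and the six cucumber genomes from the original question are assumed to be haploid assemblies. The command is only assembled and printed, never executed.

```python
def pggb_n(num_genomes: int, ploidy: int = 1) -> int:
    """Expected number of homologous haplotypes per locus.

    Hypothetical helper: genomes times ploidy, following the rule
    described in the comment above (10 diploids -> 20, 10 haploids -> 10).
    """
    if num_genomes < 1 or ploidy < 1:
        raise ValueError("num_genomes and ploidy must both be >= 1")
    return num_genomes * ploidy


def pggb_command(fasta: str, outdir: str, num_genomes: int, ploidy: int = 1) -> list[str]:
    """Assemble (but do not run) a pggb command line with -n filled in.

    The FASTA path and output directory are placeholders supplied by the caller.
    """
    n = pggb_n(num_genomes, ploidy)
    return ["pggb", "-i", fasta, "-n", str(n), "-o", outdir]


if __name__ == "__main__":
    # Assumption: six haploid cucumber assemblies in one (placeholder) FASTA file.
    print(" ".join(pggb_command("cucumber6.fa.gz", "cucumber6_out", 6)))
    # Ten diploid genomes: 2N = 20 haplotypes expected per locus.
    print(" ".join(pggb_command("diploid10.fa.gz", "diploid10_out", 10, ploidy=2)))
```

Under these assumptions, the six cucumber genomes would be run with -n 6 if each assembly represents a single haplotype, or -n 12 if each is a phased diploid assembly with both haplotypes included.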