neherlab / pangraph

A bioinformatic toolkit to align genome assemblies into pangenome graphs
https://neherlab.github.io/pangraph
MIT License
77 stars 7 forks source link

Fix/issue 56 #57

Closed mmolari closed 1 year ago

mmolari commented 1 year ago

Fixes issue 56, due to an edge-case in the block-merging procedure. It also:

vercel[bot] commented 1 year ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
pangraph ❌ Failed (Inspect) Jun 19, 2023 2:52pm
ivan-aksamentov commented 1 year ago

@mmolari Just a few thoughts (perhaps you've already considered some of these):

The -v flag is most commonly used in command-line interfaces to show the version of the executable (alias for --version), or sometimes to request verbose mode (so that the tool prints more output; alias for --verbose). In order to avoid breaking of the public interface of the software in the future, we should probably consider another letter for the argument requesting checks. Maybe --consistency-checks and -C would work better?

It is also customary to expose random seeds as CLI arguments (or as function parameters, in libraries). This way you don't need to invent any internal magic, such as deducing seeds from input data or from current weather, and users requiring deterministic, reproducible results could simply provide the same value through CLI (e.g. --random-seed=12345) consistently for each run.

From my experience, in order to achieve reproducible results, it is essential for multi-threaded applications (where thread scheduling is determined by the operating system and is outside of control of the application) to either ensure consistent ordering when joining results (by e.g. sorting them according to the order in the original inputs) or also by allowing users to opt-out from multi-threading (with something like --jobs=1).

If you opt for sorting or similar technique, it sometimes also makes sense to allow the user to toggle the sorting on and off, with something like --ensure-order (speed vs accuracy trade-off).

In general, offloading decisions to the user often makes code simpler and reduces the feeling of magic when using a piece of software, let alone that it reduces amount of work for you :). Also, advanced users often appreciate more control.

Although this is not without downsides - CLI args is a public interface to maintain forever, and when there's many flags they may make the interface more difficult to use for beginners (or to use it correctly). So sane defaults are very important.

mmolari commented 1 year ago

Thanks @ivan-aksamentov! All of these suggestions are much appreciated.

Concerning the flag, I originally chose -v because I wanted it to be sort of a verbose mode, in which other than performing consistency checks also some more explicit logging of the process would be performed. But you are right, it's better to separate the two. I changed it to -t and --test, since -c was already taken.

For the random seeding, I was thinking of setting always the same random seed, and not give the user the option to control it, since the only thing that this seed will control is the random name of blocks. It should not impact the graph structure in any other way. And I do the seeding in a way that is robust to parallelization. Irrespective of number of threads and scheduling of operations, the results are always the same and saved in the same order. For these reasons I was thinking of setting a standard random seed in the code and not exposing any interface to change it. But do you think it's better to still give the user the option?

mmolari commented 1 year ago

This pull request: