Open ASLeonard opened 2 days ago
That is not good! Definitely unintended behavior if this is whats happening. @AndreaGuarracino did you develop the C++ integration of this? We were doing it in pggb before.
From the git blame, it looks like 64506f7 introduced the multiple smoothing rounds inside smoothxg, but the input graph to smoothxg::prep(args::get(gfa_in)...
was never changed to the iterated version. I'm testing a local PR now.
The difference looks quite small, but seems real.
Current behaviour
[smoothxg::(1-3)::smooth_and_lace] sorted 194091 path fragments
...
[smoothxg::(1-3)::main] smoothed graph length 5626665bp in 132279 nodes
...
[smoothxg::(2-3)::smooth_and_lace] sorted 151939 path fragments
...
[smoothxg::(2-3)::main] smoothed graph length 5625268bp in 132364 nodes
...
[smoothxg::(3-3)::smooth_and_lace] sorted 123654 path fragments
...
[smoothxg::(3-3)::main] smoothed graph length 5624377bp in 132402 nodes
After #212
[smoothxg::(1-3)::smooth_and_lace] sorted 196018 path fragments
...
[smoothxg::(1-3)::main] smoothed graph length 5625245bp in 132322 nodes
...
[smoothxg::(2-3)::smooth_and_lace] sorted 147034 path fragments
...
[smoothxg::(2-3)::main] smoothed graph length 5622784bp in 132596 nodes
...
[smoothxg::(3-3)::smooth_and_lace] sorted 118948 path fragments
...
[smoothxg::(3-3)::main] smoothed graph length 5621060bp in 132906 nodes
It does give the impression that the graph length is going down after your fix. But it's hard to be sure with just one test.
I was looking around some of the smoothxg internals to see if writing
.og
files between iterations would reduce the IO time (which I'll save for another issue), but came across a potential large bug.It looks like at this stage the
gfa_in
variable from the-g
flag is read for the prep stage during every iteration. So even though the prep stage is writing out to a different file (e.g. smooth.1.gfa.prep.2.fa), the input is always the same.https://github.com/pangenome/smoothxg/blob/66b17aeead7980f517fed8cc9002bdb2e86a0572/src/main.cpp#L432
I hope I've missed something, but this would imply that only the final smoothing iteration matters (as it still loads the original graph). It seems if you do use
--no-prep
(which then uses thepath_input_gfa
), this does use the output of the previous iteration, and so does benefit from multiple iterations.https://github.com/pangenome/smoothxg/blob/66b17aeead7980f517fed8cc9002bdb2e86a0572/src/main.cpp#L436
I'm not 100% sure my interpretation of the code is right and the nondeterminism in the prep step makes it hard to test properly. The output of running 3 iterations versus only the last does differ slightly, but has identical stats during the prep and smoothable_blocks step.