Smoothing iterations do not use result of previous iteration when preparing graph

ASLeonard commented 2 days ago

I was looking around some of the smoothxg internals to see if writing .og files between iterations would reduce the IO time (which I'll save for another issue), but came across a potential large bug.

It looks like at this stage the gfa_in variable from the -g flag is read for the prep stage during every iteration. So even though the prep stage is writing out to a different file (e.g. smooth.1.gfa.prep.2.fa), the input is always the same.

https://github.com/pangenome/smoothxg/blob/66b17aeead7980f517fed8cc9002bdb2e86a0572/src/main.cpp#L432

I hope I've missed something, but this would imply that only the final smoothing iteration matters (as it still loads the original graph). It seems if you do use --no-prep (which then uses the path_input_gfa), this does use the output of the previous iteration, and so does benefit from multiple iterations.

https://github.com/pangenome/smoothxg/blob/66b17aeead7980f517fed8cc9002bdb2e86a0572/src/main.cpp#L436

I'm not 100% sure my interpretation of the code is right and the nondeterminism in the prep step makes it hard to test properly. The output of running 3 iterations versus only the last does differ slightly, but has identical stats during the prep and smoothable_blocks step.

Alex

ekg commented 2 days ago

That is not good! Definitely unintended behavior if this is whats happening. @AndreaGuarracino did you develop the C++ integration of this? We were doing it in pggb before.

ASLeonard commented 1 day ago

From the git blame, it looks like 64506f7 introduced the multiple smoothing rounds inside smoothxg, but the input graph to smoothxg::prep(args::get(gfa_in)... was never changed to the iterated version. I'm testing a local PR now.

ASLeonard commented 1 day ago

The difference looks quite small, but seems real.

Current behaviour

[smoothxg::(1-3)::smooth_and_lace] sorted 194091 path fragments
...
[smoothxg::(1-3)::main] smoothed graph length 5626665bp in 132279 nodes
...
[smoothxg::(2-3)::smooth_and_lace] sorted 151939 path fragments
...
[smoothxg::(2-3)::main] smoothed graph length 5625268bp in 132364 nodes
...
[smoothxg::(3-3)::smooth_and_lace] sorted 123654 path fragments
...
[smoothxg::(3-3)::main] smoothed graph length 5624377bp in 132402 nodes

After #212

[smoothxg::(1-3)::smooth_and_lace] sorted 196018 path fragments
...
[smoothxg::(1-3)::main] smoothed graph length 5625245bp in 132322 nodes
...
[smoothxg::(2-3)::smooth_and_lace] sorted 147034 path fragments
...
[smoothxg::(2-3)::main] smoothed graph length 5622784bp in 132596 nodes
...
[smoothxg::(3-3)::smooth_and_lace] sorted 118948 path fragments
...
[smoothxg::(3-3)::main] smoothed graph length 5621060bp in 132906 nodes

ekg commented 1 day ago

It does give the impression that the graph length is going down after your fix. But it's hard to be sure with just one test.

pangenome / smoothxg

Smoothing iterations do not use result of previous iteration when preparing graph #210