I was thinking about scaffolding last night. I mentioned in #101 some ideas and future optimizations but thought I'd do a brain dump here as well.
Here's an idea for a "SplitRepeat" job: take the current framework, but for each walk node that has 2+ back-edges (incoming edges), request each candidate's score according to the read sequences in those back-edges. When two candidates have nearly equal scores, we can turn to the back-edge scores to try to split the repeated frontier node (or some portion of the walk).
In ascii art:
             b1
               \  ----c1
    w1--w2--w3--f<
                  ----c2
In the simple version of this algorithm, we request scores for the subkmers f--c1 and f--c2 from b1, which aggregates its score to f just like the walk nodes w1-w3 do. If we get a "strong" signal from b1 about one of the paths (say, b1--f--c1) and it's not the same as the path that's strong in the walk (say, w1--w2--w3--f--c2), then we could assume that f is a repeat node shared by the two different paths. We could then split f into f1 and f2, resulting in the paths b1--f1--c1 and w1--w2--w3--f2--c2.
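To make the decision rule concrete, here is a rough Python sketch of the simple version. Everything in it is an assumption for illustration, not code from the framework: the function names `best_and_margin` and `split_repeat`, the plain dicts of per-candidate scores, and the `strong_margin` threshold standing in for whatever "strong" ends up meaning.

```python
from typing import Dict, Optional, Tuple


def best_and_margin(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the top-scoring candidate and its lead over the runner-up."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best, best_score - runner_up


def split_repeat(
    frontier: str,
    back_edge: str,
    walk_scores: Dict[str, float],
    back_edge_scores: Dict[str, float],
    strong_margin: float = 3.0,  # hypothetical threshold for a "strong" signal
) -> Optional[Dict[str, Tuple[str, str]]]:
    """Decide whether the frontier node looks like a repeat shared by two paths.

    Returns a mapping {new_node: (source, candidate)} when the walk and the
    back-edge each strongly prefer a different candidate, otherwise None.
    """
    walk_best, walk_margin = best_and_margin(walk_scores)
    back_best, back_margin = best_and_margin(back_edge_scores)

    # Both signals must be strong on their own.
    if walk_margin < strong_margin or back_margin < strong_margin:
        return None
    # If they agree, there is nothing to split.
    if walk_best == back_best:
        return None

    # Split f into f1 (reached from the back-edge) and f2 (reached from the walk).
    return {
        frontier + "1": (back_edge, back_best),
        frontier + "2": ("walk", walk_best),
    }


if __name__ == "__main__":
    # The walk w1--w2--w3 favors c2, while b1 favors c1.
    walk = {"c1": 1.0, "c2": 6.0}
    b1 = {"c1": 5.0, "c2": 0.5}
    print(split_repeat("f", "b1", walk, b1))
```

With the scores in the example, the walk prefers c2 and b1 prefers c1, both by more than the threshold, so the sketch proposes f1 for b1--f1--c1 and f2 for the walk's path to c2. If b1 agreed with the walk, or if either signal were weak, it would return `None` and leave f alone.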
@anbangx @Elmira88 @JavierJia @Nan-Zhang