Open subwaystation opened 4 years ago
I agree this is a bug and an issue. I don't really have ideas for where this would come from. There's a number of places like cut_points
or bins cutting Nodes in half. It could be a bad sort earlier. It's definitely high priority to resolve this.
This issue doesn't seem to crop up in the runB1phi1 test data. What is the " -g
option, "?
Looks like -g
removes both gap links (useless) as well as begin and end links to bin 0 (debatable but still useful info). The last departure column is the adjacent connectors. Importantly, this is based on the absence of evidence that a path uses a LinkColumn.
Currently output:
"departures": [
{
"upstream": 4,
"downstream": 12,
"participants": [
false,
false,
true
]
},
{
"upstream": 4,
"downstream": 5,
"participants": [
true,
true,
false
]
}
],
Based on the GFA:
S 1 AGGA
S 2 AGTG
S 3 TCC
S 4 TCTCAG
P 5 1+,3+,4+ 8M,1M,1M
P 5- 1+,3-,4+ 8M,1M,1M
P 6 1+,4+,2+ 8M,1M,1M
We can change the Node labels to be less confusing (matching the bin id).
S 1 AGGA
S 5 AGTG
S 9 TCC
S 12 TCTCAG
P "5" 1+,9+,12+
P "5-" 1+,9-,12+
P "6" 1+,12+,5+
Put this way we can see the links reflected in svg1.gfa.og.w1.g.json |
Path | without -g | with -g |
---|---|---|---|
"5" | [[0,1],[4,9],[17,0]] | [] | |
"5-" | [[0,1],[4,9],[17,0]] | [] | |
"6" | [[0,1],[4,12],[17,5],[8,0]] | [[4,12],[17,5]] |
[4,9] is skipped since they have no content in the next node. The "adjacent connector" is here because Nodes 5 and 9 were merged: "{"first_bin": 5,"last_bin": 11,"occupants": [true,true,true],
".
After looking at this for a long time, I'm not confident that this is an error in the output. It looks correct to me. There definitely are errors in the ordering of SARS-CoV-2 Link Columns, but this file does not appear to have those same errors.
I've posted v15 sparse CS outputs in this drive link for reference: [link]
Taking a look at this one
From Dmytro:
I know what happens in the tiny test.
There are three components: c1=[1,4], c2=[5,11], c3=[12,17]
. The routine add_ajacent_connector_colum
assumes they come exactly in this order c1-c2-c3
and adds the links to connect them: link1=[4,5], link2=[11,12]
. But actually the component order is c1-c3-c2
, and this is perfectly encoded by odgi and the main segmentation code: link1=[4,12], link2=[17,5]
.
I think add_ajacent_connector_colum is redundant for the component-component linking. The only useful place for this routine is to add the incoming/outgoing link to the very first/last component in the components natural order. (edited)
actually, this loop should do the job:
for comp in components:
if len(comp.departures) == 0: # no departures == add one link to the "outer space"
add_ajacent_connector_colum(comp, None)
if len(comp.arrivals) == 0: # no arrivals == add one link from the "outer space"
add_ajacent_connector_colum(None, comp)
with link participants == component path_ids
or some more careful work should be done to find the "loosen" paths and connect them to the "outer space".
Test data set: svg_opg1_issue_12052020.zip
Command lines used:
In the resulting JSON output, from e.g. line 182, I can observe the following:
But if you take a look at the
odgi bin
output, with including-g
option, only 2 links for participant number 3 aka6
are listed. No links at all should be present for the first two paths. So I am wondering where these links should come from? Am I missing something? @josiahseaman