graph-genome / component_segmentation

Read in ODGI Bin output and identify co-linear components
Apache License 2.0
3 stars 4 forks source link

CS might add links where none are given from `odgi bin` #48

Open subwaystation opened 4 years ago

subwaystation commented 4 years ago

Test data set: svg_opg1_issue_12052020.zip

Command lines used:

odgi build -g svg1.gfa -o svg1.gfa.og
/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 1 -s -g > svg1.gfa.og.w1.g.json
/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 1 -s > svg1.gfa.og.w1.json
python ~/git/component_segmentation/segmentation.py -j svg1.gfa.og.w1.json -f svg1.gfa.og.fasta -o ./cs_svg1_issue_12052020

In the resulting JSON output, from e.g. line 182, I can observe the following:

                    "upstream": 4,
                    "downstream": 5,
                    "participants": [
                        true,
                        true,
                        false
                    ]

But if you take a look at the odgi bin output, with including -g option, only 2 links for participant number 3 aka 6 are listed. No links at all should be present for the first two paths. So I am wondering where these links should come from? Am I missing something? @josiahseaman

josiahseaman commented 4 years ago

I agree this is a bug and an issue. I don't really have ideas for where this would come from. There's a number of places like cut_points or bins cutting Nodes in half. It could be a bad sort earlier. It's definitely high priority to resolve this.

josiahseaman commented 4 years ago

This issue doesn't seem to crop up in the runB1phi1 test data. What is the " -g option, "?

josiahseaman commented 4 years ago

Looks like -g removes both gap links (useless) as well as begin and end links to bin 0 (debatable but still useful info). The last departure column is the adjacent connectors. Importantly, this is based on the absence of evidence that a path uses a LinkColumn.

Currently output:

"departures": [
                {
                    "upstream": 4,
                    "downstream": 12,
                    "participants": [
                        false,
                        false,
                        true
                    ]
                },
                {
                    "upstream": 4,
                    "downstream": 5,
                    "participants": [
                        true,
                        true,
                        false
                    ]
                }
            ],

Based on the GFA:

S   1   AGGA
S   2   AGTG
S   3   TCC
S   4   TCTCAG
P   5   1+,3+,4+    8M,1M,1M
P   5-  1+,3-,4+    8M,1M,1M
P   6   1+,4+,2+    8M,1M,1M

We can change the Node labels to be less confusing (matching the bin id).

S   1   AGGA
S   5   AGTG
S   9   TCC
S   12  TCTCAG
P   "5" 1+,9+,12+
P   "5-"    1+,9-,12+
P   "6" 1+,12+,5+
Put this way we can see the links reflected in svg1.gfa.og.w1.g.json Path without -g with -g
"5" [[0,1],[4,9],[17,0]] []
"5-" [[0,1],[4,9],[17,0]] []
"6" [[0,1],[4,12],[17,5],[8,0]] [[4,12],[17,5]]

[4,9] is skipped since they have no content in the next node. The "adjacent connector" is here because Nodes 5 and 9 were merged: "{"first_bin": 5,"last_bin": 11,"occupants": [true,true,true],".

After looking at this for a long time, I'm not confident that this is an error in the output. It looks correct to me. There definitely are errors in the ordering of SARS-CoV-2 Link Columns, but this file does not appear to have those same errors.

I've posted v15 sparse CS outputs in this drive link for reference: [link]

mandosoft commented 4 years ago

Taking a look at this one

josiahseaman commented 4 years ago

From Dmytro: I know what happens in the tiny test.
There are three components: c1=[1,4], c2=[5,11], c3=[12,17]. The routine add_ajacent_connector_colum assumes they come exactly in this order c1-c2-c3 and adds the links to connect them: link1=[4,5], link2=[11,12]. But actually the component order is c1-c3-c2, and this is perfectly encoded by odgi and the main segmentation code: link1=[4,12], link2=[17,5].

I think add_ajacent_connector_colum is redundant for the component-component linking. The only useful place for this routine is to add the incoming/outgoing link to the very first/last component in the components natural order. (edited)

actually, this loop should do the job:

for comp in components:
    if len(comp.departures) == 0:  # no departures == add one link to the "outer space"
        add_ajacent_connector_colum(comp, None)
    if len(comp.arrivals) == 0:  # no arrivals == add one link from the "outer space"
        add_ajacent_connector_colum(None, comp)

with link participants == component path_ids

or some more careful work should be done to find the "loosen" paths and connect them to the "outer space".