Add defined behavior for duplicate edges

marbl / MetagenomeScope

Visualization tool for (meta)genome assembly graphs

https://marbl.github.io/MetagenomeScope/

GNU General Public License v3.0

24 stars 8 forks source link

Add defined behavior for duplicate edges #75

Open fedarko opened 7 years ago

fedarko commented 7 years ago

From @fedarko on August 4, 2017 1:41

I think they're used in layout but not rendered at present? ==> Yep, just looked over my code and that seems to be the case. Options:

raise error in collate.py
give a warning and don't incorporate them in layout -- leaning toward this, or the above
incorporate them in layout and don't render them in JS (this isn't a good idea, but it's what's happening now)
incorporate them in layout and render them in JS
incorporate them in layout and have them toggleable in JS?

Copied from original issue: fedarko/MetagenomeScope#258

fedarko commented 7 years ago

also need to consider how these edges affect statistics; right now they're all included in the total edge count, but that behavior isn't necessarily set in stone

fedarko commented 7 years ago

Another downside of implicit edge IDs (described in the message for commit d7eaf4a above) is that it slightly messes up the ID displayed for edges adjacent to collapsed clusters (since the source/target change). All that's different is the ID displayed in the selected info table is different -- not a big deal, but still the sort of thing that shows that this workaround is an inconvenience and should eventually be reverted.

fedarko commented 6 years ago

It's looking like certain GFA files, in particular, can have these defined (I think fastg2gfa can produce them). So supporting this behavior (likely through just the "warning" thing, or maybe just ignoring them unless the user requests the warnings to be output) would be a good option.

We could also output, like, a warnings.txt file or something that contains all duplicated edges. That'd avoid the costs of printing all those duplicates to stdout.

Anyway, this is kind of a lower priority issue since tools like gfaview can apparently be used to remove redundant edges in files like this. I'd still like to get this addressed, though.

fedarko commented 6 years ago

A simple solution for this is to just check (upon parsing any given edge declaration) if the edge in question currently exists in the graph. (Either by checking nodeid2outgoingnodeids for GFA and FASTG files, or by checking nodeid2obj[src_id].outgoing_nodes for LastGraph and GML files.) If so, add the edge; if not, ignore it.

Pseudocode for the GFA/FASTG approach:

if src_id in nodeid2outgoingnodeids:
    if snk_id in nodeid2outgoingnodeids[src_id]:
        continue

Pseudocode for the LastGraph/GML approach:

if nodeid2obj[snk_id] in nodeid2obj[src_id].outgoing_nodes:
    continue

For filetypes which have implied edges (LastGraph/sometimes GFA), this also takes care of redundant edges brought about by implication like following --

A -> -B
B -> -A

-- in which the first edge implies the second edge. In this case, we'd just parse the first edge declaration, and then when we'd get to the second one it would already be in the graph, so we could just ignore it.

fedarko commented 5 years ago

Yeah, easiest solution is just throwing an error if these are found. no need to support these as far as i can tell

fedarko commented 4 years ago

If we want to support showing de Bruijn graphs in the more "conventional" way (e.g. SPAdes graphs with edges actually shown as edges, unlike Bandage where the edges are shown as nodes) then we'll need to cross this bridge eventually. Likely not urgent for now, though, since this shift would likely involve a lot of other changes besides this (e.g. completely different pattern detection algorithms, I think...?).

FWIW AGB does support visualizing these sorts of graphs