marbl / MetagenomeScope

Visualization tool for (meta)genome assembly graphs
https://marbl.github.io/MetagenomeScope/
GNU General Public License v3.0

[Feature Request] Output a graph file that could be used elsewhere (e.g., NetworkX, iGraph, or Cytoscape) #234

Closed jolespin closed 2 years ago

jolespin commented 2 years ago

I saw the notice that you were refactoring the code. If you are still working on it, would you be able to add an option that outputs an edge list file?

[node_a]<tab>[node_b]<tab>[edge_weight]

fedarko commented 2 years ago

Thanks for the suggestion! Yes, I'm still working on this; it's one of a few things on my plate at the moment.

I'm not sure how easy it'd be to add a command-line utility for this -- it may be a bit out of scope for the project right now -- but it's possible to do this in a few lines of Python code if you have MetagenomeScope installed. We can use the assembly graph parser to take care of this, as shown below.

(I just updated the codebase earlier today to make the installation and use of submodules like assembly_graph_parser easier, so I recommend updating if the code below doesn't work for you.)

import metagenomescope as mgsc
g = mgsc.assembly_graph_parser.parse("my-graph-file.lastgraph")
# g is now a NetworkX object describing the assembly graph.

# Create a TSV file describing this graph's edges.
# Note that different graph filetypes will have different names for edge weights
# (and that some graph filetypes may not even include edge weights).
# For example, Velvet graphs will use "multiplicity" as the name for their edge weights.
out_text = "SourceNode\tTargetNode\tEdgeWeight\n"
for e in g.edges:
    out_text += f"{e[0]}\t{e[1]}\t{g.edges[e]['multiplicity']}\n"

with open("edge-list-file.tsv", "w") as ef:
    ef.write(out_text)
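
Because the weight attribute's name varies by filetype (and may be absent), the loop above can be generalized. This is only a minimal sketch, not part of MetagenomeScope: `write_edge_tsv` and its `weight_key` / `missing` parameters are hypothetical names, and it assumes edges arrive as (source, target, attribute-dict) triples, which is what NetworkX's `g.edges(data=True)` yields. The toy edges stand in for a parsed graph.

```python
import csv
import io


def write_edge_tsv(edge_triples, out, weight_key="multiplicity", missing="NA"):
    """Write (source, target, attrs) triples to `out` as a TSV edge list.

    Hypothetical helper, not part of MetagenomeScope. `weight_key` names the
    attribute holding the edge weight ("multiplicity" for Velvet graphs).
    Edges lacking that attribute get the `missing` placeholder instead of
    raising a KeyError.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["SourceNode", "TargetNode", "EdgeWeight"])
    n_written = 0
    for source, target, attrs in edge_triples:
        writer.writerow([source, target, attrs.get(weight_key, missing)])
        n_written += 1
    return n_written


# Toy stand-in for g.edges(data=True); the second edge has no weight.
demo_edges = [("1", "2", {"multiplicity": 5}), ("2", "-3", {})]
buf = io.StringIO()
n_rows = write_edge_tsv(demo_edges, buf)
demo_tsv = buf.getvalue()
print(demo_tsv)
```

With a real graph, something like `write_edge_tsv(g.edges(data=True), ef)` inside `with open("edge-list-file.tsv", "w", newline="") as ef:` should behave the same way, provided `weight_key` matches whatever attribute your filetype actually uses.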

jolespin commented 2 years ago

Awesome! Which file from metaSPAdes corresponds to my-graph-file.lastgraph? Is that the FASTG, paths, or GFA file?

fedarko commented 2 years ago

It will be either the FASTG or the GFA file. In the context of SPAdes output, I think these two graph representations should be equivalent per the documentation (http://cab.spbu.ru/files/release3.12.0/manual.html#sec3.5), although I can't guarantee this at the moment.

On Fri, Jun 24, 2022, 12:39 AM Josh L. Espinoza @.***> wrote:

Awesome, which file from metaSPAdes is the my-graph-file.lastgraph. Is that the fastg, paths, or gfa file?


fedarko commented 2 years ago

After thinking about this some more: I should clarify that the definition of "edge" can be tricky, depending on which assembler produced the input assembly graph. In the case of SPAdes, the assembled sequences are stored as edges in the original de Bruijn graph, but the GFA and FASTG file formats both implicitly convert these edges to nodes. (I'm not 100% sure whether this applies to SPAdes' GFA output, because I don't have much recent experience with SPAdes, but I know this is what most de-Bruijn-graph-based assemblers do when outputting to GFA.)

When either the GFA or the FASTG file from SPAdes is loaded in MetagenomeScope (or in most other visualization tools for assembly graphs), the graph produced (g, in the code example above) will be "flipped": nodes in the loaded graph really correspond to edge sequences in the original de Bruijn graph, and edges in the loaded graph really correspond to nodes (connections between edges) in the original de Bruijn graph.
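
To make the "flip" concrete, here is a small self-contained sketch on toy data. The sequence names and node labels are invented, and `flip_edges_to_nodes` is a hypothetical helper, not a MetagenomeScope or SPAdes function. It also ignores reverse complements and overlaps, which real assembly graphs have to handle.

```python
from collections import defaultdict


def flip_edges_to_nodes(dbg_edges):
    """Turn sequence-labelled de Bruijn edges into sequence-to-sequence links.

    `dbg_edges` maps a sequence name to the (start_node, end_node) pair it
    spans in the original de Bruijn graph. In the result, sequence A links to
    sequence B whenever A ends at the node where B starts.
    """
    # Index sequences by the de Bruijn node they start from.
    starts_at = defaultdict(list)
    for name, (start, _end) in dbg_edges.items():
        starts_at[start].append(name)

    # Each shared de Bruijn node becomes one or more links between sequences.
    links = []
    for name, (_start, end) in dbg_edges.items():
        for successor in starts_at.get(end, []):
            links.append((name, successor))
    return sorted(links)


# Toy de Bruijn graph: three assembled sequences stored on edges.
toy_dbg = {
    "seqA": ("n1", "n2"),
    "seqB": ("n2", "n3"),
    "seqC": ("n2", "n4"),
}
flipped = flip_edges_to_nodes(toy_dbg)
print(flipped)  # [('seqA', 'seqB'), ('seqA', 'seqC')]
```

Here the de Bruijn node `n2` turns into two links in the flipped graph, and the sequences themselves become the nodes.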

This distinction is important for this question, because I'm not sure if you wanted

  1. a list of edges in the original de Bruijn representation (in which each line in the TSV file represents an assembled sequence), or
  2. a list of edges in the second representation (in which each line in the TSV file represents a connection between assembled sequences).

The method I described above will produce the second type of TSV file.
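
For the first type of file, one option is to write out the nodes of the loaded graph rather than its edges, since each node there stands for an assembled sequence. A rough sketch, with loud assumptions: `write_node_tsv` is a hypothetical helper (not part of MetagenomeScope), the input mimics what NetworkX's `g.nodes(data=True)` yields, and the attribute name "length" is made up for illustration, so check what your parsed graph actually stores.

```python
import csv
import io


def write_node_tsv(node_pairs, out, columns, missing="NA"):
    """Write one TSV line per node (i.e., per assembled sequence).

    `node_pairs` is an iterable of (node_id, attrs) pairs. `columns` lists the
    attribute names to export; absent attributes are written as `missing`.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Node", *columns])
    for node, attrs in node_pairs:
        writer.writerow([node, *(attrs.get(col, missing) for col in columns)])


# Toy stand-in for g.nodes(data=True); attribute names are illustrative only.
demo_nodes = [("1", {"length": 1200}), ("-1", {"length": 1200}), ("2", {})]
buf = io.StringIO()
write_node_tsv(demo_nodes, buf, columns=["length"])
node_tsv = buf.getvalue()
print(node_tsv)
```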