qwx9 / strangepg

strange pangenome and large graph visualization
MIT License
10 stars 0 forks source link
coarsening external-memory gfa-format graph-visualization indexing large-graph pangenome pangenome-graph scalability visualization

Strange pangenome scale visualization

Interactive visualization of large genome graphs in GFAv1 format à la Bandage.

Sales pitch:

Visualizing large graphs in the million node scale and beyond remains a challenge and is relevant to multiple fields of study. In
pangenomics, as databases are continuously enriched by new and high quality assemblies, there is a growing need for tools and
methods tailored to handle extremely large volumes of data. As pangenome sizes multiply, available techniques quickly hit
operational limits in terms of processing time and memory. In particular, visualizing graphs in a general, intuitive and interactive
manner is useful for analysis and interpretation, yet computationally prohibitive at such a scale. The main objective of this
project is the development of strangepg, a new tool and visualization workflow aiming to address these limitations by employing
the combination of offline indexing and graph coarsening with the extensive use of a generic external memory layer. By
offloading the major computational effort to a preprocessing step, the interactive visualizer loads only a small fraction of the
total data and fetches more from disk only on request. The use of external memory ensures that every step can be performed on
commodity hardware for arbitrary sized graphs. strangepg is being implemented in C in a highly modular, portable and
extensible manner, designed to allow substituting different algorithms for most steps with minimal effort, and aims to provide a
general framework for experimentation with layouting and new visualization techniques.

Named in reference to the Dr. Strangelove character in Stanley Kubrick's Dr. Strangelove or: how I learned to stop worrying and love the bomb (1964), strangepg provides an interactive visualization of bidirected and undirected graphs in a familiar force-directed layout but aims to scale to hundreds of millions of nodes and beyond. Such is the size of pangenome graphs today, the direct application for this tool. A big emphasis is placed on performance, reactivity and immediate feedback. While it currently supports only GFA files as input, it will in future be extended to support other formats (GraphML, DOT, Newick, etc) and types of trees (Newick for phylogenetic trees, etc.).

Note: this is a work in progress and under heavy development; significant or breaking changes happen all the time, but always prefer building from source on the latest commit to other methods of installation in order to get the latest fixes and additions. Git tags do not mark stable releases. Please consider this to be a public beta of sorts and feel free to send bug reports, feature requests or comments. Thanks!

Table of contents

Features

Without coarsening, the current layouting algorithm while a parallelized and slightly improved version of the classic Fruchterman-Reingold [1] force-directed algorithm, is still slow for 10k+ node graphs. It will however be adequate for a coarsened graph since it only ever works on whatever is currently loaded; other algorithms (SGD2, FM3, etc.) could later be implemented as well. Right now, because layouting is parallelized, in real-time and can be interacted with, not that much is left to make it more useful in practice.

Near finished:

Near future:

Future:

Released under the terms of the MIT license.

TL;DR

Mandatory arguments: a graph in GFA format, always as the last command line parameter.

Load a gfa file named some.gfa:

strangepg some.gfa

Increase number of threads for layouting:

strangepg -t 8 some.gfa

Select a different layout algorithm:

strangepg -l fr some.gfa

Load a CSV file named some.csv:

strangepg -c some.csv some.gfa

Load layout from a file named some.lay (note: the layout file format is strangepg-specific):

strangepg -f some.lay some.gfa

Installation

Currently only Linux (and 9front) are supported. A macOS port will arrive as soon as someone kindly sacrifices their laptop for a while.

Installation can be done from source or via bioconda.

Hardware requirements

On Linux, a graphics card with OpenGL 4.1 support is required. The standard was introduced in 2010-2011. Intel integrated HD Graphics cards from 2013 (Ivy Bridge), and nVIDIA and AMD/ATI cards from 2010 on should work.

Bioconda

Install conda, add the bioconda channel, then:

conda install strangepg

Compilation

git clone https://github.com/qwx9/strangepg
cd strangepg
make -j install

-j is an optional flag to enable parallel building using all available cores. This installs the binaries strangepg and strawk, by default in $HOME/.local/bin. If this directory is not in your $PATH or a different installation directory is desired, see Additional compilation settings below.

NOTE: manual compilation with clang is recommended as it currently produces noticeably faster binaries. Use the CC make variable to force compilation with it. To my knowledge, bioconda binaries are built with GCC and thus may have slightly lower performance. I do not yet know why.

Dependencies

strangepg requires an OpenGL implementation and X11 libraries. Those are usually already present on typical Linux systems. Wayland is not supported natively.

Command line for ubuntu (adapt to your system):

apt install libbsd0 libgl-dev libglvnd-dev libglx-dev libmd0 libx11-dev libxau-dev libxcb1-dev libxcursor-dev libxdmcp-dev libxext-dev libxfixes-dev libxi-dev libxrandr-dev 

Tested with gcc 11.4.0 and 13.2.0, and clang 14.0.0 and 17.0.6 on Void Linux and Ubuntu 22.04/24.04.

Usage

strangepg requires at least one input file as argument. It currently supports graphs in GFA format.

strangepg file.gfa

Some test examples exist in the test/ directory. For example:

strangepg test/03.232.gfa

Command-line options

$ strangepg -h
usage: strangepg [-Zbhvw] [-f FILE] [-l ALG] [-t N] [-c FILE] FILE
-b             White-on-black color theme
-c FILE        Load tags from csv FILE
-f FILE        Load layout from FILE
-l ALG         Set layouting algorithm (default: pfr)
-t N           Set number of layouting threads (1-128, default: 4)
-w             Force layouting to wait until all inputs are loaded
-Z             Minimize node depth (z-axis) offsets in 2d layouts

The most important options are:

Optional input files:

Additional settings:

Drawing options:

Layouting

Layouting is performed and visualized in real time and in parallel. Currently all available layout algorithms are based on a spring model force-directed approach, and are variations of the classic Fruchterman-Reingold algorithm [1]. Parameters and heuristics are hand-tuned and may require further adjustment for better results, or may warrant better approaches. Because the initial state is random, results are different every time, and due to trade-offs for speed some may be better than others. The real-time aspect of the application allows the user to easily restart layouting if the result so far is unsatisfactory, or to guide it by loading more information or moving nodes manually.

Here the approach to custom or application-specific layouting is to combine a basic layout algorithm with specifying a partial or full initial state and/or adding constraints. For example, a linear layout may be achieved by fixing the x coordinate of some or all nodes to one or more constant values. 3D layouting works in the same way -- all layouts are actually in 3D space, but 2D algorithms ignore the 3rd axis. Reproducible layouting is possible by saving layouts to file and later loading them again.

The basic layouting algorithm serves as a backbone for more specific visualizations, changing the type of geometry: circular, spherical, non-euclidean, etc. These basic additional layouts are currently under development.

Basic layouting

Available algorithms:

Use the -t command line parameter to change the number of threads used for layouting, 4 by default. Single threaded algorithms will always only use one of them.

Interaction

The following keyboard shortcuts are available:

Nodes may be dragged around with the mouse. Moving nodes does not fix their position to a constant position, and if layouting is currently underway it will influence the layout as a whole. Nodes can be fixed via tags.

Restarting it is cheap; try it if the layout doesn't look good. Intermediate or final results can be saved to file, then used as an initial state for another round of layouting.

Loading from and saving to file

A pre-existing layout file may be loaded with the -f command line parameter, irrespective of the selected layout algorithm. The file format is binary and specific to strangepg. Currently it's assumed that the layout file contains the exact same number of segments (S records) and in the same order of appearance (in either S or L records) as the originating GFA.

The current layout may be exported or imported at runtime with the exportlayout("file") and importlayout("file") functions (see Graph manipulation).

Navigation

Moving the graph around is done primarily with the mouse.

Keyboard shortcuts:

A status window currently labeled Prompt presents a text box to write commands in, and shows selected and hovered over objects. Currently, it only shows the name and length (nodes) or endpoints, orientation and CIGAR string (edges), but will be extended to show all of a node's tags.

The window can be moved around with the mouse, collapsed by clicking the top right button, or resized by dragging the lower right corner.

Graph manipulation

strangepg embeds a simple graph manipulation language, which presents tags as tables (associative arrays) and provides means to manipulate the graph and its properties via those tags. Using it involves typing commands in the text prompt of the status window.

Whenever a node's tag is loaded from a GFA or CSV file or on the prompt, the table with the same name is updated with the new value. For example, in GFA files the LN tag indicates the length of a node's sequence. Upon loading the GFA file, each S line with a non-empty sequence and/or an LN tag will add an element to the LN table. LN[name] then stores the value for a node labeled name in the GFA. This can be used for instance to change the color of all nodes matching some condition such as LN[name] < 1000 (more examples below).

The editing box is currently ugly and inconvenient due to limitations of the UI framework used, but this will be fixed soon. It has clipboard support (Ctrl-C for copy, Ctrl-V for paste, mouse selection).

The language is a fork of awk, hacked up to evaluate commands interactively. Any awk code valid within a pattern (inside braces) is also valid here, but functions cannot be defined yet. Currently, it's somewhat limited and a bit hacky, but it works. It will be improved later on.

Examples

Color nodes with labels matching a regular expression:

CL[i ~ /chr13/] = green

Color nodes with sequence length > 50:

CL[LN[i] > 50] = darkgreen

Color nodes which have a certain tag defined:

CL[i in CN] = purple

Color a specific node:

nodecolor("s1", red)
# also works but will loop through all nodes:
CL[i == "s1"] = red

Color 42th node in order of appearance in GFA (in S or L records):

nodecolor(label[42], blue)
# also works but will loop through all nodes:
CL[node[i] == 42] = blue

Color every other node:

CL[node[i] % 2 == 1] = orange

Look up a node by name, zooming in and centering on it if it exists:

findnode("s14")

Save current layout to file:

exportlayout("filepath")

Load existing layout from file:

importlayout("filepath")

Read in a CSV file:

readcsv("filepath")

See strawk.md for a more detailed overview.

Loading tags from CSV files

CSV files can be loaded at start up with the -c flag or at runtime uuto feed or modify tags for existing nodes. The '-c' flag may be used multiple times to load more than one CSV file at startup.

The first column is always reserved for node labels, and all subsequent columns are tags values. The first line must be a header specifying a tag name for each column.

For example:

Node,CN,CO,CL
utig4-142,1,55231,purple
utig4-143,0,53,red
...

The name of the first column does not matter. The CL tag is used as a node's color and can thus also be set in this way. Color can also be used. Node labels must refer to existing nodes from the input GFA file.

A CSV file may be loaded at runtime with the readcsv("file.csv") command (see Graph manipulation).

Loading multiple CSV files one after the other is allowed. In other words, variables such as color are not reset between files. CSV files thus needn't be merged together.

Format notes

The accepted format here is more stringent than usual. The implementation is not localized, ie. commas are always field separators. There are no escape sequences or special quoting rules, ie. quotes are not special characters. Line breaks within a field are disallowed. Lines must be terminated with a new line (LF) character, ie. the return character (CR) is not handled.

Each line must have the same number of fields as the header, but fields may be empty.

Example applications

One of the goals of strangepg is to enable experimentation with layouting. Currently, the default layouting algorithm honors a set of tags which set initial coordinates and/or fixes them (makes them unmovable).

They can be loaded from the GFA itself, from CSV or live by using the prompt.

See strawk.md for a more detailed overview.

Conga-line: linear layout in GFA segments order

fx[1] = 8 * node[i]
fy[1] = 0

Random coordinates

fx[1] = 8 * 1024 * node[i]
fy[1] = 8 * 1280 * rand()

Linear layout using bubble id tags from gaftools

min = 999999
max = -999999
for(i in BO) if(BO[i] < min) min = BO[i]; if(BO[i] > max) max = BO[i]
fx[CL[i] == orange] = 8 * (BO[i] - min - (max - min) / 2)
fy[CL[i] == orange] = 0
x0[CL[i] != orange] = 8 * (BO[i] - min - (max - min) / 2)
y0[CL[i] != orange] = 2 - 4 * rand() 

Linear layout of a DeBrujin graphs mapped back to a reference

... ...


## <a name="compilationsettings"></a>Additional compilation settings

#### Installation prefix

By default, the installation path is `$HOME/.local/bin`.
To install to a different directory, use the `PREFIX` make variable:

```bash
# example: install in /usr/local/bin via sudo:
make -j
sudo make PREFIX=/usr/local install

Compiler

To build using a different compiler, eg. clang, use the CC make variable:

make CC=clang -j

Tested with clang and gcc only.

Known bugs

Major bugs:

Used and bundled alien software

Data structures:

Linux graphics:

Used but not bundled:

strawk is based on onetrueawk.

9front

Build with mk instead of make in the usual manner. Additionally requires npe; it's really only required to build khash. A better solution might exist since SDL2 isn't used at all.

References

[1] Fruchterman, Thomas M. J.; Reingold, Edward M. (1991), "Graph Drawing by Force-Directed Placement", Software: Practice and Experience, 21 (11), Wiley: 1129–1164, doi:10.1002/spe.4380211102, S2CID 31468174