rjchallis / assembly-stats

Assembly statistic visualisation
http://genomehubs.org
MIT License

assembly-stats

=========================================================================

Updated code available

The code in this repository is no longer actively maintained. The latest code for generating snail plots is part of the BlobToolKit project. BlobToolKit supports interactive (see the public viewer instance) and static assembly visualisation. To generate an assembly-stats snail plot on the command line, install blobtoolkit and follow the command line instructions.

=========================================================================

Legacy code README


Assembly metric visualisations to facilitate rapid assessment and comparison of assembly quality.

Live demo

Latest and most complete documentation is available at assembly-stats.readme.io

Description

A de novo genome assembly can be summarised by a number of metrics, including:

assembly-stats supports two widely used presentations of these values, tabular and cumulative length plots, and introduces an additional circular plot that summarises most commonly used assembly metrics in a single visualisation. Each of these presentations is generated using javascript from a common (JSON) data structure, allowing toggling between alternative views, and each can be applied to a single or multiple assemblies to allow direct comparison of alternate assemblies.

Tabular presentation allows direct comparison of exact values between assemblies. Its limitations lie in the necessary omission of distributions and the challenge of interpreting ratios of values that may vary by several orders of magnitude.

Screenshot

Cumulative scaffold length plots are highly effective for comparing two or more assemblies: plotting both on a single set of axes reveals differences in assembled size and the N50 count very clearly. However, other metrics must still be tabulated or annotated on the plot; for example, N50 length and the longest scaffold length can be particularly difficult to determine from the plot alone. The scale for the axes is usually chosen to accommodate the data for a single assembly or set of assemblies, so comparing assemblies that have been plotted separately usually means replotting the data or considering the relative axis scales carefully. The cumulative distribution plots in assembly-stats address the problem of scaling by allowing any combination of assemblies to be plotted together and allowing rescaling of the axes to fit any one of the individual assemblies.

Screenshot
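To make the quantities behind a cumulative length plot concrete, here is a minimal sketch that derives the cumulative curve and the N50 length and count from a plain array of scaffold lengths. The function names `cumulativeLengths` and `n50` are illustrative only and are not part of assembly-stats.

```javascript
// Sketch only: these helpers are hypothetical, not assembly-stats functions.

// Running total of scaffold lengths, longest scaffold first.
function cumulativeLengths(lengths) {
  const sorted = [...lengths].sort((a, b) => b - a);
  let total = 0;
  return sorted.map((len) => (total += len));
}

// N50 length: length of the scaffold at which the running total first
// reaches half of the assembly span. N50 count: scaffolds needed to get there.
function n50(lengths) {
  const sorted = [...lengths].sort((a, b) => b - a);
  const span = sorted.reduce((sum, len) => sum + len, 0);
  let running = 0;
  for (let i = 0; i < sorted.length; i++) {
    running += sorted[i];
    if (running * 2 >= span) {
      return { length: sorted[i], count: i + 1 };
    }
  }
  return { length: 0, count: 0 }; // empty input
}

const exampleLengths = [50, 40, 30, 20, 10];
console.log(cumulativeLengths(exampleLengths)); // [ 50, 90, 120, 140, 150 ]
console.log(n50(exampleLengths));               // { length: 40, count: 2 }
```

The N50 count is easy to read from the cumulative curve (it is the x position where the curve crosses half the span), whereas the N50 length has to be looked up from the sorted lengths, which is why it tends to need a separate annotation.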

The circular plots have been introduced to overcome some of the shortcomings of tabular and cumulative distribution plots in a visualisation that allows rapid assessment of most common assembly metrics. The graphic is essentially scale independent, so assemblies of any size with different strengths and weaknesses produce distinct patterns that can be recognised at a glance. While side by side presentation of a pair of assemblies on consistently scaled axes allows direct comparison, the standard presentation is designed to facilitate assessment of overall assembly quality by consideration of the key features from the plot.

Screenshot

plot description

basic usage

input format

Data to be plotted must be supplied as a JSON object. As of version 1.1, data may be pre-binned to improve performance with assemblies containing potentially millions of contigs. The simplest way to generate this is using the asm2stats.pl or asm2stats.minmaxgc.pl Perl scripts in the pl folder:

perl asm2stats.pl genome_assembly.fa > output.assembly-stats.json
perl asm2stats.minmaxgc.pl genome_assembly.fa > output.assembly-stats.json
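For orientation, the following sketch shows the kind of summary values that can be derived from a FASTA file: the assembly span, the count of ACGT bases, a GC percentage and the list of scaffold lengths. It is not a port of asm2stats.pl. The function name `summariseFasta` and the returned property names are hypothetical and are not claimed to match the keys that assembly-stats expects, and expressing GC as a percentage of ACGT bases is an assumption.

```javascript
// Sketch only: hypothetical helper, not the logic or schema of asm2stats.pl.
function summariseFasta(fastaText) {
  const scaffolds = []; // one length per FASTA record
  let current = -1;     // index of the record being read
  let acgt = 0;         // count of unambiguous bases
  let gc = 0;           // count of G and C bases
  for (const rawLine of fastaText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;
    if (line.startsWith(">")) {
      scaffolds.push(0);
      current = scaffolds.length - 1;
      continue;
    }
    if (current < 0) continue; // skip any sequence before the first header
    scaffolds[current] += line.length;
    for (const base of line.toUpperCase()) {
      if (base === "G" || base === "C") {
        gc++;
        acgt++;
      } else if (base === "A" || base === "T") {
        acgt++;
      }
    }
  }
  const span = scaffolds.reduce((sum, len) => sum + len, 0);
  return {
    span,
    acgt,
    gcPercent: acgt > 0 ? (100 * gc) / acgt : 0, // assumption: GC as % of ACGT
    scaffoldLengths: scaffolds.sort((a, b) => b - a),
  };
}

const fasta = ">s1\nACGTNN\nGGCC\n>s2\nAATT\n";
console.log(summariseFasta(fasta));
// { span: 14, acgt: 12, gcPercent: 50, scaffoldLengths: [ 10, 4 ] }
```

For real assemblies the Perl scripts above remain the recommended route, since they produce the full pre-processed format.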

The pre-binned input format should be preferred, as it improves performance and corrects for a bug in the javascript binning code by adjusting bin size to accommodate assembly spans that are not divisible by 1000. However, the previous input format (with a full list of scaffold lengths) is still supported.
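As an illustration of what pre-binning can look like, the sketch below reduces an arbitrarily long list of scaffold lengths to a fixed number of bins. For each bin it records the scaffold length and scaffold count at which the cumulative length reaches that fraction of the span. This is one plausible scheme written for this README, not necessarily the one used by asm2stats.pl; `binScaffolds` and its output property names are hypothetical. Thresholds are computed as fractions of the span, so the span does not need to be divisible by the number of bins.

```javascript
// Sketch only: hypothetical binning scheme, not the asm2stats.pl algorithm.
function binScaffolds(lengths, binCount = 1000) {
  const sorted = [...lengths].sort((a, b) => b - a); // longest first
  const span = sorted.reduce((sum, len) => sum + len, 0);
  const binnedLengths = [];
  const binnedCounts = [];
  let index = 0;
  let cumulative = sorted.length > 0 ? sorted[0] : 0;
  for (let bin = 1; bin <= binCount; bin++) {
    // Fractional threshold: works for any span, divisible by binCount or not.
    const threshold = (span * bin) / binCount;
    while (cumulative < threshold && index < sorted.length - 1) {
      index++;
      cumulative += sorted[index];
    }
    binnedLengths.push(sorted.length > 0 ? sorted[index] : 0);
    binnedCounts.push(sorted.length > 0 ? index + 1 : 0);
  }
  return { span, binnedLengths, binnedCounts };
}

// Ten bins keep the example readable; the README describes 1000.
const binned = binScaffolds([50, 40, 30, 20, 10], 10);
console.log(binned.binnedLengths); // [ 50, 50, 50, 40, 40, 40, 30, 30, 20, 10 ]
console.log(binned.binnedCounts);  // [ 1, 1, 1, 2, 2, 2, 3, 3, 4, 5 ]
```

With this layout the output size depends only on the number of bins, not on the number of contigs, which is where the performance gain for very fragmented assemblies comes from.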

usage

The simplest plot requires a target div, an assembly span, a count of ACGT bases, the GC percentage and an array of scaffold lengths. However, it is best to use the asm2stats.pl/asm2stats.minmaxgc.pl Perl scripts described above to generate a richer, pre-processed input format. See the Danaus_plexippus_v3.assembly-stats.json file for a complete example using pre-binned data. Basic usage is detailed below:

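<div id="assembly_stats"></div>
<script>
  // Load the stats JSON, then draw the plot into the target div
  d3.json("Danaus_plexippus_v3.assembly-stats.json", function(error, json) {
    if (error) return console.warn(error);
    // undeclared assignment makes asm a global variable
    asm = new Assembly(json);
    asm.drawPlot('assembly_stats');
  });
</script>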

If called using javascript in a custom html file as above, the file can have any name, but for use with the example assembly-stats.html file, the json filename should match the pattern <assembly-name>.assembly-stats.json. This needs to be hosted as a webpage in order to run. If you would rather run this using github pages than set up a local webserver, follow the instructions by @ammaraziz in this fork.

Alternatively, use python http.server, as suggested by @hung-th: execute the command python -m http.server 8080 in the assembly-stats directory, then visit http://0.0.0.0:8080/assembly-stats.html?path=json/&assembly=output&view=circle&altView=cumulative&altView=table in a web browser (assuming the json file is named output.assembly-stats.json).

The json object contains the following keys:

Additional data will be plotted if added to the stats object, including:

While the plots were conceived as scale independent visualisations, there are occasions when it is useful to compare assemblies on the same radial (longest scaffold) or circumferential (assembly span) scales. These scales may be modified on the plot by clicking the grey boxes under the scale heading. Plots can also be drawn with a specific scale by supplying additional arguments to drawPlot().

For example, to scale the radius to 10 Mb and the circumference to 400 Mb (values smaller than the default will be ignored):

  asm.drawPlot('assembly_stats',10000000,400000000);
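
When several assemblies are to be drawn on matching scales, the two values can be chosen as the largest longest-scaffold length and the largest span across the set, so that no plot receives a value smaller than its own default. The sketch below assumes that each assembly has already been summarised into an object with `span` and `longest` properties. Those property names, the helper `sharedScales` and the example numbers are all made up for illustration.

```javascript
// Sketch only: hypothetical helper and property names, illustrative numbers.
function sharedScales(summaries) {
  return summaries.reduce(
    (scales, s) => ({
      radius: Math.max(scales.radius, s.longest),         // radial scale
      circumference: Math.max(scales.circumference, s.span), // circumferential scale
    }),
    { radius: 0, circumference: 0 }
  );
}

const scales = sharedScales([
  { span: 248000000, longest: 6200000 },
  { span: 400000000, longest: 10000000 },
]);
console.log(scales); // { radius: 10000000, circumference: 400000000 }

// Each Assembly instance could then be drawn with the same pair of values:
//   asm.drawPlot('assembly_stats', scales.radius, scales.circumference);
```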

It is also possible to programmatically toggle the visibility of plot features by passing an array of classnames to toggleVisible():

  asm.toggleVisible(['asm-longest_pie','asm-count']);