Panagram: Interactive, alignment-free pan-genome browser

Katie Jenike, Sam Kovaka, Shujun Ou, Stephen Hwang, Srividya Ramakrishnan, Ben Langmead, Zach Lippman, Michael Schatz

Please note: installation instructions and pre-processing scripts are a work in progress.

Installation

git clone --recursive https://github.com/kjenike/panagram.git
cd panagram
pip install .

The --recursive option is required to install the KMC dependency. If you forget to include it, you can update the repository with the command git submodule update --init.

Installation may fail if pip or setuptools is out of date. To update both, run:

pip install --upgrade pip
pip install --upgrade setuptools

Dependencies

Requires python version >=3.7, pip, samtools, and tabix. All other dependencies should be automatically installed via pip.
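
A quick pre-flight check can catch a missing tool before installation. The sketch below is not part of Panagram: the helper names `python_ok` and `missing_tools` are made up for illustration, and it only checks that each executable is on PATH, not which version is installed.

```python
import shutil
import sys

# External tools named in the requirements above.
REQUIRED_TOOLS = ("pip", "samtools", "tabix")


def python_ok(min_version=(3, 7)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= tuple(min_version)


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    if not python_ok():
        print("Python 3.7 or newer is required")
    for tool in missing_tools():
        print(f"missing from PATH: {tool}")
```

Running the script prints one line per missing tool and nothing if everything was found.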

Panagram relies on KMC to build its kmer index. KMC should be installed automatically, but the KMC installation can fail even when panagram itself installs successfully. In that case panagram view will still run, but panagram index will return an error. You may be able to debug the KMC installation by running make -C KMC py_kmc_api, fixing any errors it reports, and then re-running pip install -v .

Running

Panagram runs in two steps: pre-processing (the index command) and viewing (the view command).

Preprocessing

Usage: Anchor KMC bitvectors to reference FASTA files to create pan-kmer bitmap

usage: panagram index [-h] <config.toml>

See the example config.toml file below for details on the layout. The config must include paths to all of the FASTA files and, optionally, any annotations in GFF format.

Panagram may fail to index datasets with more than 32 genomes. This is not a fundamental limitation, and we are working on fixing it.

Currently genome IDs should only contain alphanumeric characters and underscores due to KMC requirements.
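
Both constraints above can be checked before indexing. This is a minimal sketch, not Panagram's own validation: the function name `check_genome_ids` is hypothetical, and it assumes "alphanumeric" means the ASCII letters and digits.

```python
import re

# Assumption: "alphanumeric characters and underscores" means ASCII [A-Za-z0-9_].
GENOME_ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")
# Indexing may fail above this many genomes (see the note above).
GENOME_COUNT_WARNING = 32


def check_genome_ids(genome_ids):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for genome_id in genome_ids:
        if not GENOME_ID_PATTERN.fullmatch(genome_id):
            problems.append(f"invalid genome ID: {genome_id!r}")
    if len(genome_ids) > GENOME_COUNT_WARNING:
        problems.append(
            f"{len(genome_ids)} genomes listed; indexing may fail above {GENOME_COUNT_WARNING}"
        )
    return problems
```

For example, `check_genome_ids(["ecoli", "ecoli_k12"])` returns an empty list, while an ID such as `E.coli-O157` is reported because of the dot and the hyphen.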

View

Usage: Display panagram viewer

usage: panagram view [-h] <index_dir/> [genome] [chrom] [start] [end]
  index_dir           Panagram index directory
  genome              Initial anchor genome (optional) 
  chrom               Initial chromosome (optional) 
  start               Initial start coordinate (optional) 
  end                 Initial end coordinate (optional)
  --ndebug            Run server in production mode (important for a public-
                      facing server)
  --port str          Server port (default: 8050)
  --host str          Server address (default: 127.0.0.1)
  --url_base str      A local URL prefix to use app-wide (passed to
                      Dash.dash(url_base_pathname=...)) (default: /)
  --bookmarks str     Bed file with bookmarked regions (default: None)

This runs a local Dash server. By default, the browser can be viewed at http://127.0.0.1:8050/.

Bitdump

Usage:

usage: panagram bitdump [-h] [-v bool] index_dir coords step
  Query pan-kmer bitmap generated by "panagram index"

  index_dir             Panagram index directory
  coords                Coordinates to query in chr:start-end format
  step                  Spacing between output kmers (optimized for multiples
                        of 100) (default: 1)
  -v bool, --verbose bool
                        Output the full bitmap (default: False)
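
When scripting bitdump queries, it can help to validate the coords string before calling the tool. The sketch below assumes only the chr:start-end format and argument order from the usage text; the helper names are hypothetical, and the chromosome name in the usage note is just an illustration.

```python
import re

# chr:start-end; the chromosome name may itself contain ':' so it is matched greedily.
COORDS_PATTERN = re.compile(r"(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)")


def parse_coords(coords):
    """Split 'chr:start-end' into (chrom, start, end) with integer positions."""
    match = COORDS_PATTERN.fullmatch(coords)
    if match is None:
        raise ValueError(f"expected chr:start-end, got {coords!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return match["chrom"], start, end


def bitdump_command(index_dir, chrom, start, end, step=1):
    """Build the argument list for `panagram bitdump` following the usage text."""
    return ["panagram", "bitdump", index_dir, f"{chrom}:{start}-{end}", str(step)]
```

The list returned by `bitdump_command(".", "NC_000913.3", 1000, 2000, step=100)` can be passed to `subprocess.run` once an index exists in the current directory.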

Example run

First download the example_data.zip bacterial data from: http://data.schatz-lab.org/panagram/

Unzip the archive and you will find 5 bacterial genomes plus their annotations:

unzip example_data.zip

To run, first index the genomes:

cd example_data
panagram index conf.toml

Annotations are supported in GFF format, and it is important that any GFF files are correctly formatted. If you run into any problems, we strongly suggest that you first check the annotation format. This can be done with command line tools like gff3validator or online here: https://genometools.org/cgi-bin/gff3validator.cgi
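
For a first pass before reaching for a full validator, a few basic GFF rules can be checked locally: feature lines have nine tab-separated columns, and columns four and five are integer start and end positions with start no greater than end. The sketch below covers only those rules and is not a replacement for gff3validator; the function name is hypothetical.

```python
def check_gff_lines(lines):
    """Return (line_number, message) tuples for lines that break basic GFF rules."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(
                (number, f"expected 9 tab-separated columns, found {len(fields)}")
            )
            continue
        try:
            start, end = int(fields[3]), int(fields[4])
        except ValueError:
            problems.append((number, "start and end must be integers"))
            continue
        if start > end:
            problems.append((number, "start is greater than end"))
    return problems
```

A file can be checked with `check_gff_lines(open(path))`. A common failure this catches is a GFF whose tabs were converted to spaces.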

Then you can run panagram to visualize the results (from the example_data directory):

panagram view .

From there, you can view the results in your web browser at http://127.0.0.1:8050/

Hosting and Proxies

Panagram uses Dash to serve the plotly visualizations. By default the dedicated webserver runs on localhost (127.0.0.1) on port 8050, but you can reverse proxy to a different port and path using a web server such as nginx.

For nginx, first add the following to your nginx configuration file (be very careful with the use of the slash ('/') character):

    location /panagram {
      proxy_pass http://127.0.0.1:8050;
    }

Then restart nginx with:

systemctl stop nginx
systemctl start nginx

For a secure public-facing server, be sure to run with the option panagram view --ndebug to disable debug mode.

You may also wish to change the base URL path with the --url_base option, for example to something like --url_base /panagram/. The port and host name can be specified by the --port and --host options.
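
Slash handling is an easy place to slip, since Dash generally expects a URL base pathname to begin and end with a slash. The sketch below normalizes a prefix and assembles a view command from the options listed in the usage text. The helper names are hypothetical, and the resulting list is only built, not executed.

```python
def normalize_url_base(prefix):
    """Return `prefix` with exactly one leading and one trailing slash."""
    stripped = prefix.strip("/")
    return f"/{stripped}/" if stripped else "/"


def view_command(index_dir, url_base="/", host="127.0.0.1", port=8050, production=True):
    """Build a `panagram view` argument list from the documented options."""
    args = ["panagram", "view"]
    if production:
        args.append("--ndebug")  # disable debug mode for public-facing servers
    args += ["--host", host, "--port", str(port)]
    args += ["--url_base", normalize_url_base(url_base)]
    args.append(index_dir)
    return args
```

With this helper, `normalize_url_base("panagram")` and `normalize_url_base("/panagram/")` both give `/panagram/`, so the value passed to --url_base is the same however the prefix was typed.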

Finally you will need to run panagram using panagram view <dir>. You will probably want to run this in a loop in case it needs to be restarted, such as:

until panagram view --ndebug .; do echo "restarting"; sleep 1; done
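
The same restart idea can be written in Python with an upper bound on the number of restarts, so that a persistent failure (for example, a missing index directory) does not loop forever. This is a sketch under that assumption, not something Panagram provides; `run_until_success` is a hypothetical helper.

```python
import subprocess
import time


def run_until_success(cmd, max_restarts=5, delay=1.0):
    """Re-run `cmd` until it exits with status 0; return the number of restarts.

    Raises RuntimeError if the command still fails after `max_restarts` restarts.
    """
    for restarts in range(max_restarts + 1):
        if subprocess.call(cmd) == 0:
            return restarts
        if restarts < max_restarts:
            print("restarting")
            time.sleep(delay)
    raise RuntimeError(f"{cmd[0]} kept failing after {max_restarts} restarts")
```

Like the shell loop, `run_until_success(["panagram", "view", "--ndebug", "."])` restarts the server whenever it exits with a non-zero status.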

We will optimize this process in future releases.

Example config.toml file

k = 12
prefix = "."
processes = 5

lowres_step = 100
chr_bin_kbp = 200

gff_anno_types = ["exon", "CDS"]

[kmc]
memory = 10
processes = 5
threads = 4

[fasta]
ecoli = "FASTAS/ecoli_GCF_001612495.1_ASM161249v1_genomic.fna"
ecoli_k12 = "FASTAS/ecoli_k12_GCF_000005845.2_ASM584v2_genomic.fna"
klebsiella = "FASTAS/klebsiella_GCF_000240185.1_ASM24018v2_genomic.fna"
salmonella = "FASTAS/salmonella_GCF_016117835.1_ASM1611783v1_genomic.fna"
shigella = "FASTAS/shigella_GCF_000006925.2_ASM692v2_genomic.fna"

[gff]
ecoli = "gffs/ecoli_GCF_001612495.1_ASM161249v1_genomic.gff"
ecoli_k12 = "gffs/ecoli_k12_GCF_000005845.2_ASM584v2_genomic.gff"
klebsiella = "gffs/klebsiella_GCF_000240185.1_ASM24018v2_genomic.gff"
salmonella = "gffs/salmonella_GCF_016117835.1_ASM1611783v1_genomic.gff"
shigella = "gffs/shigella_GCF_000006925.2_ASM692v2_genomic.gff"
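
A config like the one above can be sanity-checked before indexing. The sketch below assumes that every key under [gff] is meant to match a key under [fasta], as in the example; Panagram itself may not require this. Both function names are hypothetical. `tomllib` is part of the standard library from Python 3.11; on older versions the third-party `tomli` package offers the same `load` function.

```python
def check_config(config):
    """Return a list of problems found in a parsed Panagram-style config dict."""
    problems = []
    fasta = config.get("fasta", {})
    gff = config.get("gff", {})
    if not fasta:
        problems.append("no genomes listed under [fasta]")
    # Assumption: annotation keys are meant to match genome keys, as in the example.
    for genome_id in gff:
        if genome_id not in fasta:
            problems.append(f"[gff] entry {genome_id!r} has no matching [fasta] entry")
    return problems


def load_config(path):
    """Parse a TOML file into a dict."""
    try:
        import tomllib  # standard library from Python 3.11
    except ImportError:
        import tomli as tomllib  # third-party backport, installed separately
    with open(path, "rb") as handle:
        return tomllib.load(handle)
```

Calling `check_config(load_config("conf.toml"))` returns an empty list when every annotation has a matching genome.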

Using Snakemake (dev branch)

The dev branch, which is under active development, currently uses Snakemake. It is straightforward to use: you just need a TSV file listing the samples and their corresponding FASTA files.

Example tsv file:

name    fasta    gff    id    anchor
ecoli    FASTAS/ecoli_GCF_001612495.1_ASM161249v1_genomic.fna    ANNO/ecoli_GCF_001612495.1_ASM161249v1_genomic.gff    0    True
ecoli_k12    FASTAS/ecoli_k12_GCF_000005845.2_ASM584v2_genomic.fna    ANNO/ecoli_k12_GCF_000005845.2_ASM584v2_genomic.gff    1    True
klebsiella    FASTAS/klebsiella_GCF_000240185.1_ASM24018v2_genomic.fna    ANNO/klebsiella_GCF_000240185.1_ASM24018v2_genomic.gff    2    True
salmonella    FASTAS/salmonella_GCF_016117835.1_ASM1611783v1_genomic.fna    ANNO/salmonella_GCF_016117835.1_ASM1611783v1_genomic.gff    3    True
shigella    FASTAS/shigella_GCF_000006925.2_ASM692v2_genomic.fna    ANNO/shigella_GCF_000006925.2_ASM692v2_genomic.gff    4    True
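
Because a dropped tab silently shifts columns, it is worth checking the sample sheet before starting the workflow. The sketch below assumes five tab-separated columns named name, fasta, gff, id and anchor, as in the header above; the function name is hypothetical, and the dev-branch workflow may accept other layouts.

```python
import csv
import io

EXPECTED_COLUMNS = ["name", "fasta", "gff", "id", "anchor"]


def check_samples_tsv(handle):
    """Return (line_number, message) tuples for rows that do not fit the header."""
    reader = csv.reader(handle, delimiter="\t")
    problems = []
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        problems.append((1, f"unexpected header: {header}"))
    for number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(
                (number, f"expected {len(EXPECTED_COLUMNS)} columns, found {len(row)}")
            )
    return problems


if __name__ == "__main__":
    # The second data row is missing its id column.
    sample = io.StringIO(
        "name\tfasta\tgff\tid\tanchor\n"
        "ecoli\ta.fna\ta.gff\t0\tTrue\n"
        "shigella\ts.fna\ts.gff\tTrue\n"
    )
    print(check_samples_tsv(sample))
```

For a real file, pass an open handle, for example `check_samples_tsv(open("samples.tsv", newline=""))`, where samples.tsv is a placeholder name.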

Known issues

Right now, there is a bug (issue #7) when indexing very large genomes with very large chromosomes. We are actively working to fix this.
Indexing sometimes fails when working with more than 32 genomes.
Mash dendrogram leaf placement is not always perfect.
Installing on a Mac can be tricky. We will need to include a more detailed list of dependencies.

Ideas for improvement

Add a row for gene coverage (rather than just gene density) for the third tab.
Update the step size in the control panel.
Add the actual sequence.
