Pannagram

Pannagram is a package for constructing pan-genome alignments, analyzing structural variants, and translating annotations between genomes. Additionally, Pannagram contains useful functions for visualization. The manual is available in the examples folder.

Recreating the working environment

Linux users

Make sure you have Conda or Mamba installed. To create and activate the package environment run:

conda env create -f pannagram.yaml
conda activate pannagram
# OR
mamba env create -f pannagram.yaml
mamba activate pannagram

Creating the environment installs the required version of the R interpreter and all needed libraries and tools, including BLAST, MAFFT, and others.
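After activating the environment, it can be useful to confirm that the expected tools are actually on the PATH. The sketch below is only an illustration: missing_tools is a hypothetical helper (not part of Pannagram), and the executable names Rscript, blastn, and mafft are assumptions about how R, BLAST, and MAFFT are exposed by the environment.

```shell
#!/bin/sh
# Print the name of every requested tool that cannot be found on the PATH.
# Hypothetical helper for illustration; not part of Pannagram.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names for R, BLAST, and MAFFT.
missing=$(missing_tools Rscript blastn mafft)
if [ -n "$missing" ]; then
    echo "Not found on PATH (is the environment activated?): $missing" >&2
fi
```

If nothing is printed, all listed executables were found.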

macOS users

In addition to the steps above, macOS users should run:

brew install coreutils

This makes sure that all the needed shell commands are installed.

Windows users

Windows users can try running the code from this repository under WSL, since Bash and the / path separator are used extensively in the code. However, it has never been tested in that environment, so good luck.

1. Pangenome linear alignment

1.1 Building the alignment

Pangenome alignment can be built in two modes:

reference-free:

./pannagram.sh -path_in '<genome files directory path>' \
-path_out '<output files path>' \
-cores 8

reference-based:

./pannagram.sh -ref '<reference genome name>' \
-path_in '<genome files directory path>' \
-path_out '<output files path>' \
-cores 8

quick look: if no information about the genomes and their corresponding chromosomes is available, one can run the preparation steps:

./pannagram.sh -ref '<reference genome name>' \
-path_in '<genome files directory path>' \
-path_out '<output files path>' \
-cores 8 -pre
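Before starting a run, it may help to check what the input directory contains. The following is a minimal sketch under two assumptions that are not confirmed here: that genome files use the extensions .fasta, .fa, or .fna, and that a genome name corresponds to the file name without its extension. list_genomes is a hypothetical helper, not part of Pannagram.

```shell
#!/bin/sh
# List candidate genome names found in a directory.
# ASSUMPTIONS: FASTA files end in .fasta, .fa, or .fna, and the genome name
# is the file name without its last extension. Hypothetical helper.
list_genomes() {
    dir=$1
    for f in "$dir"/*.fasta "$dir"/*.fa "$dir"/*.fna; do
        [ -f "$f" ] || continue      # skip patterns that matched nothing
        name=${f##*/}                # strip the directory part
        printf '%s\n' "${name%.*}"   # strip the last extension
    done
}

# Example (path is a placeholder):
#   list_genomes '<genome files directory path>'
```

An empty result suggests that the directory path or the file extensions differ from what this sketch assumes.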

An extended description of the parameters for all three scripts is available by running each script with the -help flag.

1.2 Extract information from the pangenome alignment

Synteny blocks, SNPs, and sequence consensus (for the IGV browser) can be extracted from the alignment:

# -blocks : find synteny block information for visualisation
# -seq    : create the consensus sequence of the pangenome
# -snp    : SNP calling
./analys.sh -path_msa '<output path with consensus>' \
      -path_chr '<path with chromosomes>' \
      -blocks \
      -seq \
      -snp

1.3 Calling structural variants

When the pangenome linear alignment is built, SVs can be called using the following script:

# -sv_call  : create output .gff and .fasta files with SVs
# -sv_sim   : compare SVs with a set of sequences (e.g., TEs)
# -sv_graph : construct the graph of SVs
./analys.sh -path_msa '<output path with consensus>' \
      -sv_call \
      -sv_sim te.fasta \
      -sv_graph
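In the shell, a comment cannot follow a line-continuation backslash, so per-flag notes have to live elsewhere. One option, sketched below, is to build the argument list one flag at a time. Only flags shown in this README are used; sv_command and the DRY_RUN switch are hypothetical names introduced for illustration, not part of Pannagram.

```shell
#!/bin/sh
# Build the analys.sh call flag by flag so that each flag can carry a comment.
# sv_command and DRY_RUN are hypothetical, for illustration only.
sv_command() {
    path_msa=$1    # output path with consensus
    te_fasta=$2    # set of sequences to compare SVs with (e.g., TEs)

    set -- ./analys.sh -path_msa "$path_msa"
    set -- "$@" -sv_call              # create output .gff and .fasta files with SVs
    set -- "$@" -sv_sim "$te_fasta"   # compare with a set of sequences
    set -- "$@" -sv_graph             # construct the graph of SVs

    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"            # only print the command
    else
        "$@"                          # run the command
    fi
}
```

With DRY_RUN=1 set, calling sv_command out te.fasta prints the assembled command instead of running it, which makes it easy to check before a long run.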

2. Visualisation

Pannagram contains a number of useful methods for visualization in R.

2.1 Visualisation of the pangenome alignment

All genomes together:

A dotplot for a pair of genomes:

2.2 Graph of Nestedness on Structural variants

Every node is an SV:

Every node is a unique sequence; node size reflects how often this sequence occurs in SVs:

2.3 Nucleotide plot for a fragment of the alignment

In the ACTG-mode:

# --- Quick start code ---
source('utils/utils.R')             # Functions to work with sequences
source('visualisation/msaplot.R')   # Visualisation
aln.seq = readFastaMy('aln.fasta')  # Vector of strings
aln.mx = aln2mx(aln.seq)            # Transform into a matrix
msaplot(aln.mx)                     # ggplot object

In the Polymorphism mode:

# --- Quick start code ---
msadiff(aln.mx)                     # ggplot object

2.4 Dotplots of Sequences

Simultaneously on forward (dark color) and reverse complement (pink color) strands:

# --- Quick start code ---
source('utils/utils.R')             # Functions to work with sequences
source('visualisation/dotplot.R')   # Visualisation
s = sample(c("A","C","G","T"), 100, replace = TRUE)  # Random nucleotide vector
dotplot(s, s, 15, 9)                # ggplot object

2.5 ORF-finder and visualisation

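# --- Quick start code ---
source('utils/utils.R')             # Functions to work with sequences
source('visualisation/orfplot.R')   # Visualisation
s = sample(c("A","C","G","T"), 100, replace = TRUE)  # Same random vector as in 2.4
seq.str = nt2seq(s)                 # Named seq.str to avoid masking base R's str()
orfs = orfFinder(seq.str)
orfplot(orfs$pos)                   # ggplot object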

3. Additional useful tools

3.1 Search for similar sequences

...on the genome

The first approach searches against entire genomes or individual chromosomes. A quick-start toy example:

./simsearch.sh -in_seq genes.fasta -on_genome genome.fasta -out out.txt

The result is a GFF file with hits matching the similarity threshold.
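Since the output is a GFF file, standard text tools can summarize it. The sketch below counts hits per sequence ID and relies only on the general GFF layout (nine tab-separated columns, sequence ID in column 1, comment lines starting with #). gff_hits_per_seq is a hypothetical helper, and nothing here assumes anything about the attributes that simsearch.sh writes.

```shell
#!/bin/sh
# Count GFF records per sequence ID (column 1), sorted by ID.
# Hypothetical helper for illustration; not part of Pannagram.
gff_hits_per_seq() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }    # skip comments and incomplete lines
        { hits[$1]++ }
        END { for (id in hits) printf "%s\t%d\n", id, hits[id] }
    ' "$1" | LC_ALL=C sort
}

# Example, using the output file name from the command above:
#   gff_hits_per_seq out.txt
```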

...on another set

The second approach instead searches for similarities against another set of sequences. A quick-start toy example:

./simsearch.sh -in_seq genes.fasta -on_seq genome.fasta -out out.txt

The result is a table saved in R's RDS format. This table shows the coverage of one sequence by another and includes a flag column that indicates whether the pair meets the similarity threshold. In addition, this mode takes the strand of the coverage into account: it determines not only whether a sequence is covered, but also whether it is covered in a specific orientation.
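To illustrate what strand-aware coverage means, the sketch below computes the fraction of a query covered by hits on each strand separately, merging overlapping hits. It is not the implementation used by simsearch.sh. The input format is an assumption: four whitespace-separated columns (qstart, qend, sstart, send) as found in BLAST tabular output, where a minus-strand hit has sstart greater than send. strand_coverage is a hypothetical helper.

```shell
#!/bin/sh
# Per-strand coverage of a query of length $1.
# stdin: lines "qstart qend sstart send" (1-based, inclusive); ASSUMED format.
# Hypothetical helper for illustration; not how simsearch.sh is implemented.
strand_coverage() {
    qlen=$1
    awk '{
        strand = ($3 + 0 <= $4 + 0) ? "+" : "-"   # minus strand: sstart > send
        a = $1 + 0; b = $2 + 0
        if (a > b) { t = a; a = b; b = t }        # normalize query interval
        print strand, a, b
    }' |
    LC_ALL=C sort -k1,1 -k2,2n |
    awk -v qlen="$qlen" '
        function report() {
            if (strand != "") printf "%s %.2f\n", strand, covered / qlen
        }
        $1 != strand { report(); strand = $1; covered = 0; end = 0 }
        {
            start = ($2 > end) ? $2 : end + 1     # count only uncovered positions
            if ($3 >= start) { covered += $3 - start + 1; end = $3 }
        }
        END { report() }'
}
```

A sequence that looks fully covered when both strands are pooled may be only partly covered on each strand, which is why the orientation is tracked separately.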

Acknowledgements

Development:

Anna Igolkina - Lead Developer and Project Initiator
Alexander Bezlepsky - Assistant

Testing:

Anna Igolkina: Lead Tester
Anna Glushkevich: Testing the alignment on A. lyrata genomes
Elizaveta Grigoreva: Testing the alignment on A. thaliana and A. lyrata genomes
Jilong Ma: Testing the SV-graph on spider genomes
Alexander Bezlepsky: Testing Pannagram's functionality on Rhizobial genomes
Gregoire Bohl-Viallefond: Testing the annotation converter on A. thaliana alignment

Resources:

The logo was generated with the help of DALL-E
Parallel Processing Tool: O. Tange (2018): GNU Parallel 2018, ISBN 9781387509881, DOI https://doi.org/10.5281/zenodo.1146014.
