slimsuite / chromsyn

Chromosome-level synteny plotting using orthologous regions
GNU General Public License v3.0
28 stars 5 forks source link

ChromSyn: Chromosome-level synteny plotting using orthologous regions

ChromSyn is designed to compile a set of BUSCO runs with the same version and lineage into chromosome synteny plots. This is achieved by establishing blocks of synteny based on co-linear regions that share an identifier. Whilst ChromSyn is designed with BUSCO in mind, it is therefore fairly simple to use alternative sources of synteny. Future releases will expand options for replacing BUSCO, so please get in touch if this would be useful to you.

Version

The current version should be v1.6.1. (Check the chromsyn.R file to be sure!)

Citation

If you use ChromSyn in your work, please cite the bioRxiv paper describing the Myrtle Rust genome:

Dependencies

You will need to have R with tidyverse installed along with the RColorBrewer and gtools packages. The writexl package is also needed for Excel output.

Quick start

NOTE: A slightly more detailed description (and newer version of the output) is available through the ChromSyn Walkthrough. Better documentation is in development, so please ask questions if anything is unclear!

The simplest way to use ChromSyn is using BUSCO predictions. Two inputs are required:

These are connected via the genome name for each assembly, and have contents in the simple form genome file with one line per genome:

GENOME1 FILENAME1
GENOME2 FILENAME2
...
GENOMEn FILENAMEn

One easy way to generate this input:

Step 1. Compile BUSCO results into a single directory and name them $GENOME.busco5.tsv.

Step 2. Name each input $GENOME.fasta and filter to just the chromosomes. (Can be done before or after running BUSCO but don't change sequence names if done afterwards.)

Step 3. Run Telociraptor to build a sequences table per genome and copy the *.telomeres.tdt and *.gaps.tdt files into the directory:

python $TELOCIRAPTOR/code/telociraptor.py -seqin $FASTA -basefile ${FASTA/.fasta/} i=-1 tweak=F telonull=T

Version 0.6.0 introduced a new optional input of TIDK search output:

Version 0.7.0 introduced a new optional input of assembly gaps (SeqSuite output) and features of interest:

If the features have one of the open shapes (21-25), they will have a black border and use the Col value for the fill.

Version 1.2.0 introduced another optional input of predicted copy number output from DepthKopy, which can be used to colour Duplicated BUSCO genes and loaded features according to their predicted copy number: cndata=TSV. This should be a delimited file with fields: Genome, SeqName, Start, End, CN (Other fields OK). If Genome is not provided, SeqName will map onto any genome. Start and End will be converted into Pos (mean position) prior to mapping, which is used for the actual mapping. (Pos can be provided in place of Start and End.)

Version 1.5.0 has updated the options for using non-BUSCO synteny data. Specifically, ChromSyn has been optimised to use output from SynBad, including custom formatting of assembly gaps based on their synteny and/or spanning read support. "Good" gaps can be optionally hidden with qcmode=T. If using for assembly QC, a new ybleed setting can be used to extend synteny blocks into chromosomes. This works best for a single pair of assemblies with ybleed=1, but if you want a clearer sense of how synteny blocks line up, ybleed=0.5 will extend them to the midline of each chromosome.

Step 4. Make the FOFN files, e.g.:

for TSV in *.busco5.tsv; do
  echo ${TSV/.busco5.tsv/} $TSV >> busco.fofn
  echo ${TSV/.busco5.tsv/} ${TSV/.busco5.tsv/.telomeres.tdt} >> sequences.fofn
  echo ${TSV/.busco5.tsv/} ${TSV/.busco5.tsv/.gaps.tdt} >> gaps.fofn
done

Step 5. Edit sequences.fofn if required to set the vertical ordering of the species. (By default, the vertical ordering will match this file.)

Step 6. Run the chromsyn script:

Rscript $PATH/chromsyn.R basefile=$OUTPUTPREFIX | tee $OUTPREFIX.runlog

Version 1.5.0 introduced a dynamic setting of the minregion length setting (default 50 kb) to avoid breaking the plotting function with thousands of tiny synteny blocks. Setting maxregions=INT will set the maximum number of independent regions after merging of adjacent synteny blocks. (Set maxskip=-1 to keep all synteny blocks separate.)

Step 7. Open up $OUTPREFIX.pdf to see the plot.

Outputs

The main output is one or more chromosome synteny plots and (if writexl is installed) an Excel file containing the established synteny.

Example plot of DNA Zoo marsupials. Each row is a different genome assembly. Each rectangle is an assembly scaffold. Reversed scaffolds have an R suffix. Default units are Mb, with tick marks every 10 Mb. Black and blue dots indicate Diploidocus and 3' TIDK telomere predictions. Other symbols mark annotated features, which are rRNA gene predictions in this example. Vertical placement indicates strand.

NOTE: ChromSyn will try to generate a single synteny plot. However, where there are many synteny blocks and rearrangements, system resources for R are sometimes exceeded. In this case, multiple plots will be produced that each contain a subset of the pairwise synteny comparisons. These can then be manually combined into a single figure.

NOTE: if provided, assembly gaps will be plotted as dark red plus signs.

Options

A more detailed description of options and use cases will be added in time. The following options can be provided in the form argument=value to alter inputs and/or outputs:

# : sequences=FOFN = File of PREFIX FILE with sequence names and lengths (name & length, or SeqName & SeqLen fields) [sequences.fofn]
# : busco=FOFN = File of PREFIX FILE with full BUSCO table results. Used to identify orthologous regions. [busco.fofn]
# : tidk=FOFN = Optional file of PREFIX FILE with TIDK search results. [tidk.fofn]
# : gaps=FOFN = Optional file of PREFIX FILE with TIDK search results. [gaps.fofn]
# : ft=FOFN = Optional file of PREFIX FILE with TIDK search results. [ft.fofn]
# : regdata=TSV = File of Genome, HitGenome, SeqName, Start, End, Strand, Hit, HitStart, HitEnd
# : regmirror=T/F = Whether to mirror the regdata input [False]
# : cndata=TSV = File of Genome, SeqName, Start, End, CN (Other fields OK)
# : focus=X = If given will orient all chromosomes to this assembly
# : orient=X = Mode for sequence orientation (none/focus/auto)
# : seqsort=none/focus/auto/FILE = Optional ordering strategy for other assemblies [auto]
# : seqorder=LIST = Optional ordering of the chromsomes for the focal assembly
# : restrict=LIST = List of sequence names to restrict analysis to. Will match to any genome.
# : order=LIST = File containing the Prefixes to include in vertical order. If missing will use sequences=FOFN.
# : chromfill=X = Sequences table field to use for setting the colouring of chromosomes (e.g. Genome, SeqName, Type or Col) [Genome]
# : qcmode=T/F = QC mode that will hide gaps rated as "good" by SynBad [False]
# : synbad=FILE = File of custom shapes and colours for SynBad gap ratings []
# : runpath=PATH = Run ChromSyn in this path (will look for inputs etc. here) [./]
# : basefile=FILE = Prefix for outputs [chromsyn]
# : plotdir=PATH = output path for graphics
# : minlen=INT = minimum length for a chromosome/scaffold to be included in synteny blocks/plots [0]
# : minregion=INT = minimum length for mapped regions to be included in plots [50000]
# : maxregions=INT = maximum number of mapped regions to be included in plots [0]
# : minbusco=INT = minimum number of BUSCO genes to be included in Syntenic block [1]
# : minftlen=INT = minimum number of for features to be plotted [1]
# : maxskip=0 = maximum number of BUSCO genes to skip and still be a syntenic block [0]
# : orphans=T/F = whether to include scaffolds that have no BUSCO genes [True]
# : tidkcutoff=INT = TIDK count cutoff for identifying a telomere [50]
# : align=X = alignment strategy for plotting chromosomes (left/right/centre/justify) [justify]
# : ygap=INT = vertical gap between chromosomes [4]
# : ypad=NUM = proportion of ygap to extend synteny blocks before linking [0.1]
# : ybleed=NUM = proportion of chromosome to bleed synteny block plotting into [0]
# : scale=X = units in basepairs for setting the x-axis scale (bp/kb/Mb/Gb/pc) [Mb]
# : textshift=NUM = offset for printing chromosome names [0.3]
# : ticks=INT = distance between tickmarks [5e7]
# : pdfwidth=NUM = PDF width [20]
# : pdfheight=NUM = over-ride for standard calculated PDF height [0]
# : pdfscale=NUM = over-ride for PDF output scale [1]
# : ftsize=NUM setting to control the size of telomere and feature points.
# : namesize=NUM = scaling factor for the Genome names in PDF plots [1]
# : labelsize=NUM = scaling factor for the chromosome names in PDF plots [1]
# : labels=T/F = whether to print chromosome name labels (True) or legend (False) [TRUE]
# : opacity=NUM = Opacity of synteny regions (0-1) [0.3]
# : debug=T/F = whether to switch on additional debugging outputs [FALSE]
# : dev=T/F = whether to switch on dev mode during code updates [FALSE]

Running within R(Studio)

To use the script within other R code, the commandline arguments can be over-ridden with a vector of commandline arguments named override. For example, the code that generated the example plot:

override <- c("basefile=zoomarsupials","busco=zoomarsupials.busco.fofn","sequences=zoomarsupials.sequences.fofn","tidk=zoomarsupials.tidk.fofn","ft=zoomarsupials.ft.fofn","gaps=FALSE","orphans=F","minlen=1e7","focus=Wombat")
source("chromsyn.R")

Algorithm overview

Chromosome synteny analysis is performed using single-copy “Complete” genes from BUSCO genome completeness runs. Synteny blocks between pairs of assembly are determined as collinear runs of matching BUSCO gene identifications. For each assembly pair, BUSCO genes rated as “Complete” in both genomes are ordered and oriented along each chromosome. Synteny blocks are then established as sets of BUSCO genes that are collinear and uninterrupted (i.e. sharing the same order and strand) in both genomes. A synteny block starts at the start of a BUSCO gene and will keep extending all the time the same BUSCO gene is found next in the same direction in both assemblies. It will end at the end of the last BUSCO gene in the block. (Local rearrangements and breakdowns of synteny between BUSCO genes will not be identified and may be falsely marked as syntenic within a synteny block.)

Settings to control this behaviour include:

NOTE: Setting minregion to less than zero will keep all BUSCO genes a separate synteny regions. This can be useful for individual chromosome plots, but could cause plotting problems if done for an entire genome.

Chromosome synteny is then visualised by arranging the chromosomes for each assembly in rows and plotting the synteny blocks between adjacent assemblies. In each case, the chromosome order and orientation is set for one “focus” assembly, and the remaining assemblies arranged in order to maximise the clarity of the synteny plot, propagating out from the focal genome. For each chromosome, the “best hit” in each other assembly is established as that with the maximum total length of shared synteny blocks and an anchor point established as the mean position along the best hit chromosome of those synteny blocks. Where the majority of synteny is on the opposite strand, the chromosome is reversed and given an “R” suffix in the plot. Chromosomes are then ordered according to their best hits and, within best hits, the anchor points. Synteny blocks sharing the same orientation are plotted blue, whilst inversions are plotted in red.

Providing Synteny Blocks

It is possible to run ChromSyn without BUSCO data, using the regdata=TSV option. This should be a tab-delimited file with the fields: Genome, HitGenome, SeqName, Start, End, Strand, Hit, HitStart, HitEnd. (Any extra fields are ignored.) If you have already run ChromSyn, it’s the same format as the “Regions” sheet in the *.chromsyn.xlsx output file, without the last three fields.

NOTE: This file should be unidirectional, i.e. Genome:HitGenome hits will not be seen as HitGenome:Genome hits. (This is do with the way that ChromSyn establishes the ordering of sequences.) By default (as of v1.5.0), regmirror=TRUE, which will reciprocate all hits and remove redundancy to fill in the blanks.