symgenoevolab / SyntenyFinder

SyntenyFinder contains a collection of scripts to easily perform and visualize macrosynteny analysis when provided with the accessions of annotated chromosome-level genome assemblies from NCBI.

SyntenyFinder

1. Description

This repository contains a series of scripts for performing and visualising macrosynteny analysis on chromosome-level genomes of metazoan species.

SyntenyFinder.py is a Python script that is run from the command line. Given a list of NCBI genome assembly accessions, the script downloads the associated annotated genomes, runs OrthoFinder, and automatically generates the karyotype and coordinate files needed to run synteny analysis in RIdeogram. For genomes without a published annotation on NCBI, the folder SyntenyFinder_customisable contains Synteny_main.ipynb, a fully customisable notebook that accepts a wide range of annotation formats. plot_ideogram.R takes the generated files and creates macrosynteny plots.

Figure 1: Evolution of genome structure in bryozoans (Lewin et al. 2024). Figure created from SVG files generated by the SyntenyFinder pipeline and assembled in Adobe Illustrator.

File tree for SyntenyFinder:

SyntenyFinder

├── SyntenyFinder.py
├── dependencies
│   └── Synteny_functions.py
├── plot_ideogram
│   └── plot_ideogram.R
├── sample_files
│   ├── RIdeogram_output.png
│   └── sample_figure.png
└── SyntenyFinder_customisable
    ├── Synteny_main.ipynb
    ├── dependencies
    │   └── Synteny_functions.ipynb
    ├── example_run.ipynb
    ├── input_data (*)
    │   ├── gene_rows
    │   │   └── ...
    │   ├── genomes
    │   │   └── ...
    │   └── proteomes
    │       └── ...
    └── synteny_v5.5.R

Items marked with a (*) are not provided in this repository.

2. Files in this repository

3. Dependencies

Bash:

SyntenyFinder.py relies on OrthoFinder to identify single-copy orthologous genes. Please first make sure OrthoFinder and all necessary dependencies are correctly installed.

Additionally, SyntenyFinder.py uses NCBI's command-line tools datasets and dataformat to download the requested accessions and retrieve assembly information such as species and number of chromosomes. Please ensure these are installed and running correctly.
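
A quick way to confirm that these tools are reachable before starting a run is to look them up on PATH. The snippet below is a minimal sketch and is not part of this repository. The executable names orthofinder, datasets and dataformat are assumptions about how the tools are installed on your system, and the helper missing_tools is a hypothetical name.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ("orthofinder", "datasets", "dataformat")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

Finding a tool on PATH does not guarantee that it runs correctly, so it is still worth calling each tool once by hand.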

Python:

Most of the packages used by SyntenyFinder are part of the Python Standard Library, namely subprocess, os, re, zipfile, argparse, concurrent.futures, and io. These should be available by default.

Additionally, Pandas and Biopython are required. These can be installed with the following commands:

pip install pandas

pip install biopython
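
To check that both packages can be imported in the current environment, something like the following sketch may help. It is not part of this repository and the helper missing_modules is a hypothetical name. Note that Biopython is imported under the module name Bio.

```python
import importlib.util

# Import names of the two third-party packages: pandas and Biopython ("Bio").
REQUIRED_MODULES = ("pandas", "Bio")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Python packages (import names):", ", ".join(missing))
    else:
        print("pandas and Biopython are importable.")
```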

4. Usage

Documentation

Use the following command: python /path/to/SyntenyFinder.py --help

Quick start

To generate karyotype and coordinate files, run the following command in /SyntenyFinder:

python SyntenyFinder.py --accessions GCF_910592395.1,GCF_902652985.1 --run_name get_synteny

This command downloads the annotated genome assemblies GCF_910592395.1 for the nemertean Lineus longissimus and GCF_902652985.1 for the scallop Pecten maximus. It runs OrthoFinder and finds single-copy orthologues, then generates karyotype and coordinate files. The folder get_synteny is created within the working directory /SyntenyFinder to store output, as well as intermediate run files.

Full command:

python /path/to/SyntenyFinder.py \
--accessions accession1,accession2,accession3 \
--run_name run_name1 \
--algs first \
--orthofinder path/to/orthofinder \
--threads 20 \
--directory path/to/root/folder
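
If you want to launch runs from another Python script, the command can be assembled as an argument list and passed to subprocess. This is a minimal sketch under stated assumptions: build_command is a hypothetical helper that is not provided by this repository, and it only uses the --accessions, --run_name, --threads and --directory flags shown above. The --algs and --orthofinder flags are left out because their accepted values are not described here.

```python
import sys


def build_command(script, accessions, run_name, threads=None, directory=None):
    """Assemble the argument list for a SyntenyFinder.py run.

    `accessions` is a list of NCBI assembly accessions; they are joined
    with commas, matching the command-line examples above.
    """
    cmd = [
        sys.executable, str(script),
        "--accessions", ",".join(accessions),
        "--run_name", run_name,
    ]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if directory is not None:
        cmd += ["--directory", str(directory)]
    return cmd


# Example (not executed here):
#   import subprocess
#   cmd = build_command("SyntenyFinder.py",
#                       ["GCF_910592395.1", "GCF_902652985.1"],
#                       "get_synteny")
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.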

Flags

Output:

Running SyntenyFinder.py generates the following file tree:

directory (root directory provided) 
├── ncbi_downloads
│   ├── <Sp1>
│   │   └── ncbi_dataset
│   │       └── data
│   │           └── accession1
│   ├── <Sp2>
│   │   └── ncbi_dataset
│   │       └── data
│   │           └── accession2
│   └── ...
└── run_name
    ├── output
    │   ├── Sp1_coordinates.tsv
    │   ├── Sp1_karyotype.txt
    │   ├── Sp2_coordinates.tsv
    │   ├── Sp2_karyotype.txt
    │   ├── Sp3_coordinates.tsv
    │   ├── Sp3_karyotype.txt
    │   └── ...
    └── run_files
        ├── orthofinder_output
        │   └── Results_MmmDD
        │       ├── ...
        │       ├── Orthogroups
        │       └── ...
        └── run_proteomes

The files XXX_coordinates.tsv and XXX_karyotype.txt are the input files required by RIdeogram to generate a macrosynteny figure. Copy them to plot_ideogram/input and generate the plots using plot_ideogram.R.
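
Before copying, it can be useful to confirm that every species in the output folder has both files. The following is a minimal sketch, assuming the file naming shown in the tree above (a species prefix followed by _coordinates.tsv or _karyotype.txt). The function species_missing_files is a hypothetical helper and is not part of this repository.

```python
from pathlib import Path

SUFFIXES = ("_coordinates.tsv", "_karyotype.txt")


def species_missing_files(output_dir):
    """Map each species prefix to the file suffixes it is missing.

    Species that have both a coordinates and a karyotype file are omitted,
    so an empty dict means every species is complete.
    """
    found = {}
    for path in Path(output_dir).iterdir():
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                prefix = path.name[: -len(suffix)]
                found.setdefault(prefix, set()).add(suffix)
    return {
        prefix: sorted(set(SUFFIXES) - present)
        for prefix, present in found.items()
        if len(present) < len(SUFFIXES)
    }
```

For example, species_missing_files("path/to/root/folder/run_name1/output") returns an empty dict when all species are complete.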

Notes:

5. Plotting in R

The script SyntenyFinder/plot_ideogram/plot_ideogram.R takes the karyotype and coordinate files generated by SyntenyFinder and plots them using RIdeogram.

Figure 2: Raw figure generated using SyntenyFinder pipeline

To use this script, first create the directories marked with a (*) in the tree below; they hold the input and output files of plot_ideogram.R. Then copy the output files from root_directory/run_name/output to plot_ideogram/input.

SyntenyFinder
└── plot_ideogram
    ├── ideograms (*)
    │   ├── pdf (*)
    │   └── svg (*)
    ├── input (*)
    │   └── <copy output from SyntenyFinder>
    └── plot_ideogram.R

plot_ideogram.R generates SVG and PDF files, which can then be edited in Adobe Illustrator to create the final figures.
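
The directory setup and copy step described above can also be scripted. The snippet below is a minimal sketch rather than part of this repository: prepare_plot_inputs is a hypothetical helper, and it assumes the output files follow the *_coordinates.tsv and *_karyotype.txt naming shown earlier.

```python
import shutil
from pathlib import Path


def prepare_plot_inputs(output_dir, plot_dir):
    """Create the folders plot_ideogram.R expects and copy the input files.

    `output_dir` is <root_directory>/<run_name>/output and `plot_dir` is
    the plot_ideogram folder. Returns the names of the copied files.
    """
    output_dir, plot_dir = Path(output_dir), Path(plot_dir)

    # Folders marked with (*) in the tree above.
    for sub in ("ideograms/pdf", "ideograms/svg", "input"):
        (plot_dir / sub).mkdir(parents=True, exist_ok=True)

    copied = []
    for pattern in ("*_coordinates.tsv", "*_karyotype.txt"):
        for src in sorted(output_dir.glob(pattern)):
            shutil.copy2(src, plot_dir / "input" / src.name)
            copied.append(src.name)
    return copied
```

The function can be re-run safely: existing folders are left in place, and files with the same name in plot_ideogram/input are overwritten.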

6. Additional information

Citation

Please cite our preprint if you use this pipeline:

T. D. Lewin, I. J.-Y. Liao, M.-E. Chen, J. D. D. Bishop, P. W. H. Holland, Y.-J. Luo, Fusion, Fission, and Scrambling of the Bilaterian Genome in Bryozoa. bioRxiv, 2024.02.15.580425 (2024).