
GALEON

A Comprehensive Bioinformatic Tool to Analyse and Visualise Gene Clusters in Complete Genomes


To facilitate the identification, analysis, and visualisation of physically clustered gene family genes within chromosome-level genomes, we introduce GALEON, a user-friendly bioinformatic tool. GALEON identifies gene clusters by studying the spatial distribution of pairwise physical distances among gene family members along with the genome-wide gene density. The pipeline also enables the simultaneous analysis and comparison of two gene families, and allows the exploration of the relationship between physical and evolutionary distances. This tool offers a novel approach for studying the origin and evolution of gene families.

GALEON documentation can also be found at: http://www.ub.edu/softevol/galeon

Version history

V1: Initial release

Contents

  1. Installation and prerequisites
  2. Input data
  3. Running GALEON
  4. Example dataset
  5. Citation
  6. References
  7. Troubleshooting

1. Installation and prerequisites

GALEON is distributed as a set of scripts that can be called from the GALEON_masterScripts folder and do not require any specific installation or compilation step. However, the pipeline does require several Python modules and R packages, as well as external software. All of them are listed in Sections 1.2, 1.3 and 1.4.

It is highly recommended to install the Galeon conda environment, which provides all of the required Python packages as well as some of the external programs, such as pandoc and mafft.

1.1. Install GALEON

# 1-Download the software
git clone https://github.com/molevol-ub/galeon.git
cd galeon

# 2-Make the binaries executable
chmod +x GALEON_masterScripts/bin/*

# 3-Activate conda and install the Galeon conda environment
conda activate
conda env create -f GaleonEnv.yml

# 4-Activate the environment
conda activate Galeon

# 5-Run the configuration script
# this will add a header like this “#!/home/user/miniconda3/envs/Galeon/bin/python” to the python scripts
python Configure.py YOURPATH_to/GALEON_masterScripts

Dependencies installation checkpoint

Once all the packages have been installed, run the following command to check that all the dependencies are available and accessible.

# 6-Enter the GALEON_masterScripts directory and run the following script
cd GALEON_masterScripts
python Scripts/Check_installed_packages_and_PythonEnv.py 

If you encounter errors related to the external software (bedtools, mafft, iqtree2, FastTree), check the help message of the script to see how to provide the path to your own installation.

In addition, note that R and two R packages, rmarkdown and DT, need to be installed (see 1.3 to install them).

python Scripts/Check_installed_packages_and_PythonEnv.py -h

Export GALEON to PATH

The GALEON scripts should preferably be added to PATH so that they can be called from any directory.

# 7-Export the path_to_GALEON to PATH
nano ~/.bashrc

# add this line: export PATH=YOURPATH_to/GALEON_masterScripts:$PATH
# save and exit
# run
source ~/.bashrc

# 8-Check the accessibility to the Galeon control script
which GALEON_ControlScript.py
# now it should output: YOURPATH_to/GALEON_masterScripts/GALEON_ControlScript.py

1.2. Python Packages

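It is highly recommended to use conda, since the Galeon conda environment (see Section 1.1) installs all the required Python packages as well as some of the required software (mafft and newick_utils). Alternatively, you may install the packages separately using pip; consult the documentation of each package. Note that most of the modules listed below belong to the Python standard library; only matplotlib, numpy, pandas, seaborn and scipy are third-party packages that need to be installed.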

argparse
ast
collections
copy
gc
itertools
matplotlib
numpy
operator
os
pandas
re
seaborn
shutil
string
scipy
subprocess
sys
time

1.3. R Packages

Make sure R is installed, together with two additional R packages, rmarkdown and DT, which are needed to create the final report in HTML format. These packages can be installed directly from the R console as follows.

# 1-Open a terminal and run “R”
R

# 2-Install packages
>install.packages("rmarkdown")
>install.packages("DT")

# 3-Check that they can be loaded
>library("rmarkdown")
>library("DT")

1.4. Additional software

The following programs must be installed and available from command line: pandoc, mafft, bedtools, FastTree and iqtree2.

We provide a bin directory with binaries of bedtools, FastTree and iqtree2. If the Galeon conda environment is created, pandoc, mafft and newick_utils should be available upon environment activation.
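As a quick complement to the dependency check script, the following minimal Python sketch reports which of the programs listed above can be found on PATH. It is only an illustration and does not replace Check_installed_packages_and_PythonEnv.py; the function names are hypothetical, and the program names are assumed to match the executable names exactly as written in this section.

```python
import shutil

# Program names as written in this section (assumed to be the executable names).
REQUIRED_PROGRAMS = ["pandoc", "mafft", "bedtools", "FastTree", "iqtree2"]


def find_programs(names):
    """Map each program name to its absolute path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_programs(names):
    """Return the program names that cannot be found on PATH."""
    return [name for name, path in find_programs(names).items() if path is None]


# Print one line per program with its location or a NOT FOUND flag.
for name, path in find_programs(REQUIRED_PROGRAMS).items():
    print(f"{name:10s} {path or 'NOT FOUND'}")
```

If a program is reported as NOT FOUND, activate the Galeon conda environment or add the directory that contains the binary to PATH before running the pipeline.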

Notes:

Alternatively, check the corresponding documentation for installation instructions.

Tested software versions:

Known issues

2. Input data

Warning: Please prepare the inputs carefully; we recommend reading and following the instructions below closely. The input file naming structure and formats are mandatory.

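GALEON uses three types of input files: annotation (coordinate) files (Section 2.1), a protein or MSA file (Section 2.2) and a chromosome/scaffold size file (Section 2.3).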

2.1. Annotation files

All the input coordinate files MUST be provided in the same file format.

2.1.1. GFF3 format

scaffold source feature start end score strand frame attribute
Scaffold_14804_HRSCAF_18385 AnnotGFF gene 41841903 41843055 0.51 - . ID=g10232;blastphmmer;annot;Pos:1-383
Scaffold_14804_HRSCAF_18385 AnnotGFF mRNA 41841903 41843055 0.51 - . ID=g10232.t1;Parent=g10232;blastphmmer;annot;Pos:1-383
Scaffold_14804_HRSCAF_18385 AnnotGFF CDS 41841904 41843052 0.51 - 0 ID=g10232.t1.CDS1;Parent=g10232.t1;blastphmmer;annot;Pos:1-383
Scaffold_14804_HRSCAF_18385 AnnotGFF gene 47268322 47268742 0.67 + . ID=g10331;blastphmmer;annot;Pos:1-139
Scaffold_14804_HRSCAF_18385 AnnotGFF mRNA 47268322 47268742 0.67 + . ID=g10331.t1;Parent=g10331;blastphmmer;annot;Pos:1-139
Scaffold_14804_HRSCAF_18385 AnnotGFF CDS 47268322 47268738 0.67 + 0 ID=g10331.t1.CDS1;Parent=g10331.t1;blastphmmer;annot;Pos:1-139
Scaffold_14804_HRSCAF_18385 AnnotGFF gene 47277448 47277868 0.63 + . ID=g10332;blastphmmer;annot;Pos:1-139
Scaffold_14804_HRSCAF_18385 AnnotGFF mRNA 47277448 47277868 0.63 + . ID=g10332.t1;Parent=g10332;blastphmmer;annot;Pos:1-139
Scaffold_14804_HRSCAF_18385 AnnotGFF CDS 47277448 47277864 0.63 + 0 ID=g10332.t1.CDS1;Parent=g10332.t1;blastphmmer;annot;Pos:1-139

2.1.2. BED2 format

Scaffold ID start end attribute
Scaffold_14804_HRSCAF_18385 41841903 41843055 g10232
Scaffold_14804_HRSCAF_18385 47268322 47268742 g10331
Scaffold_14804_HRSCAF_18385 47277448 47277868 g10332
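To illustrate how the BED2 columns relate to the GFF3 example in Section 2.1.1, here is a minimal Python sketch that extracts the "gene" features of a GFF3 file and returns BED2-like rows. It is not part of GALEON; the function name is hypothetical, and it assumes that the gene name is given by the ID= attribute and that coordinates are copied unchanged, as in the examples above.

```python
import re


def gff3_to_bed2(gff3_lines, feature="gene"):
    """Return (scaffold, start, end, ID) tuples for the chosen GFF3 feature type."""
    rows = []
    for line in gff3_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Prefer tab-separated columns; fall back to any whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 9 or fields[2] != feature:
            continue
        match = re.search(r"ID=([^;]+)", fields[8])
        if match is None:
            continue
        rows.append((fields[0], int(fields[3]), int(fields[4]), match.group(1)))
    return rows


example = [
    "Scaffold_14804_HRSCAF_18385 AnnotGFF gene 41841903 41843055 0.51 - . ID=g10232;blastphmmer;annot;Pos:1-383",
    "Scaffold_14804_HRSCAF_18385 AnnotGFF mRNA 41841903 41843055 0.51 - . ID=g10232.t1;Parent=g10232;blastphmmer;annot;Pos:1-383",
]
for scaffold, start, end, gene_id in gff3_to_bed2(example):
    print(scaffold, start, end, gene_id, sep="\t")
```

For the two example lines, only the "gene" feature is kept, which yields the first row of the BED2 example shown above.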

2.1.3. BED1 format

Scaffold ID start end
Scaffold_14804_HRSCAF_18385 41841903 41843055
Scaffold_14804_HRSCAF_18385 47268322 47268742
Scaffold_14804_HRSCAF_18385 47277448 47277868
Scaffold_14804_HRSCAF_18385 51844347 51844998
Scaffold_14804_HRSCAF_18385 52310537 52311098

NOTE: This format may be useful when there is a problem with the format of the gene names. In that case, the user can still run some tests to check whether the input gene family is organized in clusters.

2.2. Proteins or MSA file

To compute the evolutionary distances, you will need to provide either the proteins of your gene family of interest or the corresponding MSA in FASTA format.

NOTES:

2.3. Chromosome/Scaffold size file

This is used mainly as a guide to filter the output results and summarise the findings focusing on the main scaffolds (those corresponding to chromosomes) or a subset of scaffolds of choice (for example: the ten largest scaffolds or a list of scaffolds of interest).

Scaffold ID Length (in bp) Scaffold associated name
Scaffold_15362_HRSCAF_19823 317950935 ChrX
Scaffold_14804_HRSCAF_18385 177171321 Chr1
Scaffold_14178_HRSCAF_16784 176727214 Chr2

If you don't have this file, you can generate it by running the script Get_scaffold_length.pl, located in GALEON_masterScripts/Scripts. It takes your genome file in FASTA format as input.

Command example:

perl GALEON_masterScripts/Scripts/Get_scaffold_length.pl YOURgenome.fasta

NOTE: The output table has the 3-column format described above. Optionally, you can edit the third column, replacing the scaffold IDs with scaffold-associated names (for example, chromosome names). Check the example file.
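For readers who want to see what such a table contains, the following Python sketch computes scaffold lengths from FASTA lines and formats them as three columns. It is an illustration only and not the implementation of Get_scaffold_length.pl; the function names, the tab separator, the sorting by length and the absence of a header line are all assumptions.

```python
def scaffold_lengths(fasta_lines):
    """Return {scaffold_id: length_in_bp} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the scaffold ID.
            words = line[1:].split()
            current = words[0] if words else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def size_table(lengths, names=None):
    """Format rows as 'ID<TAB>length<TAB>name', largest scaffold first.

    `names` optionally maps scaffold IDs to associated names (e.g. 'Chr1');
    when no name is given, the scaffold ID is repeated in the third column.
    """
    names = names or {}
    ordered = sorted(lengths.items(), key=lambda item: item[1], reverse=True)
    return [f"{sid}\t{length}\t{names.get(sid, sid)}" for sid, length in ordered]


toy_genome = [">scaf_A first scaffold", "ACGT", "AC", ">scaf_B", "ACGTACGTAA"]
for row in size_table(scaffold_lengths(toy_genome), names={"scaf_B": "Chr1"}):
    print(row)
```

To process a real genome, pass an open file handle (for example, open("YOURgenome.fasta")) instead of the toy list.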


3. Running GALEON

3.1. Estimate g parameter (mode: gestimate)

In this mode, the pipeline estimates the expected number of gene family members per unit of genome length, the number of genes expected in a window of each input g value, and the probability of finding two or more genes by chance in a window of size g (e.g., 100 Kb). Two or more genes found within such a window would be considered a cluster in the subsequent analyses (Section 3.2).

Use the commands below to evaluate one or several g values. No input files are required here.

Help message

GALEON_ControlScript.py gestimate -h

Commands

# Run using one g value
GALEON_ControlScript.py gestimate -n 134 -s 1354 -g 100

# Test several g values
GALEON_ControlScript.py gestimate -n 134 -s 1354 -g 150,200,300,400

Output

# Output table
g_estimation_Results_Directory/
└── g_estimation.table.txt

# Logs and error messages
Logs_gestimate_mode/
├── gestimation.err
└── gestimation.out

Columns of the output table:

  1. g value: the input g value
  2. Mb per gene family member: one gene is expected every "X" Mb
  3. Expected genes per Mb: number of genes expected per Mb
  4. Exp. genes per g value: number of genes expected per 1 Kb
  5. Exp. Genes/g value: number of genes expected per "g" Kb
  6. P(X>=2) in g kb: probability of finding two or more genes by chance in a "g" kb stretch
  7. Poisson's lambda
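The quantities described in the columns above can be sketched with a simple Poisson model. The following Python example is a hedged illustration, not GALEON's own code: it assumes that -n is the number of gene family members, -s the genome size in Mb and -g the window size in kb (these meanings are inferred from the column descriptions), and that Poisson's lambda is the expected number of genes in a g kb window. The function name and dictionary keys are hypothetical.

```python
import math


def g_estimate(n_genes, genome_size_mb, g_kb):
    """Expected gene counts and P(X >= 2) in a g kb window under a Poisson model."""
    genes_per_mb = n_genes / genome_size_mb
    genes_per_kb = genes_per_mb / 1000
    lam = genes_per_kb * g_kb  # expected number of genes in one g kb window
    # P(X >= 2) = 1 - P(X = 0) - P(X = 1) for X ~ Poisson(lam)
    p_two_or_more = 1 - math.exp(-lam) * (1 + lam)
    return {
        "g": g_kb,
        "mb_per_gene": genome_size_mb / n_genes,
        "genes_per_mb": genes_per_mb,
        "genes_per_kb": genes_per_kb,
        "lambda": lam,
        "p_two_or_more": p_two_or_more,
    }


# Same values as in the commands above: 134 genes, 1354 Mb, several g values.
for g in (100, 150, 200, 300, 400):
    row = g_estimate(134, 1354, g)
    print(f"g={row['g']} kb  lambda={row['lambda']:.5f}  P(X>=2)={row['p_two_or_more']:.2e}")
```

Under these assumptions, a small probability for a given g means that two family members lying within g kb of each other are unlikely to be neighbours by chance, which is the rationale for treating them as a cluster.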


3.2. Gene cluster identification (mode: clusterfinder)

In this mode the pipeline analyzes one (or several) gene families' data to identify clusters of genes in the genome.

Help message

GALEON_ControlScript.py clusterfinder -h

3.2.1. Single family analysis using physical distances

In this case, the coordinate files are analyzed to obtain the pairwise distances between genes, which are arranged in a distance matrix. This matrix is then scanned to identify gene clusters. Finally, the distance matrix is displayed as a heatmap, with all the identified clusters (if any) outlined by black squares.
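The idea of building a pairwise distance matrix and scanning it for clusters can be sketched as follows. This Python example is a simplified illustration under stated assumptions, not GALEON's actual algorithm: the distance between two genes is taken as the gap between the end of the upstream gene and the start of the downstream gene, and a cluster is any run of two or more consecutive genes whose neighbouring gaps do not exceed g kb. The function names are hypothetical.

```python
def pairwise_distances(genes):
    """genes: list of (gene_id, start, end) on one scaffold.

    Returns the gene IDs sorted by start and a symmetric matrix of gaps in bp.
    """
    ordered = sorted(genes, key=lambda gene: gene[1])
    n = len(ordered)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed distance: start of downstream gene minus end of upstream gene.
            gap = max(0, ordered[j][1] - ordered[i][2])
            matrix[i][j] = matrix[j][i] = gap
    return [gene[0] for gene in ordered], matrix


def find_clusters(ids, matrix, g_kb=100):
    """Group consecutive genes whose neighbouring gap is at most g kb (assumed rule)."""
    max_gap = g_kb * 1000
    clusters, current = [], []
    for i, gene_id in enumerate(ids):
        if current and matrix[i - 1][i] <= max_gap:
            current.append(gene_id)
        else:
            if len(current) >= 2:
                clusters.append(current)
            current = [gene_id]
    if len(current) >= 2:
        clusters.append(current)
    return clusters


# Gene coordinates taken from the annotation example in Section 2.1.
demo_genes = [
    ("g10232", 41841903, 41843055),
    ("g10331", 47268322, 47268742),
    ("g10332", 47277448, 47277868),
]
demo_ids, demo_matrix = pairwise_distances(demo_genes)
print(find_clusters(demo_ids, demo_matrix, g_kb=100))  # [['g10331', 'g10332']]
```

In this toy case, g10331 and g10332 are 8,706 bp apart and form a cluster at g = 100 kb, whereas g10232 lies more than 5 Mb away and remains a singleton.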

Inputs

# How your input annotation directory "GFFs/" should look
└── GFFs
    ├── GR_fam.gff3
    └── IR_fam.gff3

Commands

Follow the steps below to run the analysis, generate the plots and create a final portable HTML report, which provides an overview of all the results at a glance.

Step 1) Find clusters, independently for each input gene family using the coordinates files.

# Simplest command to run Galeon
GALEON_ControlScript.py clusterfinder -a GFFs/ -e disabled

NOTE: The use of evolutionary distances is disabled here (-e disabled).

Step 2) Generate summary plots and tables for each input gene family

# Generate summary files for the GR family
GALEON_SummaryFiles.py -fam GR -clust clusterfinder_Results_Directory/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

# Generate summary files for the IR family
GALEON_SummaryFiles.py -fam IR -clust clusterfinder_Results_Directory/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

Step 3) Generate a final HTML report

# Generate the final HTML report
GALEON_Report.py -clust clusterfinder_Results_Directory/ -ssize ChrSizes.txt -echo False
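When several families are analysed, the three steps above can be assembled programmatically. The Python sketch below only builds the command lines, using exactly the options shown in this section; the helper name build_commands and its default arguments are hypothetical, and nothing is executed.

```python
import shlex


def build_commands(families, annot_dir="GFFs",
                   results_dir="clusterfinder_Results_Directory/",
                   size_file="ChrSizes.txt", sfilter=7):
    """Return the command lines for Steps 1 to 3 as lists of arguments."""
    # Step 1: find clusters using physical distances only.
    commands = [["GALEON_ControlScript.py", "clusterfinder",
                 "-a", annot_dir + "/", "-e", "disabled"]]
    # Step 2: one summary command per gene family.
    for family in families:
        commands.append(["GALEON_SummaryFiles.py", "-fam", family,
                         "-clust", results_dir, "-coords", annot_dir,
                         "-ssize", size_file, "-sfilter", str(sfilter)])
    # Step 3: final HTML report.
    commands.append(["GALEON_Report.py", "-clust", results_dir,
                     "-ssize", size_file, "-echo", "False"])
    return commands


for command in build_commands(["GR", "IR"]):
    print(shlex.join(command))
```

Each list can be passed to subprocess.run(command, check=True) to execute the steps in order and stop at the first failure.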


Output

GALEON will generate a portable HTML report for each family and each tested g value. The report contains all the generated tables and plots, making it easy to access all the results quickly. It also includes a "HELP" tab with useful information for interpreting the results.

Galeon results directory content

Galeon results directory, example tree-like representation for two gene families "GR_fam" and "IR_fam":

clusterfinder_Results_Directory/
├── PhysicalDist_Matrices
│   ├── GR_fam.gff3.temp_matrices # Physical distance matrices, *matrix (in bp units)
│   ├── IR_fam.gff3.temp_matrices 
│   ├── GR_fam.gff3.temp_matrices_100.0g # Heatmaps in svg and pdf format
│   └── IR_fam.gff3.temp_matrices_100.0g
│
├── Plots # Contains summary tables and plots for each input family
│   ├── GR_fam
│   │   ├── GR_family_ClusterSizes.table.100.0g.tsv
│   │   ├── GR_family_GeneLocation.table.100.0g.tsv
│   │   ├── GR_family_GeneOrganizationGenomeSummary.table.100.0g.tsv
│   │   ├── GR_family_GeneOrganizationSummary.table.100.0g.tsv
│   │   │
│   │   ├── IndividualPlots_100.0g
│   │   └── SummaryPlots_100.0g
│   └── IR_fam
│       ├── IR_family_ClusterSizes.table.100.0g.tsv
│       ├── IR_family_GeneLocation.table.100.0g.tsv
│       ├── IR_family_GeneOrganizationGenomeSummary.table.100.0g.tsv
│       ├── IR_family_GeneOrganizationSummary.table.100.0g.tsv
│       │
│       ├── IndividualPlots_100.0g
│       └── SummaryPlots_100.0g
│
└── Reports # One report for each family and tested g value.
    ├── GR_fam_100.0g_Report.html
    └── IR_fam_100.0g_Report.html
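Because the reports are named after the family and the tested g value, they can be located programmatically. The following Python sketch lists the reports found in a results directory; it assumes the file naming shown in the tree above (FAMILY_Gg_Report.html inside Reports/), and the function name is hypothetical.

```python
import re
from pathlib import Path

# Matches names such as "GR_fam_100.0g_Report.html" (naming taken from the tree above).
REPORT_PATTERN = re.compile(r"^(?P<family>.+)_(?P<g>\d+(?:\.\d+)?)g_Report\.html$")


def list_reports(results_dir):
    """Return (family, g_value, path) tuples for the HTML reports in results_dir/Reports."""
    reports = []
    for path in sorted(Path(results_dir, "Reports").glob("*_Report.html")):
        match = REPORT_PATTERN.match(path.name)
        if match:
            reports.append((match.group("family"), float(match.group("g")), path))
    return reports
```

For example, list_reports("clusterfinder_Results_Directory") would return one entry per family and g value for the directory layout shown above.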


3.2.2. Single family analysis using physical and evolutionary distances

The coordinate files are processed as described in Section 3.2.1 to obtain the matrices and identify the clusters. However, in addition to the coordinate files, protein sequences are included to compute evolutionary distances. These distances are then merged with the physical distance matrix by replacing the values of its upper semi-matrix (the cells above the diagonal). This "merged" matrix is displayed as a heatmap, with all the identified clusters (if any) outlined by black squares.
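The merging step can be pictured with a small Python sketch that keeps the physical distances below the diagonal and writes the evolutionary distances above it. This is an illustration of the idea described above, not GALEON's code; the function name and the toy values are hypothetical, and both matrices are assumed to be square and to list the genes in the same order.

```python
def merge_matrices(physical, evolutionary):
    """Copy `physical` and replace the cells above the diagonal with `evolutionary` values."""
    n = len(physical)
    if len(evolutionary) != n or any(len(row) != n for row in physical + evolutionary):
        raise ValueError("both matrices must be square and have the same size")
    merged = [row[:] for row in physical]
    for i in range(n):
        for j in range(i + 1, n):
            merged[i][j] = evolutionary[i][j]
    return merged


# Toy values: physical distances in bp, evolutionary distances in arbitrary units.
physical = [[0, 10, 20], [10, 0, 5], [20, 5, 0]]
evolutionary = [[0, 0.1, 0.2], [0.1, 0, 0.3], [0.2, 0.3, 0]]
for row in merge_matrices(physical, evolutionary):
    print(row)
```

In the resulting matrix, the cell at row i and column j (with i < j) holds the evolutionary distance between genes i and j, while the mirrored cell at row j and column i still holds their physical distance.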

Inputs

# How your input annotation directory "GFFs/" and protein directory "Proteins/" should look
├── GFFs
│   ├── GR_fam.gff3
│   └── IR_fam.gff3
└── Proteins
    ├── GR_fam.fasta # or GR_fam.aln if MSA is provided
    └── IR_fam.fasta # or IR_fam.aln

Commands

Follow the steps below to run the analysis, generate the plots and create a final portable HTML report, which provides an overview of all the results at a glance.

Step 1) Find clusters using the input coordinates and protein files

# Simplest command to run Galeon
# Run this...
GALEON_ControlScript.py clusterfinder -a GFFs/ -e enabled -p Proteins/

# ...or this if the MSA files are already present in the "Proteins/" directory for each of the gene families of interest
GALEON_ControlScript.py clusterfinder -a GFFs/ -e enabled -p Proteins -pm True

NOTE: Remember that the protein IDs and the gene IDs must be identical. For example, consider a GFF3 file with a "gene" feature named "ABC" and an "mRNA" feature named "ABC.t1":

scaffold source feature start end score strand frame attribute
Scaffold1 AnnotGFF gene 100 1000 . - . ID=ABC;annot;Pos:1-409;
Scaffold1 AnnotGFF mRNA 100 1000 . - . ID=ABC.t1;Parent=ABC;annot;Pos:1-409;

# (modified) Simplest command to run Galeon, adding the -feat gene option
# Run this...
GALEON_ControlScript.py clusterfinder -a GFFs/ -e enabled -p Proteins/ -feat gene

# (modified) ...or this if the MSA files are already present in the "Proteins/" directory for each of the gene families of interest
GALEON_ControlScript.py clusterfinder -a GFFs/ -e enabled -p Proteins -pm True -feat gene
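Since mismatched IDs are an easy mistake to make, the following Python sketch compares the sequence names of a protein FASTA file with the IDs of a chosen GFF3 feature type. It is a pre-flight check written for illustration and is not part of GALEON; the function names are hypothetical, and the ID is assumed to be the value of the ID= attribute.

```python
import re


def fasta_ids(fasta_lines):
    """Return the set of sequence names (first word of each header line)."""
    return {line[1:].split()[0] for line in fasta_lines
            if line.startswith(">") and line[1:].strip()}


def gff3_ids(gff3_lines, feature):
    """Return the set of ID= values for the given feature type (e.g. 'gene' or 'mRNA')."""
    ids = set()
    for line in gff3_lines:
        # Prefer tab-separated columns; fall back to any whitespace.
        fields = line.rstrip("\n").split("\t") if "\t" in line else line.split()
        if len(fields) >= 9 and fields[2] == feature:
            match = re.search(r"ID=([^;]+)", fields[8])
            if match:
                ids.add(match.group(1))
    return ids


def compare_ids(fasta_lines, gff3_lines, feature):
    """Return (IDs only in the FASTA, IDs only in the GFF3), both sorted."""
    proteins = fasta_ids(fasta_lines)
    annotated = gff3_ids(gff3_lines, feature)
    return sorted(proteins - annotated), sorted(annotated - proteins)


gff3_example = [
    "Scaffold1 AnnotGFF gene 100 1000 . - . ID=ABC;annot;Pos:1-409;",
    "Scaffold1 AnnotGFF mRNA 100 1000 . - . ID=ABC.t1;Parent=ABC;annot;Pos:1-409;",
]
fasta_example = [">ABC", "MSTLLK"]
print(compare_ids(fasta_example, gff3_example, "gene"))  # ([], [])
print(compare_ids(fasta_example, gff3_example, "mRNA"))  # (['ABC'], ['ABC.t1'])
```

Two empty lists mean that every protein has a matching feature ID; in the example, a protein named "ABC" matches the "gene" feature but not the "mRNA" feature.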

Step 2) Get evolutionary statistics (Cst) and perform the Mann-Whitney test

# Get evolutionary statistics (Cst) and perform the Mann-Whitney test
GALEON_GetEvoStats.py -clust clusterfinder_Results_Directory/ -prot Proteins/ -coords GFFs

Step 3) Generate summary plots, tables and the HTML report

# Generate summary files for the GR family
GALEON_SummaryFiles.py -fam GR -clust clusterfinder_Results_Directory/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

# Generate summary files for the IR family
GALEON_SummaryFiles.py -fam IR -clust clusterfinder_Results_Directory/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

# Generate the final HTML report
GALEON_Report.py -clust clusterfinder_Results_Directory/ -ssize ChrSizes.txt -echo False


Output

Galeon results directory content

Galeon results directory, example tree-like representation for two gene families "GR_fam" and "IR_fam":

clusterfinder_Results_Directory/
│
├── MannWhitney_StatisticsResults
│    ├── GR_fam_GlobalStats_value.100.0g.txt
│    └── GR_fam_MannWhitney.results.brief.100.0g.tsv
│
├── MergedDistances_Dataframes
│    ├── GR_fam.IntermediateFiles/
│    ├── GR_fam.merged.matrices/ # Physical + Evolutionary distance matrices
│    ├── GR_fam.plots_100.0g/ # Physical + Evolutionary distance Heatmaps and Scatterplots in svg and pdf format
│    │
│    ├── IR_fam.IntermediateFiles/
│    ├── IR_fam.merged.matrices/
│    ├── IR_fam.plots_100.0g/
│    │
│    ├── GR_fam.GlobScatterPlot_100.0g.pdf # Physical vs evolutionary distance scatter plot at the genome level, that is, considering all the genes of the input gene family
│    ├── GR_fam.GlobScatterPlot_100.0g.svg
│    ├── IR_fam.GlobScatterPlot_100.0g.pdf
│    └── IR_fam.GlobScatterPlot_100.0g.svg
│
├── PhysicalDist_Matrices
│    ├── GR_fam.gff3.temp_matrices/  # Physical distance matrices, *matrix (in bp units)
│    ├── GR_fam.gff3.temp_matrices_100.0g/ # Heatmaps in svg and pdf format
│    │
│    ├── IR_fam.gff3.temp_matrices/
│    └── IR_fam.gff3.temp_matrices_100.0g/
│
├── Plots # same content as in "Section 3.2.1"
│    ├── GR_fam
│    │   ├── IndividualPlots_100.0g/
│    │   └── SummaryPlots_100.0g/
│    └── IR_fam
│        ├── IndividualPlots_100.0g/
│        └── SummaryPlots_100.0g/
│
└── Reports/ # same content as in "Section 3.2.1"
    ├── GR_fam_100.0g_Report.html
    └── IR_fam_100.0g_Report.html

3.2.3. Joint family analysis

Find clusters between two input families using the coordinates from the input files. Note that only two families can be analyzed at once, and protein sequences cannot be used in this mode.

Inputs

# How your input annotation directory "GFFs/" should look
└── GFFs
    ├── GR_fam.gff3
    └── IR_fam.gff3

Commands

Step 1) Find clusters using the input coordinates

# Simplest command to run Galeon
GALEON_ControlScript.py clusterfinder -a GFFs/ -e disabled -F BetweenFamilies

Step 2-3) Generate summary plots, tables and the HTML report

# Generate summary files for the GR, IR families and merged data of both
GALEON_SummaryFiles.py -fam merged -clust clusterfinder_Results_Directory/ -coords GFFs/merged_dir/ -ssize ChrSizes.txt -sfilter 7

# Generate the final HTML report
GALEON_Report.py -clust clusterfinder_Results_Directory/ -ssize ChrSizes.txt -echo False

Note how the -coords parameter is specified here (-coords GFFs/merged_dir/): it differs from the previous analyses because you must append "merged_dir" to the annotation directory path.

Galeon results directory content

Galeon results directory, example tree-like representation for two gene families "GR_fam" and "IR_fam":

clusterfinder_Results_Directory/
├── PhysicalDist_Matrices
│   ├── merged_fam.gff3.temp_matrices # contains matrices *matrix
│   └── merged_fam.gff3.temp_matrices_100.0g # physical distance plots in svg and pdf format
│
├── Plots
│   └── merged_fam # summary tables and plots
│       ├── GR_family_ClusterSizes.table.100.0g.tsv
│       ├── GR_family_GeneOrganizationGenomeSummary.table.100.0g.tsv
│       ├── GR_family_GeneOrganizationSummary.table.100.0g.tsv
│       │
│       ├── IR_family_ClusterSizes.table.100.0g.tsv
│       ├── IR_family_GeneOrganizationGenomeSummary.table.100.0g.tsv
│       ├── IR_family_GeneOrganizationSummary.table.100.0g.tsv
│       │
│       ├── GR.IR_family_GeneOrganizationSummary.table.100.0g.tsv
│       │
│       ├── merged_family_GeneLocation.table.100.0g.tsv
│       │
│       ├── IndividualPlots_100.0g
│       │   ├── GR # GR family clusters' size distribution
│       │   ├── GR.IR # Two-family clusters' size distribution
│       │   ├── IR # IR family clusters' size distribution
│       │   └── merged # All clusters' size distribution
│       └── SummaryPlots_100.0g
│           ├── GR
│           ├── GR.IR
│           ├── IR
│           └── merged
│           
└── Reports
    ├── merged_fam_100.0g_Report.html
    └── merged_fam_100.0g_Report.Rmd


4. Example dataset

Several test datasets are available in the Example_data directory. You can enter each test directory and run the commands described below. The output files for each example are also provided in the file Solved_examples.tar.gz.

4.1. Test 1. Estimate g parameter

*No input files are required here.

cd Test_1

# Run using one g value
GALEON_ControlScript.py gestimate -n 134 -s 1354 -g 100

# Test several g values
GALEON_ControlScript.py gestimate -n 134 -s 1354 -g 150,200,300,400

4.2. Test 2. Single family analysis using physical distances

cd Test_2

# Run Galeon
GALEON_ControlScript.py clusterfinder -a GFFs/ -g 100,200 -e disabled -outdir 2_OneFam_PhysDistOnly_GFF3

# Generate summary files and tables for GR family
GALEON_SummaryFiles.py -fam GR -clust 2_OneFam_PhysDistOnly_GFF3/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

# Generate summary files and tables for IR family
GALEON_SummaryFiles.py -fam IR -clust 2_OneFam_PhysDistOnly_GFF3/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

# Create a summary report
GALEON_Report.py -clust 2_OneFam_PhysDistOnly_GFF3/ -ssize ChrSizes.txt -echo False

4.3. Test 3. Single family analysis using physical and evolutionary distances (using unaligned protein sequences)

There are three directories, one for each option to compute the evolutionary distances. The example below uses Test_3_iqtree-fast:

cd Test_3_iqtree-fast/

# Run Galeon
GALEON_ControlScript.py clusterfinder -a GFFs/ -g 100,300 -e enabled -p Proteins/ -outdir 3_OneFam_PhysEvoDistances_GFF3/ -f orange

# Get evolutionary statistics (Cst) and perform the Mann-Whitney test
GALEON_GetEvoStats.py -clust 3_OneFam_PhysEvoDistances_GFF3/ -prot Proteins/ -coords GFFs

# Generate summary files and tables for GR family
GALEON_SummaryFiles.py -fam GR -clust 3_OneFam_PhysEvoDistances_GFF3/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

# Generate summary files and tables for IR family
GALEON_SummaryFiles.py -fam IR -clust 3_OneFam_PhysEvoDistances_GFF3/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

# Create a summary report
GALEON_Report.py -clust 3_OneFam_PhysEvoDistances_GFF3/ -plots Plots -ssize ChrSizes.txt -echo False

4.4. Test 4. Single family analysis using physical and evolutionary distances (using a protein MSA)

cd Test_4

# Run Galeon
GALEON_ControlScript.py clusterfinder -a GFFs/ -g 100,300 -e enabled -p Proteins/ -pm True -outdir 4_OneFam_PhysEvoDistances_GFF3_pm/

# Get evolutionary statistics (Cst) and perform the Mann-Whitney test
GALEON_GetEvoStats.py -clust 4_OneFam_PhysEvoDistances_GFF3_pm/ -prot Proteins/ -coords GFFs

# Generate summary files and tables for GR family
GALEON_SummaryFiles.py -fam GR -clust 4_OneFam_PhysEvoDistances_GFF3_pm/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

# Generate summary files and tables for IR family
GALEON_SummaryFiles.py -fam IR -clust 4_OneFam_PhysEvoDistances_GFF3_pm/ -coords GFFs -ssize ChrSizes.txt -sfilter 7

# Create a summary report
GALEON_Report.py -clust 4_OneFam_PhysEvoDistances_GFF3_pm/ -plots Plots -ssize ChrSizes.txt -echo False

4.5. Test 5. Joint analysis of two gene families

cd Test_5

# Run Galeon
GALEON_ControlScript.py clusterfinder -a GFFs/ -F BetweenFamilies -e disabled -outdir 5_TwoFamJointAnalysis_GFF3 -cmap_1 Blues_r -f red -f2 orange

# Generate summary files and tables
GALEON_SummaryFiles.py -fam merged -clust 5_TwoFamJointAnalysis_GFF3/ -coords GFFs/merged_dir/ -ssize ChrSizes.txt -sfilter 7

# Create a summary report
GALEON_Report.py -clust 5_TwoFamJointAnalysis_GFF3/ -ssize ChrSizes.txt -echo False

4.6. Test 6. GR and IR single family analysis using physical and evolutionary distances (using a protein MSA) in Dysdera silvatica

This dataset includes the information for 98 GRs and 411 IRs annotated in the Dysdera silvatica genome (Escuer et al. 2022) [1].

cd Test_6

# Run Galeon
GALEON_ControlScript.py clusterfinder -a GFFs/ -g 100 -e enabled -p Proteins/ -outdir 6_OneFam_PhysEvoDistances_GFF3/ -f orange -pm True

# Get evolutionary statistics (Cst) and perform the Mann-Whitney test
GALEON_GetEvoStats.py -clust 6_OneFam_PhysEvoDistances_GFF3/ -prot Proteins/ -coords GFFs

# Generate summary files and tables for GR family
GALEON_SummaryFiles.py -fam GR -clust 6_OneFam_PhysEvoDistances_GFF3/ -coords GFFs -ssize ChrSizes.txt -sfilter 7 -colval magma

# Generate summary files and tables for IR family
GALEON_SummaryFiles.py -fam IR -clust 6_OneFam_PhysEvoDistances_GFF3/ -coords GFFs -ssize ChrSizes.txt -sfilter 7 -colval magma

# Create a summary report
GALEON_Report.py -clust 6_OneFam_PhysEvoDistances_GFF3/ -plots Plots -ssize ChrSizes.txt -echo False

5. Citation

Vadim Pisarenco, Joel Vizueta, Julio Rozas. GALEON: A Comprehensive Bioinformatic Tool to Analyse and Visualise Gene Clusters in Complete Genomes. Submitted. 2024. https://www.biorxiv.org/content/10.1101/2024.04.15.589673v1

6. References

[1] Escuer, P., Pisarenco, V. A., Fernández-Ruiz, A. A., Vizueta, J., Sánchez-Herrero, J. F., Arnedo, M. A., Sánchez-Gracia, A., & Rozas, J. (2022). The chromosome-scale assembly of the Canary Islands endemic spider Dysdera silvatica (Arachnida, Araneae) sheds light on the origin and genome structure of chemoreceptor gene families in chelicerates. Molecular Ecology Resources, 22, 375–390. https://doi.org/10.1111/1755-0998.13471

7. Troubleshooting

Should you encounter any error, please create an issue on GitHub specifying the error and providing as many details as possible.