nicocriscuolo / StructuRly

Comprehensive, detailed and interactive plots for STRUCTURE and ADMIXTURE population analysis
https://nicocriscuolo.shinyapps.io/StructuRly/
GNU General Public License v2.0
admixture barplots comparing-partitions hierarchic-analysis interactive-plots plotly population-analysis q shiny structure tables triangleplot

StructuRly 0.1.0

StructuRly is an R package containing a shiny application that produces detailed, interactive graphs of the results of a Bayesian cluster analysis obtained with the most common population genetics software used to investigate population structure, such as STRUCTURE or ADMIXTURE. These programs are widely used to infer the admixture ancestry of samples from genetic markers such as SNPs, AFLPs, RFLPs and microsatellites (SSRs). More generally, StructuRly can generate graphs from any file containing the admixture information of each sample, encoded as proportions in the range 0 to 1. We developed StructuRly to give researchers detailed graphical outputs for interpreting their statistical results through a user-friendly interface that can be used easily by those who do not know a programming language. In a typical StructuRly output, the user can display the ID of each sample, the original membership assigned by the researcher to the sampled populations (or subpopulations) and the label of the sampling site; the latter is a variable that population analysis software uses to support the data analysis algorithm. Furthermore, StructuRly outputs are interactive, which lets the user extract even more information from a single chart.

In addition, this shiny application offers further features to:

Installation

You can install the released version of StructuRly from GitHub in RStudio with:

install.packages(pkgs = "devtools")

library(devtools)

install_github(repo = "nicocriscuolo/StructuRly", dependencies = TRUE)

Once the package and its dependencies are installed, you can load the package and run the software in the default browser with the following functions:

library(StructuRly)

runStructuRly()

If you have trouble installing StructuRly, you can follow the instructions at this link.

System requirements

StructuRly works on macOS, Windows and Linux operating systems. Install an up-to-date version of R (>= 3.5) and RStudio; StructuRly can then be launched in any type of browser (Internet Explorer, Safari, Chrome, etc.). In its current version, it also runs locally and therefore offline. If you need information about using the STRUCTURE or ADMIXTURE software (e.g. instructions to launch the software, how to prepare the input files and how to export the outputs), please visit their websites at the following links:

Moreover, the user can launch the Terminal (to start an ADMIXTURE population analysis) or the STRUCTURE software directly from the user interface of StructuRly (this function is currently available for macOS and Linux users). For these buttons to work, both programs must be installed on your computer.

N. B.: If you use a Linux-based machine, you may need specific Linux libraries to configure R properly and to install some StructuRly dependencies. To install these libraries, follow the instructions displayed in the R console when you load the dependency packages.
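
Before installing, it can help to confirm that the running R session meets the version requirement stated above. The following is a minimal sketch using only base R; the helper name `meets_r_requirement` is hypothetical and is not part of StructuRly.

```r
# Hypothetical helper (not part of StructuRly): TRUE if the running R
# session is at least the given minimum version.
meets_r_requirement <- function(minimum = "3.5.0") {
  getRversion() >= minimum
}

# Warn the user instead of failing, so the check can sit at the top of a script.
if (!meets_r_requirement()) {
  message("R ", getRversion(), " is older than 3.5; please update R first.")
}
```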

Online version

If you are not familiar with R or RStudio, you can access StructuRly directly from the web at the following link: https://nicocriscuolo.shinyapps.io/StructuRly/.

Data input

StructuRly is divided into three sections, depending on the input file chosen. For any type of file, the header of each variable is mandatory and varies according to the type of variable that must be present in the input dataset. If, within a StructuRly session, you replace the uploaded file with a new one (inside the same section), remember to set the type of separator again (e.g. comma, semicolon or tab) and to indicate whether your data contain quotation marks before producing new outputs.
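
To see why the separator and quotation settings matter, the sketch below reads the same text twice with base R's `read.table`. It assumes nothing about how StructuRly parses files internally; the table content and the `Q1`/`Q2` column names are invented for illustration.

```r
# Invented example content: semicolon-separated, with quoted text fields.
raw_text <- paste(
  '"Sample_ID";"Pop_ID";"Q1";"Q2"',
  '"S1";"PopA";0.9;0.1',
  '"S2";"PopB";0.2;0.8',
  sep = "\n"
)

# Separator and quote character declared correctly: quotes are stripped.
quoted <- read.table(text = raw_text, header = TRUE, sep = ";",
                     quote = "\"", stringsAsFactors = FALSE)

# Quotes not declared: the quotation marks stay inside the values.
unquoted <- read.table(text = raw_text, header = TRUE, sep = ";",
                       quote = "", stringsAsFactors = FALSE)

quoted[[1]]    # sample IDs without quotation marks
unquoted[[1]]  # sample IDs still wrapped in quotation marks
```

The second read does not fail, which is why a wrong setting can go unnoticed until the labels in a plot look odd.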

Data format

In the first section of StructuRly, you can import both .txt and .csv files. Since the second section also accepts the output of a population analysis performed with ADMIXTURE, there you can also import a .Q file and a .fam file (if the latter is available).
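
As a rough illustration of what the two ADMIXTURE-related files hold, the sketch below reads a .Q table (whitespace-separated, no header, one row per sample and one column per cluster) and a PLINK .fam table (six columns, no header) and pairs them by row order. The file contents are invented, and the column labels (`Cluster_1`, `FID`, `IID`, and so on) are names chosen here for readability, not names required by StructuRly.

```r
# Invented contents standing in for real .Q and .fam files.
q_text   <- "0.90 0.10\n0.25 0.75\n0.50 0.50"
fam_text <- "F1 S1 0 0 1 -9\nF1 S2 0 0 2 -9\nF2 S3 0 0 1 -9"

# .Q file: ancestry proportions, one column per cluster.
q <- read.table(text = q_text, header = FALSE)
names(q) <- paste0("Cluster_", seq_len(ncol(q)))

# .fam file: family ID, individual ID, father, mother, sex, phenotype.
fam <- read.table(text = fam_text, header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("FID", "IID", "PID", "MID", "Sex", "Phenotype"))

# Rows of the two files are matched by position, so the counts must agree.
stopifnot(nrow(q) == nrow(fam))
ancestry <- data.frame(Sample_ID = fam$IID, q, stringsAsFactors = FALSE)
```

With real files, `text = ...` would be replaced by the file path.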

In StructuRly you can also export a table ready to be imported into the STRUCTURE software. If you need detailed references about the structure of this dataset and how to perform the population analysis with STRUCTURE, you can find them at this link. If you want to use your raw genetic data to produce an input table for the ADMIXTURE software, you have to convert your matrix into a .ped or .bed file. You can do that with the PLINK software, as illustrated step by step at this link. More information about these last data formats is available here.

Download sample datasets

Examples of the .txt, .csv, .Q and .fam files that you can import into StructuRly are available at the following repository link: Sample datasets (the .Q and .fam files were obtained from an ADMIXTURE analysis of the sample files downloadable directly from the ADMIXTURE website).
To download the sample datasets from GitHub, right-click on the desired file and choose Download linked file. The sample datasets come in pairs of files: one contains the raw genetic data and the other the results of the STRUCTURE analysis performed on those data. They differ in format and content to cover different use-case scenarios, in particular:

Section 1: Import raw genetic data

The input for this section can contain three optional variables present in the following order and whose header must be precisely the one shown below:

The following variables present in the dataset to import in this section are mandatory and must contain numerical values relative to the types of markers used. Depending on the ploidy of the organism analyzed, there must be a number of columns for each locus equal to the number of alleles, in particular:

image\_1

N. B.: in the Sample_ID, Pop_ID and Loc_ID columns, avoid using “NA” as the name of a sample, a putative population or a collection site, because StructuRly could interpret those characters as a missing value and the plot would not display the correct information. This also applies to the preparation of the input datasets for Section 2.
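
The reason for this warning can be reproduced with base R alone. The sketch below assumes the file is parsed with R's default reading settings (StructuRly's internal reader was not inspected); the two-row table is invented.

```r
# Invented table in which one population is literally named "NA".
csv_text <- "Sample_ID,Pop_ID\nS1,NA\nS2,PopB"

# With default settings, the text "NA" is converted to a missing value.
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

is.na(default_read$Pop_ID)
```

The first population label is lost on import, so any plot that groups samples by Pop_ID has nothing to show for that sample.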

Missing values

When you prepare the file for this section of StructuRly, missing values must be indicated only with the abbreviation NA. The cells of the reactive table (in the table panel named “Input table”) that contain missing values will appear empty, while they are coded as -9 in the table that StructuRly can produce for download and import into STRUCTURE.

N. B.: if your data refer to diploid or polyploid organisms and one or more of your samples has a missing value at a specific locus, the NA value must be present for all the alleles of that locus.
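
The two rules above (NA for every allele of a missing locus, and -9 in the STRUCTURE export) can be sketched in base R as follows. This is an illustration under assumptions, not StructuRly's own code: the genotype values and the locus column names (`Locus1_a`, `Locus1_b`, ...) are invented, and a diploid organism is assumed.

```r
# Invented diploid data: two allele columns per locus.
genotypes <- data.frame(
  Sample_ID = c("S1", "S2", "S3"),
  Locus1_a  = c(120, NA, 124),
  Locus1_b  = c(122, NA, 124),   # S2: both alleles missing (valid)
  Locus2_a  = c(200, 204, 202),
  Locus2_b  = c(200, 206, NA),   # S3: only one allele missing (invalid)
  stringsAsFactors = FALSE
)

ploidy      <- 2
allele_cols <- setdiff(names(genotypes), "Sample_ID")
# Locus index of each allele column: 1, 1, 2, 2 for the table above.
locus_of    <- rep(seq_len(length(allele_cols) / ploidy), each = ploidy)

# Number of missing alleles per sample (rows) and locus (columns).
missing_counts <- sapply(split(allele_cols, locus_of), function(cols) {
  rowSums(is.na(genotypes[, cols, drop = FALSE]))
})

# A locus is inconsistent when some, but not all, of its alleles are NA.
partially_missing <- missing_counts > 0 & missing_counts < ploidy

# STRUCTURE-style coding: every NA in the allele columns becomes -9.
structure_table <- genotypes
structure_table[allele_cols][is.na(structure_table[allele_cols])] <- -9
```

Here `partially_missing` flags sample S3 at the second locus, which would need both alleles set to NA before import.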

Section 2: Import population analysis

Here the user can import a dataset obtained directly from the population analysis of their genetic data. The characteristics of this input file are not very different from those of the file imported in the previous section:

Below is an example of this type of file structure. In this case the Loc_ID column is not present; in fact, the three information variables are not mandatory for the datasets imported in Sections 1 and 2:

image\_2

Section 3: Compare partitions

The third section uses the input files of the first two sections to compare the partitions obtained from the hierarchical and the Bayesian cluster analysis. The imported datasets must refer to data of the same nature, and the number of observations must be the same in both files. In the partition derived from the admixture ancestry analysis, each sample is assigned to the population (STRUCTURE or ADMIXTURE cluster) in which it has its highest ancestry value. This partition can therefore contain at most as many clusters as were chosen for the population analysis; if no sample has its highest ancestry in a particular cluster, no observations are assigned to that cluster and it is not shown in the comparison plot and table.
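
The assignment rule described above can be sketched in a few lines of base R. The ancestry matrix and the hierarchical partition below are invented, and the code is an illustration of the rule rather than StructuRly's implementation.

```r
# Invented ancestry proportions for four samples and K = 3 clusters.
q <- matrix(
  c(0.70, 0.20, 0.10,
    0.15, 0.60, 0.25,
    0.55, 0.05, 0.40,
    0.30, 0.45, 0.25),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("Cluster_", 1:3))
)

# Each sample joins the cluster in which its ancestry value is highest.
bayesian <- factor(colnames(q)[apply(q, 1, which.max)], levels = colnames(q))

# Invented partition standing in for the hierarchical cluster analysis.
hierarchical <- factor(c("H1", "H2", "H1", "H2"))

# Cluster_3 is never the maximum, so it receives no samples ...
table(bayesian)

# ... and disappears from the comparison once empty levels are dropped.
comparison <- table(hierarchical, bayesian = droplevels(bayesian))
comparison
```

Although K = 3 was chosen, the comparison table has only two Bayesian columns, which matches the behaviour described for the comparison plot and table.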

Outputs download

The following image shows the main output downloaded from StructuRly, the barplot of the ancestry admixture. The sample labels on the X axis are colored according to the population indicated in the user input file, while the symbols at the top of the plot indicate the sampling site. In StructuRly there are 25 different symbols available but you can also simply decide to split the entire plot on the basis of the different categories inside the Pop_ID and Loc_ID variables.

image\_3

All StructuRly outputs can be downloaded as images in various high-quality formats directly from the user interface. However, the Triangle plot is produced with a specific function of the plotly package (and not with those of ggplot2), so to download it you need to install the orca software on your computer and follow the instructions at this link. If you don’t install orca, you can still download the Triangle plot with the plotly controls displayed directly on the interactive plot.

N. B.: for a dataset with a high sample number (> 500) remember to re-size your plot (width, height and resolution) to better distinguish the bars and the relative IDs.
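
The effect of plot width on a large dataset can be tried outside the application with base R graphics. The sketch below is not StructuRly's plotting code (which relies on ggplot2 and plotly); the ancestry values are randomly generated, and the rule of one inch per 40 samples is an arbitrary assumption for illustration.

```r
# Randomly generated ancestry proportions for 600 samples and 3 clusters.
set.seed(1)
n_samples <- 600
raw <- matrix(runif(n_samples * 3), ncol = 3)
q   <- raw / rowSums(raw)          # each row now sums to 1

# Widen the device as the number of samples grows (arbitrary scaling).
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = max(7, n_samples / 40), height = 5)

# One stacked bar per sample; clusters are the stacked segments.
barplot(t(q), col = c("tomato", "steelblue", "gold"), border = NA, space = 0,
        xlab = "Samples", ylab = "Admixture proportion")

invisible(dev.off())
```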

Example

Here’s a link to the YouTube video of StructuRly showing an example of how to use the software. Moreover, the flowchart below, accessible from the Instructions panel of the application, outlines a tutorial for using the software.

image\_4

Known bugs and limitations

Some minor bugs affect certain characteristics of the graphs, but they appear only in the interactive plots; the downloaded files are not affected.

Citation

StructuRly was first presented during the International BBCC meetings held in Naples (Italy) in November 2018, and its implementation is described in the paper StructuRly: a novel shiny app to produce comprehensive, detailed and interactive plots for population genetic analysis (submitted). If you use this package for your research, please cite:

Contact

For additional information regarding StructuRly, please consult the documentation or email us.