The goal of genepopedit is to provide a simple and flexible tool for manipulating large multi-locus genotype datasets in R.
Note input parameters for genepopedit versions > 1.0.0.5 are now all completely lowercase.
Use genepopedit to subset a SNP dataset by:
removing specified loci.
removing specified populations.
removing specified individuals.
renaming populations.
grouping populations.
reordering populations.
extracting meta-data:
create datasets for training, assignment, and outlier detection, according to a population stratified random sample.
convert Genepop to STRUCTURE, FSTAT, NEWHYBRIDS, ASSIGNER, BGC, TREEMIX, COLONY, Genetic Stock Identification (gsi_sim), HZAR, and flattened/unflattened format.
simulate individual genotypes using pooled DNA allele frequencies.
Fig 1. genepopedit workflow including data preparation, diagnostics, manipulation and transformation. Files, functions, and function operations are denoted by black, grey and blue text, respectively. Function inputs and outputs are denoted by dashed and solid lines, respectively.
Requirement: genepopedit functions through the manipulation of multi-locus diploid SNP files structured in the Genepop file format link. Specifically, we use the three number format (e.g. 110110) where the six digits correspond to the alleles of a given locus for an individual. Locus names can be listed in the first row separated by columns or each on their own row so that the total number of rows in the Genepop file equals:
nrows = nLOCI + nINDIVIDUALS + nPOPULATIONS - 1
or
nrows* = nLOCI + nINDIVIDUALS + nPOPULATIONS
* if STACKS version is not specified
For example:
A three locus dataset with two populations and four individuals per population with the STACKS version specified
STACKS Version 1.0
Loci_1
Loci_2
Loci_3
Pop
BON_01 , 120120 110110 110110
BON_02 , 100100 110110 110110
BON_03 , 100100 110110 110110
BON_04 , 100100 110110 110110
Pop
TAG_01 , 120120 110110 110110
TAG_02 , 120120 110110 110110
TAG_03 , 120120 110110 110110
TAG_04 , 120120 110110 110110
Alternatively, the loci names can be read in the first row as a single character separated by commas.
Loci_1,Loci_2,Loci,3
Pop
BON_01 , 120120 110110 110110
BON_02 , 100100 110110 110110
BON_03 , 100100 110110 110110
BON_04 , 100100 110110 110110
Pop
TAG_01 , 120120 110110 110110
TAG_02 , 120120 110110 110110
TAG_03 , 120120 110110 110110
TAG_04 , 120120 110110 110110
In both formats each row is read in as a single character vector. Sample IDs have the population and sample number separated by a "_". Between sample ID and the loci is conventional Genepop separator " , " (space,space space). Note if your population label is note separated from the sample number in the Sample IDs refer to the help section for . subset_genepop_rename
Note that input and output "path" variables all require the FULL file path. Relative paths will not work with genepopedit functions.
Most molecular based file formats can be converted to and from Genepop using conversion programs such as the R package adegenet or the program PGDspider.
You can install genepopedit as a R package using the following 2 steps. Before installation it is recommended that you open up a new R session without any packages loaded. During the devtools installation process some package dependencies may be downloaded or updated. Installation errors can arise if an older version of a package dependency is already loaded.
Step 1 Install the R package devtools
if (!require("devtools")) install.packages("devtools") # to install
Step 2 Install genepopedit:
#install the diveRsity dependency from Github ** this package may no longer be supported by CRAN, depending on your R Version **
#install the package from *Github*
devtools::install_github("kkeenan02/diveRsity")
devtools::install_github("rystanley/genepopedit")
library(genepopedit) # load the library
Step 3 Install PGDspider and plink. Note these programs are required for the use of genepop_toploc() and PGDspideR() only!
For the time being, this function only works with Plink 1.9. Use of Plink 2.0 will create function issues.
Note that appveyor build 'failure' is linked to an error in the build check and not genepopedit itself. This package has been tested on Windows, Linux (Unbuntu & Mint), and IOS operating systems.
genepopedit was written in collaboration:
Support funding was inpart provided by the Canadian Healthy Oceans Nework CHONe
This package has been developed to be used by anyone who is looking for a more efficient and repeatable method to manipulate large multi-locus datasets. The package is open, and I encourage you to tinker and look for improvements. I will do my best to respond to any inquiries, add additional functions and-or functionality, and improve the efficiency of the package.
Pull request how-to:
Full package description and citation now available at Molecular Ecology Resources http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12569/abstract
A Zenodo DOI is also avaiable for the most recent release of genepopedit:
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
variable | variable to report. |
Calculate allele frequencies (major) for a given set of loci aggregated by population grouping.
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
popgroup | population grouping using the "Pop" delimiter (Default: NULL) or a dataframe or path to a csv. The grouping dataframe should have two columns, the first corresponding to the population name and the second to an aggregation vector of common groups. Each population can only be assigned to one group. |
wide | logical (default: FALSE) defining whether the output should be cast in 'wide' format. Note that 'wide' format is accepted as the input for alleleotype_genepop(). |
Filter loci by global minor allele frequency. For the time being, this function only works with Plink 1.9. Use of Plink 2.0 will create function issues.
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
where.plink | A file path to the PLINK installation folder. |
where.pgdspider | A file path to the PGDspider installation folder. |
maf | Minor allele frequency cutoff (default = 0.05) |
path | the filepath & filename of output. |
This function will return the the panel with the highest Fst for unlinked loci ordered by Fst. Returned loci have the highest Fst from the original panel and are not linked to any other loci in the returned panel. Note this function requires the installation and access to both PGDspider for format conversion & PLINK for linkage calculations. For the time being, this function only works with Plink 1.9. Use of Plink 2.0 will create function issues. See installation instructions here
Variable name | Input |
---|---|
genepop | A path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
where.plink | A file path to the PLINK installation folder. |
where.pgdspider | A file path to the PGDspider installation folder. |
r2.theshold | Linkage threshold based on correlation between loci. Value must be between 0 & 1. Default: 0.2 matched to PLINK defaults. |
fst.threshold | The minimum r^2 threshold to consider a pair of loci to be considered linked (default: 0.05). |
ld.window | The number of sequential loci to be considered for linkage. This sliding window is based on the input order of loci in the input data. Default (NULL) will calcualte linkage among all loci. This variable should not be changed unless the order and proximity of loci on the genome is known and integrated into the order within the input data. |
ldpop | A string which denotes which of the populations you wish to calculate linkage disequilibrium from. The default is to calculate LD among all ("All") populations. Population names must match those returned by genepop_detective. |
allocate.pgd.ram | An integer defining how many GB of ram will be made available to PGD spider during calculation of LD. Default (1 GB) should be sufficient for most calculations. Note that for Windows based platforms the limitation for PGDspider is ~1 GB. If a value other than default is specified on Windows allocate.pgd.ram will be set to 1 GB. |
return.workspace | Logical query (default: TRUE) sepecifying whether to return the output to the workspace. |
save.output | Logical query (default: FALSE) to save the output to the same location as the file being analyzed. Each of the outputs of the function will be saved as a separate file with the file name of the orginal data appended with "Linkages", "Loci-FST", and "Unlinked-Loci-FST" for the pairwise linked loci along with their r^2, all loci with their global Fst, and only the top unlinked loci with their Fst respectively. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
subs | vector loci names of interest (default: NULL) |
keep | logical whether to keep loci specified by subs (default: TRUE) or to keep remaining loci. |
spop | populations to be retained (default: NULL). |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
nameframe | a dataframe or path to a csv detailing the original and any edited population names or sampleIDs. |
renumber | logical (default: FALSE) whether the sample numbers are to be replaced. |
meta | character defining which meta information is being edited. Options are "Pop" (default) or "Ind" for populations or sampleIDs respectively. This parameter must be specified. |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
subs | vector loci names of interest (default: NULL) |
keep | logical whether to keep loci specified by subs (default: TRUE) or to keep remaining loci. |
agpopframe | a dataframe or path to a csv detailing the population names & populations to be aggregated. |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
indiv | vector sample IDs of interest. |
keep | logical whether to delete sample IDs specified by indiv (default: TRUE) or delete all other IDs. |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
reorder | vector of populaton names which will be used to reorder the output file. Note population names must be unique(n=1) and match those returned by genepop_detective(). |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
input | complete file path to the input file to be converted or an object in the workspace. The first column should denote populations and adjacent columns the allele frequency (major or minor) for each loci (named in the column header). |
numsim | the number of simulated individuals to be returned per population (default: 100). |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file |
nsample | object which defines sampling. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
popgroup | popgroup is a dataframe or path to a csv. column one specifies the population and column two has the numeric grouping. If NULL (default) then population levels will be assigned an individual group. |
locusnames | logical (default=FALSE). Specify TRUE if you want the locus names from your Genepop file to be the first row in your Structure files. |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file |
path | the filepath & filename of output. |
addworkspace | logical to specify whether to save fstat conversion to path (default) or to workspace as Output_fstat object. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
popgroup | popgroup is a dataframe or path to a csv. column one specifies the population and column two has the numeric grouping. If NULL (default) then population levels will be assigned an individual group. |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
where.pgdspider | A file path to the PGDspider installation folder. |
where.plink | A file path to the PLINK installation folder. |
denote.missing | The value that denotes missing data in your input file (default: "000"). |
allocate.pgd.ram | An integer defining how many GB of ram will be made available to PGD spider during calculation of LD. Default (1 GB) should be sufficient for most calculations. Note that for Windows based platforms the limitation for PGDspider is ~1 GB. If a value other than default is specified on Windows allocate.pgd.ram will be set to 1 GB. |
path | file path to directory where the Colony files (4) will be saved. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
popdef | popgroup is a dataframe or path to a csv. This dataframe contains two columns. Column 1 corresponds to the population names. The next column has the grouping classification corresponding to each population defining parental 1 ("P1") parental 2 ("P2") and admixed ("Admixed") populations. |
fname | collective name assigned to each output the output files (3). |
path | the path to directory where the BGC files will be saved. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
where.pgdspider | A file path to the PGDspider installation folder. |
where.plink | A file path to the PLINK installation folder. |
allocate.pgd.ram | An integer defining how many GB of ram will be made available to PGD spider during calculation of LD. Default (1 GB) should be sufficient for most calculations. Note that for Windows based platforms the limitation for PGDspider is ~1 GB. If a value other than default is specified on Windows allocate.pgd.ram will be set to 1 GB. |
keep_inter | A logical vector specifying whether the intermediate conversion steps (.map,.ped, and clustering file) should be returned to the path. |
path | the path to directory where the TREEMIX files will be saved. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
path | the filepath & filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file |
distances | A dataframe or path to a text file with your distances between populations. Should contain 2 columns-Populations and Distances.There should be the same number of populations as in the Genepop file. |
path | the filepath and filename of output. |
Variable name | Input |
---|---|
genepop | a path to a Genepop file or a dataframe read into the workspace of a Genepop file. |
Variable name | Input |
---|---|
df | data.frame object in workspace. First column is the sampleID (e.g. "BON_01") the remaining columns are Loci. |
path | the filepath & filename of output. |
We have developed an R interface for PGDspider. This function will enable R to interact with PGDspider to convert among file formats not included in genepopedit conversion functions. Currently this is best accomplished using the .spid files generated by PGDspider to define the conversion process. Once a .spid file is created, genepopedit can continue to convert files provided the conversion parameters in the .spid file match those required for the conversion. This program will require installation of PGDspider. See installation instructions here
Variable name | Input |
---|---|
input | complete file path to the input file to be converted. |
input_format | format of the input file. This format should match the dropdown menus of pgdSpider in terms of spelling and capitalization (e.g. GENEPOP & FSTAT). |
output | complete file path to the defining where the converted file will be stored. |
output_format | format of the of the converted file. This format should match the dropdown menus of pgdSpider in terms of spelling and capitalization (e.g. GENEPOP & FSTAT). |
spid | complete file path to the .spid file created by pgdSpider defining the conversion between input_format and output_format. Note that parameters of this .spid file must match the conversion and input file as specified by pgdSpider. |
where.pgdspider | the filepath to the folder where pgdSpider installation files are stored. |
Example data for the use in the following documentation is included within genepopedit. The example genepop data is a GenePop data format with 10 populations, 30 individuals per population, and 100 Loci.
#Load example data in your workspace. 'GenePopData' will now be available for use.
data("genepopedit_data")
head(GenePopData) #View the data and the first 4 loci names.
If reading in data to your workspace use the following read.table code specifics and associated parameters (sep="\t",quote="",stringsAsFactors=FALSE)
#Example file path to your data
file.link <- "c:/Users/YOUR NAME/Documents/Data/genepopdata.txt"
#If your dataset has fewer than ~15,000 loci *read.table* should suffice
GenePop <- read.table(file.link,sep="\t",quote="",stringsAsFactors=FALSE)
#If your dataset has a large amount of loci (>15,000) then we suggest using the data.table package to load in the data more efficiently.
GenePop <- data.table::fread(file.link,header=FALSE, sep="\t",stringsAsFactors=FALSE)
However, reading in your GENEPOP data into the workspace is not required, as all functions can work with both workspace objects and file paths.
Set a directory where the output files will be saved.
output_dir <- "c:/Users/YourName/Documents/GenePopFiles/"
Functions available in genepopedit require that the population and sample number be separated by an _ in the sample ID (e.g. BON_01 not BON01). This can be accomplished manually using a text editor or using the function genepop_ID. genepop_ID will read in a Genepop file and look for a common character string within the sample ID vectors grouped by the Genepop "Pop" labels. This function assumes that there is a common string which unifies IDs between each of the Pop labels. If this isn't the case, some pre-filtering of the data using a text editor will be required. Function description
#Use genepop_ID to fix the sample IDs prior to the use of genepopedit functions.
filepath="~myfile.txt" #path to 'myfile' which has sample IDs without the required _ separation. Note this can also be an object in the workspace.
genepop_ID(genepop=filepath, path=paste0(output_dir,"Genepop_SampleID_fixed.txt"))
This function queries meta-data from your genepop data without having to open it in a text editor. Function returns a vector of population names, population counts, sample IDs, or Loci names. Function description
Return population names
PopNames <- genepop_detective(GenePopData, variable="Pops")
Return population counts
PopCounts <- genepop_detective(GenePopData, variable="PopNum")
Return sample IDs
SampleIDs <- genepop_detective(GenePopData, variable="Inds")
Return Loci names using the file path instead of the workspace object.
LociNames <- genepop_detective("genepopedit_examplefile.txt",variable="Loci")
Return All using the file path instead of the workspace object.
metadata <- genepop_detective("genepopedit_examplefile.txt",variable="All")
metadata$Pops #Populations
metadata$Loci #Loci names
metadata$Inds #Individual IDs
metadata$PopNum #Population counts
Return Allele values using the file path instead of the workspace object.
Alleles <- genepop_detective("genepopedit_examplefile.txt",variable="Allele")
This function will calculate allele frequencies based on the major allele among populations for each population group. Function description
#calculate major allele frequencies according to the default population groupings.
AlleleFreq_default <- genepop_allelefreq(GenePopData)
#calculate major allele frequencies according to the specified population groupings.
popgrouping <- data.frame(Name=c("AAA","BBB","CCC","DDD","EEE","FFF"),groups=c("gr1","gr1","gr2","gr2","gr3","gr3"))
AlleleFreq_specified <- genepop_allelefreq(GenePopData, popgroup=popgrouping)
This function will calculate global minor allele frequencies and filter based on a user defined threshold. Function description
#Filter loci with minor allele frequencies less than 0.07.
genepop_filter_maf(genepop="genepopedit_examplefile.txt",where.plink="c:/Users/YOURNAME/Documents/Programs/plink/",where.pgdspider="c:/Users/YOURNAME/Documents/Programs/PGDSpider_2.0.9.0/",maf=0.07,path=paste0(output_dir,"Genepop_maf_0-07.txt"))
This function will calculate linkage disequilibrium and global Weir and Cockerham's Fst for loci among populations. The function seeks a loci panel which maximizes global Fst for unlinked loci. The returned object is a list containing Linked loci ($Linked), global FST ($FST) and the top unlinked loci by FST ($FST_Unlinked). Note 'linked' loci are calcuated based on the threshold R2 of 0.2 used by PLINK. This threshold (0-1) can be adjusted, though values <0.2 may be limited by computational resources. Note that the available RAM for conversions using PGDspider is limited to 1 GB on Windows based operating systems. For the time being, this function only works with Plink 1.9. Use of Plink 2.0 will create function issues. Function description
Warning messages & diagnostics will be returned to the console by PGDspider & PLINK. These warnings are expected.
#Evaluate Fst and linkage among all loci and return an unlinked panel optimized by global Fst.
#To calculate LD we use PLINK which requires a file conversion using PGDspider.
TopLoci <- genepop_toploci(genepop="genepopedit_examplefile.txt",
where.plink="c:/Users/YOURNAME/Documents/Programs/plink/",
where.pgdspider="c:/Users/YOURNAME/Documents/Programs/PGDSpider_2.0.9.0/",fst.threshold = 0.04)
#Linked loci
TopLoci$Linked
#Loci FST among populations
TopLoci$FST
#Maximized FST by unlinked loci
TopLoci$FST_Unlinked
Create a new Genepop file which only contains Loci 3, 15, 23, 49, 62, 81, 88, & 94. This useful if, for example, you identify loci which are potentially under selection and want a parsed dataset. Here our loci are named in the Loci# (1:100). Function description
#inspect the names of the loci within the Genepop data
genepop_detective(GenePopData,"Loci")
#subset the Genepop file and 'keep' the specified loci names.
subloci <-c("Loci03","Loci15","Loci23","Loci49","Loci62","Loci81","Loci88","Loci94")
subset_genepop(genepop= GenePopData, keep = TRUE, subs = subloci, path = paste0(output_dir,"Genepop_Loci_selection.txt"))
You can also choose all loci not in the vector specified using keep = FALSE.
#subset the Genepop file and do not 'keep' the specified 'subs'
subset_genepop(genepop= GenePopData, keep = FALSE, subs = subloci, path = paste0(output_dir,"Genepop_Loci_neutral.txt"))
You can also remove populations within this function. Here we again select loci under selection, but remove populations 3 & 7 (CCC/GGG). To do this we create a list of all populations we want to keep (e.g. AAA, BBB, CCC ... JJJ)
#use genepop_detective to find the population names
PopNames <- genepop_detective(GenePopData,"Pops")
#create a list of populations you want to keep
#manual
PopKeep <- c("AAA","BBB","DDD","EEE","FFF","HHH","III","JJJ")
#or
#if you have fewer pops to remove
PopKeep <- setdiff(PopNames, c("CCC","GGG"))
#vector of loci to keep
subloci <- c("Loci03","Loci15","Loci23","Loci49","Loci62","Loci81","Loci88","Loci94")
subset_genepop(genepop= GenePopData, keep = TRUE, subs = subloci, spop = PopKeep, path = paste0(output_dir,"Genepop_Loci_selection_subpop.txt"))
Now change the names of some of the populations. In this example we will rename populations DDD & HHH to YYY & ZZZ. Function description
#view the population names
genepop_detective(GenePopData,"Pops")
#create a dataframe for renaming.
#column 1 = the original pop names.
#column 2 = the new names.
PopRename <- data.frame(oPop = c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ"),
newname = c("AAA","BBB","CCC","YYY","EEE","FFF","GGG","ZZZ","III","JJJ"))
#rename populations
subset_genepop_rename(genepop= GenePopData, path = paste0(output_dir,"Genepop_renamed.txt"),nameframe = PopRename)
Populations can also be grouped by a common name. The result is similar to subset-genepop-aggregate except that population names will be replaced. Here it is useful to enable renumbering (renumber = TRUE) or there individuals with the same number will be assigned the same ID, if populations are combined.
#view the population names
genepop_detective(GenePopData,"Pops")
#create a dataframe for renaming.
#column 1 = the original pop names.
#column 2 = the new names.
PopRename_group <- data.frame(oPop = c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ"),newname = c("Pop1","Pop1","Pop1","Pop1","Pop1","Pop2","Pop2","Pop2","Pop2","Pop2"))
#rename populations
subset_genepop_rename(genepop= GenePopData, nameframe = PopRename_group, renumber = TRUE, meta="Pop",path = paste0(output_dir,"Genepop_renamed_renumbered.txt"))
Alternatively the sampleIDs can be renamed directly.
#create a renaming vector (column 1 = old column 2 = new)
IndRename_group <- data.frame(old = c("AAA_10","BBB_10","CCC_10","DDD_10"),new = c("AAA_99","BBB_99","CCC_99","DDD_99"))
#rename sampleIDs
subset_genepop_rename(genepop= GenePopData, nameframe = IndRename_group, meta="Ind",path = paste0(output_dir,"Genepop_ID_renamed.txt"))
If your sample IDs are not split between the population and the sample number using an then you can rename the populations before using any subsetting functions. Note that in this example because there is already a separating populations the
#view the sample IDs
PopNames <- genepop_detective(GenePopData,"Inds")
PopNames_fixed <- paste0(PopNames,"_") #add the underscore seperator
#create a dataframe for renaming.
PopRename_fixed <- data.frame(oPop = PopNames, newname = PopNames_fixed)
#rename populations * note here you will keep the original sample #s (renumber = F)
subset_genepop_rename(genepop= GenePopData, path = paste0(output_dir,"Genepop_renamed_renumbered.txt"),nameframe = PopRename_fixed, renumber = FALSE)
Now lets change group some populations together. In this example we will combine populations DDD/EEE & FFF/HHH and remove populations JJJ/GGG (not listed in column 1 of PopRename). No loci will be removed from this dataframe (subs = NULL). This is useful when you don't want a clustering program (e.g. STRUCTURE or Bayescan) to assume differences among population groups. Function description
#Use genepop_detective to find the meta-data for populations
genepop_detective(GenePopData,"Pops")
#create a dataframe for renaming.
#column 1 = the original pop names and the list of pops required.
#column 2 = Grouping variables.
PopAggregate <-data.frame(oP = c("AAA","BBB","CCC","DDD","EEE","FFF","HHH","III"),
agname = c("AAA","Pop1","CCC","Pop1","DDD","Pop2","HHH","Pop2"))
#re-cast Genepop format to group populations based on 'PopAggregate'
subset_genepop_aggregate(genepop= GenePopData, subs = NULL, path = paste0(output_dir,"Genepop_grouped.txt"),agpopframe = PopAggregate)
Now that we have two grouped populations, we can use the population rename function to give them a common name.
#read in the grouped Genepop file
GenePopData2 <-read.table(paste0(output_dir,"Genepop_grouped.txt"),header = FALSE, sep = "\t", quote = "", stringsAsFactors = FALSE)
#investigate the population parameters
genepop_detective(GenePopData2,"Pops")
#we can see that "BBB" & "DDD" and "FFF" & "III" are at the end but still have different names.
#create renaming frame
PopRename <- data.frame(oPop = c("AAA","CCC","EEE","HHH","BBB","DDD","FFF","III"),
newname = c("AAA","CCC","EEE","HHH","Group1","Group1","Group2","Group2"))
#Rename the grouped populations
subset_genepop_rename(genepop= GenePopData2, nameframe = PopRename, path = paste0(output_dir,"Genepop_grouped_renamed.txt"))
Remove some individual samples from the dataframe. Here we consider the 'individual' ID to be an alpha-numeric code which at the start of each row of data. The keep parameter defines whether the specified list is removed or retained (default: keep = FALSE). This function is best run at the beginning of the analysis prior to removal of specific loci or populations. Function description
#Use genepop_detective to see the naming structure.
genepop_detective(GenePopData,"Inds")
#vector of sample IDs to remove
subid <- c("AAA_1","AAA_3","BBB_20","CCC_21","EEE_3","EEE_26","HHH_25","JJJ_4")
subset_genepop_individual(genepop= GenePopData, indiv = subid, keep = FALSE, path = paste0(output_dir,"Genepop_IDsubset.txt"))
Reorder the genepop file sequentially according to a specified ordering (see below for examples). Function description
#Identify populaton names
pops <- genepop_detective(genepop="Genepop_IDsubset.txt")
#reorder populatons. In this example the first and last example will be switched
genepop_reorder(genepop="Genepop_IDsubset.txt",reorder=pops[c(length(pops),2:(length(pops)-1),1)]),path=paste0(output_dir,"GenePop_reordered.txt")
Function which will allow you to create subsets of the genepop. This might be useful for creating training and leave-out datasets for assignment. Function description
#view how many individuals you have in each population
genepop_detective(GenePopData,"PopNum")
There are several ways we can sub-sample each population
Use a fixed number (e.g. 5 from each population)
SubSamp <- genepop_sample(GenePopData, nsample = 5)
or
Use a fixed fraction (e.g. 25% from each population)
SubSamp <- genepop_sample(GenePopData, nsample = 0.25)
or
Use a fraction which varies for each population
#vector of populations in the Genepop file
Pops <- genepop_detective(GenePopData,"Pops")
#create a dataframe which defines how many you want to sample from each population. Note here we create a column which has an entry for each population.
subdf <- data.frame(Pops = Pops, nsample = c(5, 2, 3, 5, 6, 5, 2, 3, 5, 6))
SubSamp <- genepop_sample(GenePopData, nsample = subdf)
or
Use a fraction which varies for each population
#vector of populations in the Genepop file
Pops <- genepop_detective(GenePopData,"Pops")
#create a dataframe which defines how many you want to sample from each population. Note here we create a column which has an entry for each population.
subdf <- data.frame(Pops = Pops, nsample = c(5, 2, 3, 5, 6, 5, 2, 3, 5, 6)/10)
SubSamp <- genepop_sample(GenePopData, nsample = subdf)
Once you select a method to sub-sample each population, you can use the function subset-genepop-indiv to create the sampled genepop file.
For example if you wanted to create a training and assignment dataset in genepop format using the random stratified sampled we selected using subset-genepop-indiv:
#Create training dataset (keep = TRUE)
subset_genepop_individual(GenePopData, indiv = SubSamp, keep = TRUE, path = paste0(output_dir,"Genepop_training.txt"))
#Create an assignment dataset using the remaining individuals (keep = FALSE)
subset_genepop_individual(GenePopData, indiv = SubSamp, keep = FALSE, path = paste0(output_dir,"Genepop_assignment.txt"))
If you are interested in investigating population structure you can convert your modified Genepop object or path to saved file directly into a STRUCTURE formatted text (.str) file <>. Function description
#convert Genepop format to STRUCTURE to default groupings
genepop_structure(genepop="Genepop_IDsubset.txt",path = paste0(output_dir,"Sturcture_IDsubset_groups.txt"))
# Specify population groupings (group CCC/DDD/EEE and III/JJJ)
pGroups <- data.frame(pops = c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ"),groups = c("1","2","3","3","3","4","5","6","7","7"))
#convert Genepop format to STRUCTURE with new groupings
genepop_structure(genepop="Genepop_IDsubset.txt",popgroup = pGroups, path = paste0(output_dir,"Sturcture_IDsubset_groups.txt"))
If you are interested in doing calculating gene diversities and-or F-statistics with the cleaned loci datasets, you can convert from Genepop to FSTAT format This data can be used by the R package hierfstat or the program FSTAT Function description
#convert Genepop format to FSTAT (.dat)
genepop_fstat(genepop="Genepop_IDsubset.txt",path = paste0(output_dir,"Fstat_IDsubset.dat"))
#convert Genepop format to FSTAT but keep the data in the workspace
genepop_fstat(genepop="Genepop_IDsubset.txt",addworkspace = TRUE)
If you are interested in testing for hybridization you can convert from Genepop to the format necessary for the program New Hybrids Function description
#convert Genepop format to New Hybrids format (.txt)
genepop_newhybrids(genepop="Genepop_IDsubset.txt",path = paste0(output_dir,"NewHybrids_IDsubset.txt"))
If you are interested conducting assignment analysis, you can convert to the format required for the R package assigner Function description
#convert Genepop format to assigner format (.txt) using the populations each as their own assessment level
genepop_assigner(genepop="Genepop_IDsubset.txt",path = paste0(output_dir,"assigner_IDsubset.txt"))
# Set new population assignment groupings (group CCC/DDD/EEE and III/JJJ)
pGroups <- data.frame(pops = c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ"),groups = c("1","2","3","3","3","4","5","6","7","7"))
genepop_assigner(genepop="Genepop_IDsubset.txt",popgroup = pGroups, path = paste0(output_dir,"assigner_IDsubset_NewGroups.txt"))
If you are interested in investigating sibship or parentage, you can convert to the format required for the software Colony. Returned are functions for translating the Colony output and those required by the maximum likelihood analysis. 'Individual' and 'Loci' conversion files can be used to match the loci and individual names formatted for input to Colony to those provided as input by the user.'MarkerTypeErrorRT' file defines the allele dropout rates as calculated using the 'missing' function in PLINK. Rates of other error types are assumed to be equal to the estimated allele dropout rates. 'GENOTYPES' are the geneotypes in two columns per loci format. Output from Colony is based on the input for of the 'GENOTYPES' file. Original loci and individual names can be assigned using the conversion output files. All files are returned to path and are named according tot he input filename. Function description
#convert Genepop format to the files necessary for Colony
genepop_colony(genepop="Genepop_IDsubset.txt",where.plink="c:/Users/YOURNAME/Documents/Programs/plink/",where.pgdspider="c:/Users/YOURNAME/Documents/Programs/PGDSpider_2.0.9.0/",denote.missing = "000",path = output_dir)
genepopedit is one of the few tools available to convert data to the format required for input into to the Bayesian estimation of Genomic Clines (BGC) format. BGC can evalute genomic clinal patterns and introgression among loci. To convert to BGC format you must specify which populations you consider to be ancestral or parental (P1, P2) and which you assume could be hybridized (admixed). Function description
#specify which populations are going to be included in the analysis and to which class they belong. Note Pops identified in P1 and-or P2 can also be specified as "Admixed" to test BGC output.
BGC_groups = data.frame(pops = c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ"),groups = c("P1","P1","P1","Admixed","Admixed","Admixed","Admixed","P2","P2","P2"))
#convert Genepop to BGC input files (3). Note in this case the variable path is a path to the directory where the input files will be saved.
genepop_bgc(genepop="Genepop_IDsubset.txt",popdef = BGC_groups, fname="BGC_IDsubset",path = output_dir)
genepopedit is one of the few tools available to convert data to the format required for input into the Treemix format. It creates a gzipped file that clusters your individuals based on the population ID of the individual's code. Population IDs are extracted using the "_" to differentiate population from sample ID (e.g. BON_01 is population BON sample 01). If samples are not separated refer to _genepopID. Grouping levels for populations can also be changed using _subset_genepoprename. Note that this is just the first of two steps in the process; the gzipped output from this function needs to be run through the Python script (Python 2.7+ must be installed first) that accompanies Treemix. This output from "plink2treemix.py" is then ready to run in Treemix. This Python script can be downloaded from link. Treemix is used to infer migration weight and directionality among your populations, as well as detecting population splits and mixtures. Function description
#convert Genepop to Treemix input and keep intermediary conversion files (default: FALSE)
genepop_treemix(genepop="Genepop_IDsubset.txt",where.plink="c:/Users/YOURNAME/Documents/Programs/plink/",where.pgdspider="c:/Users/YOURNAME/Documents/Programs/PGDSpider_2.0.9.0/",keep_inter = TRUE, path = paste0(output_dir,"Treemix_IDsubset.txt"))
If you are interested in assessing the accuracy of a genetic stock identification analysis, given a genetic baseline, you can convert directly from GENEPOP to a GSI_sim formatted text file. genepopedit is among the few tools currenlty available to format data for GSI_sim. Function description
#convert Genepop to GSIsim format
genepop_GSIsim(genepop="Genepop_IDsubset.txt",path = paste0(output_dir,"GSIsim_IDsubset.txt"))
If you are interested in exploring clines in allele frequency over distance, you can convert from Genepop to HZAR format.This data can be used by the R package hzar Function description
#convert Genepop format to HZAR (.csz)
genepop_hzar(genepop="Genepop_IDsubset.txt",distances=data.frame(Pop=genepop_detective("Genepop_IDsubset.txt"), Distance=c(5,8,22,3,84,39,45.5,67,101),stringsAsFactors=FALSE), path = paste0(output_dir,"HZAR_Input.csv"))
Flatten your geneotype data for other calculations. This function will convert and return the genepop format into a dataframe. Because this function returns the dataframe, it needs to be assigned a variable ID. Note that this function does not have any subsetting capabilities. If you would like to remove loci and-or populations, this must be before hand using subset_genepop or its sister functions (see below for examples). Function description
#Flatten the dataframe
GenePop_df<- genepop_flatten(GenePopData)
#inspect the output for the first 10 columns
head(GenePop_df[,1:10])
If you are working with a flattened dataframe in your workspace you can convert it back to Genepop format. **Note that the first column of this data.frame should correspond to the sampleID (e.g. "BON_01") and the remaining columns should be Loci. Function description
# create a Genepop format which can be used by other genepopedit functions
genepop_unflatten(GenePop_df, path = paste0(output_dir,"GenePop_UNFLATTENED.txt"))
There are a broad range of data formats which can be converted to and from using PGDspider. If you have multiple files which you are going to be converting using the same conversion parameters, PGDspideR.R can be useful, providing a code based interface to exploit the conversion functions of PGDspider. Function description
#convert between GENEPOP and FSTAT format using PGDspider
pgdSpideR(input = paste0(output_dir,"Genepop_IDsubset.txt"),
input_format="GENEPOP",
output = paste0(output_dir,"Genepop_IDsubset_FSTAT.dat",
output_format="FSTAT",
spid="c:/Users/YOURNAME/Documents/spids/GENEPOP_FSTAT.spid",
where.pgdspider="c:/Users/YOURNAME/Documents/Programs/PGDSpider_2.0.9.0/")
Individual geneotypes can be simulated using the _alleleotypegenepop() function and formatted to GENEPOP for use in conventional genomic analyses (e.g. assigner, DAPC, STRUCTURE). This function will simulate individuals for each population provided by the input file. The output can then be diagnosed, sampled, and formatted using other genepopedit functions. Function description
#simulate individual geneotypes allele frequencies stratified by population.
#Create an example pooled DNA input file
pooledDNA <- data.frame(Pop = c("Pop1","Pop2","Pop3","Pop4","Pop5"),Loci1 = round(runif(5, 0, 1),2),Loci2 = round(runif(5, 0, 1),2),Loci3 = round(runif(5, 0, 1),2),Loci4 = round(runif(5, 0, 1),2),Loci5 = round(runif(5, 0, 1),2),Loci6 = round(runif(5, 0, 1),2))
#inspect the input
pooledDNA
#simulate 200 individual geneotypes for each population
alleleotype_genepop(pooledDNA, numsim = 200, path = paste0(output_dir,"SimulatedGeneotypes.txt"))
#To validate output simulate 100 (default) geneotypes per population and compare allele frequencies to those of the input file. Estimated frequencies per population and SNP should match those of the input file within +/- 1%.
alleleotype_genepop(pooledDNA, numsim = 100, path = paste0(output_dir,"SimulatedGeneotypesValidate.txt"))
genepop_allelefreq(paste0(output_dir,"SimulatedGeneotypesValidate.txt"))
#Note you can simulate new geneotypes using the summary from genepop_allelefreq(... , wide = TRUE).
GeneFreq <- genepop_allelefreq("Genepop_IDsubset.txt",wide = TRUE)
alleleotype_genepop(GeneFreq, numsim = 100, path = paste0(output_dir,"SimulatedGeneotypes_Genepop_IDsubset.txt"))