lvclark / polyRAD

Genotype Calling with Uncertainty from Sequencing Data in Polyploids πŸŒπŸ“πŸ₯”πŸ πŸ₯
24 stars 8 forks source link

Export to additional population genetics formats #10

Open lvclark opened 3 years ago

lvclark commented 3 years ago

Overview

There has been some interest in using genotypes called by polyRAD for population genetics analysis. Two pieces of software that have long supported polyploid genotypes are Structure and SPAGeDi. Although they were originally designed for relatively small sets of PCR markers, they both scale up to large sequence-based datasets. They each use a custom format, and it would be nice if polyRAD exported to those formats.

I may eventually get to writing these export functions. If you would like to see them (or others), feel free to comment here! If you have some R experience and would like to write the functions yourself, see below. First-time contributors are welcome!

Contributing

You will need to make your own fork of polyRAD, clone that fork to your computer, commit your changes, push those commits to GitHub, and open a pull request. See the Github documentation if you need help understanding those tasks.

R code

Name the function Export_Structure or Export_SPAGeDi, as appropriate. Add it to the file R/data_export.R.

The first argument for the function should be a RADdata object named object. There should also be an argument called file indicating the path to write to. Another possible argument might indicate a particular ploidy to use for export; see other export functions, and check the documentation for Structure and SPAGeDi to see if variable ploidy is allowed.

The function will need to call GetProbableGenotypes with omit1allelePerLocus = FALSE and multiallelic = "correct". This will get you a list, the first item of which is a matrix with individuals and rows and alleles in columns, showing the copy number of that allele in that individual. This will then need to be converted to a different format, where each allele is assigned an integer, and the individual has a number of alleles up to its ploidy. For example:

Allele_AG Allele_AA Allele_CA
Ind 1 0 3 1
Ind 2 2 2 0

Gets changed to

Allele 1 Allele 2 Allele 3 Allele 4
Ind 1 2 2 2 3
Ind 2 1 1 2 2

Your function will make use of object$alleles2loc in order to group columns of the matrix output by GetProbableGenotypes into loci. A loop in R to process loci might look like:

geno <- GetProbableGenotypes(object, omit1allelePerLocus = FALSE)[[1]]

for(L in seq_len(nLoci(object)){
  theseal <- which(object$alleles2loc == L)
  thesegeno <- geno[, theseal]
  # do some processing here
}

See the MakeGTstrings internal function in src/PrepVCFexport.cpp. For SPAGeDi in particular this function could simply have an argument added to it to start numbering from 1 rather than 0. For Structure, MakeGTstrings should probably not be used directly but can provide some inspiration for what needs to be done.

See the Structure or SPAGeDi documentation for details on file format needed. Your function should export a text file in that format.

Test the function on exampleRAD after running IterateHWE. Install Structure or SPAGeDi on your computer and see if it can import the file.

Documentation and integration

lvclark commented 3 years ago

I have written an Export_Structure function since a user was asking for it, but someone can still write a function to export to SPAGeDi if interested.

lvclark commented 3 years ago

Another format which might be appreciated by plant breeders is HapMap. To the best of my knowledge, there isn't software that can import a HapMap file with polyploid genotypes, but the format has the advantage of being easily comprehensible to the human eye. A tetraploid SNP genotype could for example be represented as AAAG. The Hap2SNP function should be very helpful for converting polyRAD's haplotype-based alleles back to SNPs. See also the above note about GetProbableGenotypes and MakeGTstrings.

lvclark commented 3 years ago

There are also requests for export to GenoDive and to adegenet::genind objects.