sa-lee / starmie

starmie: plotting and inference for population structure models :star2:

Load Admixture Data #1

Closed gtonkinhill closed 8 years ago

gtonkinhill commented 8 years ago

A function to load multiple Admixture output files into a useful data frame.

Input: Multiple Admixture .P and .Q files, plus the initial input file fed to Admixture

Output: A sensible data frame that makes for easy downstream analysis

sa-lee commented 8 years ago

Are we going for tidy data.frames here, so they can be easily piped into ggplot2? E.g. for k = 2 with 3 samples A, B, C:

| sample.id | membership | proportion |
|-----------|------------|------------|
| A | k1 | 0.5 |
| A | k2 | 0.5 |
| B | k1 | 0 |
| B | k2 | 1 |
| C | k1 | 0.7 |
| C | k2 | 0.3 |

gtonkinhill commented 8 years ago

Yes, I think so. But we should have further columns that specify the run parameters, so we can have multiple values of K, i.e.

| sample.id | membership | proportion | K |
|-----------|------------|------------|---|
| A | k1 | 0.5 | 2 |
| A | k2 | 0.5 | 2 |
| B | k1 | 0 | 2 |
| B | k2 | 1 | 2 |
| C | k1 | 0.7 | 2 |
| C | k2 | 0.3 | 2 |
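As a rough illustration of the layout proposed above, here is a minimal base R sketch that reshapes a wide Q matrix into one row per sample and cluster. The values are the toy numbers from this thread; the object names `q`, `sample_ids` and `tidy_q` are made up for the example and are not part of starmie.

```r
# Hypothetical Q matrix for K = 2: one row per sample, one column per cluster
q <- data.frame(k1 = c(0.5, 0, 0.7), k2 = c(0.5, 1, 0.3))
sample_ids <- c("A", "B", "C")  # assumed to be in the same row order as q

K <- ncol(q)

# unlist() walks the columns in order, so sample IDs repeat once per cluster
tidy_q <- data.frame(
  sample.id  = rep(sample_ids, times = K),
  membership = rep(colnames(q), each = nrow(q)),
  proportion = unlist(q, use.names = FALSE),
  K          = K
)

# Sort so each sample's memberships sit together, as in the table above
tidy_q <- tidy_q[order(tidy_q$sample.id, tidy_q$membership), ]
rownames(tidy_q) <- NULL
print(tidy_q)
```

Because every row carries its own `K`, the results of several runs can be stacked with `rbind()` and still be told apart when plotting.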

gtonkinhill commented 8 years ago

Some initial code:

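library(data.table)  # provides fread(), melt() for data.tables, and rbindlist()

# Sample metadata: the first six columns of the PLINK .ped file fed to Admixture
admixtureInput <- read.table("./admixtureOut.ped", quote="\"", comment.char="")
colnames(admixtureInput)[1:6] <- c("FamilyID", "SampleID", "Paternal", "Maternal", "Sex", "Affection")

files <- Sys.glob("./*.Q")

# Each .Q file has one row per sample and one column per cluster
QData <- lapply(files, fread, sep="auto", header=FALSE)
QData <- lapply(QData, function(dt){
  df <- data.frame(dt)
  K <- ncol(df)  # infer K from the file contents, not the file name
  df <- cbind(admixtureInput[1:6], df)
  df$K <- rep(K, nrow(df))
  # Convert back to a data.table so that data.table's melt() method is used
  df <- melt(as.data.table(df), id.vars=c(colnames(admixtureInput)[1:6], "K"), variable.name="Cluster")
  return(df)
})
QData.df <- rbindlist(QData)

sa-lee commented 8 years ago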

I think I've resolved this for the most part. Quick question: how do you want to handle the SNP data, i.e. the P files? See commit https://github.com/sa-lee/starmie/commit/d64c719300aa2e305a078b4ede8ce4c02772f845

gtonkinhill commented 8 years ago

It looks good! However, I think I'd prefer checking k by looking at the file contents rather than the file name. Then it is less reliant on people using sensible naming strategies.
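A minimal sketch of that idea, assuming a whitespace-separated .Q file with one column per cluster. The helper `infer_k` and the toy file are hypothetical and are not the code in the linked commit.

```r
# Write a small hypothetical .Q file (3 samples, K = 2) to a temporary location
q_file <- file.path(tempdir(), "cohort.2.Q")
writeLines(c("0.5 0.5", "0.0 1.0", "0.7 0.3"), q_file)

# Hypothetical helper: K is the number of whitespace-separated fields on the first line
infer_k <- function(path) {
  first_line <- readLines(path, n = 1)
  length(strsplit(trimws(first_line), "[[:space:]]+")[[1]])
}

k_from_contents <- infer_k(q_file)
print(k_from_contents)
```

This gives the same answer no matter what the file is called, which is the point being made here.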

sa-lee commented 8 years ago

I agree, that's probably safer. Feel free to update. At the moment the k argument is kinda redundant, so we could get the user to pass it in so we know which admixture runs have been done.

gtonkinhill commented 8 years ago

Also what are your thoughts on using data.table's fread?

sa-lee commented 8 years ago

I'm open to using it. Generally speaking I prefer using dplyr/readr for readability's sake, but if you're more comfortable with data.table I'm happy to switch up my style.

gtonkinhill commented 8 years ago

Ah, I was unaware of readr. Usually I use data.table's fread and output it as a data.frame for use with dplyr etc.
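For reference, a small sketch of that workflow, assuming the data.table package is installed. `fread()` takes a `data.table = FALSE` argument that returns a plain data.frame; the toy .Q file below is made up for the example.

```r
library(data.table)

# Hypothetical .Q file: 3 samples, K = 2
q_file <- tempfile(fileext = ".Q")
writeLines(c("0.5 0.5", "0.0 1.0", "0.7 0.3"), q_file)

# data.table = FALSE makes fread() return a plain data.frame
q_df <- fread(q_file, header = FALSE, data.table = FALSE)
print(class(q_df))
```

The result can then be passed to dplyr verbs or ggplot2 like any other data.frame.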

gtonkinhill commented 8 years ago

Perhaps it's best to keep within the Hadley ecosystem.

sa-lee commented 8 years ago

I think we can close this now. See: https://github.com/sa-lee/starmie/blob/dfe78f42294f3fd09c7a864571239cffb2eb3047/R/loadAdmixture.R

gtonkinhill commented 8 years ago

Looks pretty good. What sample data needs to be provided first? Does it require the sample IDs for each input file to be in the same order?

sa-lee commented 8 years ago

Fair point. Sample data needs to be loaded into the starmie object before reading the admixture files. The rows of the admixture Q files will be in the same order as the individual IDs in a PLINK fam/ped file. I don't think this will be an issue since one starmie object contains the information for an entire cohort.
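To make the ordering assumption concrete, here is a base R sketch that labels Q rows with sample IDs from a .fam file by position. The file contents and object names are invented for illustration, and this is not how the starmie object stores its data.

```r
# Hypothetical PLINK .fam file: family ID, sample ID, father, mother, sex, phenotype
fam_file <- tempfile(fileext = ".fam")
writeLines(c("F1 A 0 0 1 -9", "F2 B 0 0 2 -9", "F3 C 0 0 1 -9"), fam_file)

# Hypothetical .Q file for the same three samples, K = 2
q_file <- tempfile(fileext = ".Q")
writeLines(c("0.5 0.5", "0.0 1.0", "0.7 0.3"), q_file)

fam <- read.table(fam_file, header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("FamilyID", "SampleID", "Paternal",
                                "Maternal", "Sex", "Affection"))
q <- read.table(q_file, header = FALSE)

# Positional matching is only safe if both files cover the same samples
stopifnot(nrow(fam) == nrow(q))

# Row i of the Q file is assumed to belong to sample i of the .fam file
labelled_q <- cbind(sample.id = fam$SampleID, q)
print(labelled_q)
```

The row count check only catches a mismatch in the number of samples. It cannot detect rows that have been reordered, so the Q files should come from the same PLINK input as the sample data.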

gtonkinhill commented 8 years ago

okay sounds good!

sa-lee commented 8 years ago

Added an example to the docs. I'll close this for now but there'll be bugs in the future haha.