Closed gtonkinhill closed 8 years ago
Are we going for tidy data.frames here, so the output can be piped easily into ggplot2? I.e. for k = 2 with 3 samples A, B, C:
| sample.id | membership | proportion |
|-----------|------------|------------|
| A | k1 | 0.5 |
| A | k2 | 0.5 |
| B | k1 | 0 |
| B | k2 | 1 |
| C | k1 | 0.7 |
| C | k2 | 0.3 |
Yes, I think so. But add further columns that specify the run parameters, so we can have multiple values of k.
i.e.

| sample.id | membership | proportion | K |
|-----------|------------|------------|---|
| A | k1 | 0.5 | 2 |
| A | k2 | 0.5 | 2 |
| B | k1 | 0 | 2 |
| B | k2 | 1 | 2 |
| C | k1 | 0.7 | 2 |
| C | k2 | 0.3 | 2 |
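To make the proposed layout concrete, here is a small sketch that builds the toy table from this thread as a data.frame, checks that the proportions sum to 1 per sample and K, and draws a stacked bar plot. The object names (`tidy`, `totals`, `p`) are made up for illustration, and the plot is only built if ggplot2 is installed.

```r
# Toy data copied from the example table in this thread (k = 2, samples A, B, C).
tidy <- data.frame(
  sample.id  = rep(c("A", "B", "C"), each = 2),
  membership = rep(c("k1", "k2"), times = 3),
  proportion = c(0.5, 0.5, 0, 1, 0.7, 0.3),
  K          = 2,
  stringsAsFactors = FALSE
)

# Sanity check: proportions should sum to 1 within each sample and value of K.
totals <- aggregate(proportion ~ sample.id + K, data = tidy, FUN = sum)

# Stacked bar plot with one panel per K; skipped if ggplot2 is not available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    tidy,
    ggplot2::aes(x = sample.id, y = proportion, fill = membership)
  ) +
    ggplot2::geom_col() +
    ggplot2::facet_wrap(~ K)
}
```

Because K is an ordinary column, results for several values of k can be stacked with `rbind` and the same plotting code gives one panel per run.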
Some initial code:
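library(data.table)  # provides fread, melt and rbindlist

# Sample metadata: the first six columns of the PLINK .ped file given to Admixture
admixtureInput <- read.table("./admixtureOut.ped", quote="\"", comment.char="")
colnames(admixtureInput)[1:6] <- c("FamilyID", "SampleID", "Paternal", "Maternal", "Sex", "Affection")

# Read every Q file in the working directory
files <- Sys.glob("./*.Q")
QData <- lapply(files, fread, sep="auto", header=FALSE)

QData <- lapply(QData, function(dt){
    df <- data.frame(dt)
    K <- ncol(df)  # a Q file has one column per cluster, so K is the column count
    df <- cbind(admixtureInput[1:6], df)  # relies on Q rows being in .ped order
    df$K <- rep(K, nrow(df))
    # Convert to a data.table so that data.table's own melt method is used
    df <- melt(as.data.table(df), id.vars=c(colnames(admixtureInput)[1:6], "K"), variable.name="Cluster")
    return(df)
})

# Stack the runs for all values of K into one long table
QData.df <- rbindlist(QData)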
I think I've resolved this for the most part - quick q how do you want to handle the SNP data, i.e. the P files? See commit https://github.com/sa-lee/starmie/commit/d64c719300aa2e305a078b4ede8ce4c02772f845
It looks good! I think I preferred checking k by looking at the file rather than the file name however. Then it is less reliant on people using sensible naming strategies.
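A minimal sketch of reading k from the file contents rather than the file name: a Q file has one row per individual and one column per cluster, so k is simply the number of columns. `inferK` is a hypothetical helper name, not a function from starmie.

```r
# Hypothetical helper: infer K from the number of columns in a Q file.
inferK <- function(qfile) {
  first <- read.table(qfile, header = FALSE, nrows = 1)
  ncol(first)
}

# Toy Q file with a deliberately uninformative name (3 individuals, 2 clusters).
qfile <- tempfile(fileext = ".Q")
writeLines(c("0.5 0.5", "0 1", "0.7 0.3"), qfile)

K <- inferK(qfile)
```

Only the first row is read, so this stays cheap even for large cohorts.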
I agree that's probably safer. Feel free to update. At the moment the k argument is kinda redundant so we could get the user to pass it in so we know which admixture runs have been done.
Also what are your thoughts on using data.table's fread?
I'm open to using it. Generally speaking I prefer dplyr/readr for readability's sake, but if you're more comfortable with data.table I'm happy to switch up my style.
Ah, I was unaware of readr. Usually I use data.table's fread and output it as a data.frame for use with dplyr etc.
Perhaps it's best to keep within the Hadley ecosystem.
I think we can close this now, see: https://github.com/sa-lee/starmie/blob/dfe78f42294f3fd09c7a864571239cffb2eb3047/R/loadAdmixture.R
Looks pretty good. What sample data needs to be provided first? Does it require a sample ID for each input file to be in the same order?
Fair point. The sample data needs to be loaded into the starmie object before reading the admixture files. The rows of the admixture Q files will be in the same order as the individual IDs in a PLINK fam/ped file. I don't think this will be an issue, since one starmie object contains information for an entire cohort.
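As a rough illustration of relying on row order, the sketch below attaches the individual IDs from a PLINK fam file (second column) to the rows of a Q file by position, and stops if the row counts differ. `attachSampleIds` and the toy files are assumptions made for this example; this is not how starmie itself does it.

```r
# Hypothetical helper: label Q rows with sample IDs taken from a fam file,
# assuming both files list individuals in the same order.
attachSampleIds <- function(famfile, qfile) {
  fam <- read.table(famfile, header = FALSE, stringsAsFactors = FALSE)
  q   <- read.table(qfile, header = FALSE)
  if (nrow(fam) != nrow(q)) {
    stop("fam file and Q file have different numbers of individuals")
  }
  # Column 2 of a fam file is the within-family (individual) ID.
  data.frame(sample.id = fam[[2]], q, stringsAsFactors = FALSE)
}

# Toy inputs: 3 individuals, K = 2.
famfile <- tempfile(fileext = ".fam")
writeLines(c("F1 A 0 0 1 -9", "F2 B 0 0 2 -9", "F3 C 0 0 1 -9"), famfile)
qfile <- tempfile(fileext = ".Q")
writeLines(c("0.5 0.5", "0 1", "0.7 0.3"), qfile)

labelled <- attachSampleIds(famfile, qfile)
```

The row-count check catches a mismatched cohort, but it cannot detect files that have the right length and a different order, so the ordering assumption still has to hold.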
okay sounds good!
Added an example to the docs. I'll close this for now but there'll be bugs in the future haha.
A function to load multiple Admixture output files into a useful data frame.
Input: Multiple Admixture .P and .Q files, plus the initial input file fed to Admixture
Output: A sensible data frame that makes for easy downstream analysis