Closed vpnagraj closed 3 years ago
code below demonstrates usage with simulated .seg
data included in the package. this example also compares the impact of different genetic maps ("hapmap" vs "bherer") on the kinship coefficient retrieved from simulated segments:
library(tidyverse)
library(skater)
plinkmaps <- list(hapmap = read_map("/data/projects/skate/data/hapmap/plink.allchr.GRCh37.map", source = "hapmap"),
bherer = read_map("/data/projects/skate/data/pedsim_files/refined_mf.simmap", source = "bherer"))
## read in segment file from ped sim
## split so that returned value is a list
seglikeibd <- read_ibd(system.file("extdata", "GBR.sim.seg.gz", package="skater"),
source = "pedsim",
split = TRUE)
degree_resolution <- 3
truth <- read_fam(system.file("extdata", "GBR.sim.fam", package="skater")) %>%
fam2ped() %>%
mutate(x=map(ped, ped2kinpair)) %>%
select(x) %>%
unnest(x) %>%
mutate(d.truth=kin2degree(k, max_degree=degree_resolution)) %>%
rename(k.truth=k) %>%
print()
res <- list()
for(map_name in names(plinkmaps)) {
plinkmap <- plinkmaps[[map_name]]
ibd1_dat <- ibd2kin(.ibd_data = seglikeibd$IBD1, .map = plinkmap, type = "IBD1")
ibd2_dat <- ibd2kin(.ibd_data = seglikeibd$IBD2, .map = plinkmap, type = "IBD2")
res[[map_name]] <-
bind_rows(ibd1_dat,ibd2_dat) %>%
group_by(id1,id2) %>%
summarise(kinship = sum(kinship)) %>%
mutate(degree = kin2degree(kinship)) %>%
left_join(truth) %>%
mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated")))) %>%
mutate(map = map_name)
}
res <- do.call("rbind", res)
res %>%
ggplot(aes(kinship)) +
geom_density(aes(fill = d.truth), alpha = 0.5) +
geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) +
geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=rev(x), col=k.theoretical), lty=1, alpha=.6) +
scale_color_manual(values=rev(scales::hue_pal()(degree_resolution+1)[1:degree_resolution])) +
theme_classic() +
scale_x_continuous(breaks=.5/2^(1:degree_resolution)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme_classic() +
labs(x = "k") +
facet_wrap(~ map, ncol = 1)
@stephenturner over to you for review on this one.
curious to hear your thoughts on the read_ibd() %>% ibd2kin()
workflow for handling IBD1 vs IBD2.
it works, but feels a little clunky. only way i can think to improve beyond where its at now is to do some significant refactoring of internal code in ibd2kin()
. would prefer to not do that ... but if the usage is too annoying then im up for it.
@stephenturner over to you for review on this one.
curious to hear your thoughts on the
read_ibd() %>% ibd2kin()
workflow for handling IBD1 vs IBD2.it works, but feels a little clunky. only way i can think to improve beyond where its at now is to do some significant refactoring of internal code in
ibd2kin()
. would prefer to not do that ... but if the usage is too annoying then im up for it.
Let's walk through this tomorrow. I see what it's doing, and we don't need this code to be super elegant. Clunky? Maybe. "Works" and I'm satisfied for now.
think ive addressed everything discussed in recent commits:
read_map()
no longer has source argument. expects "Plink" map input format.read_ibd()
will always split if source == "pedsim"read_ibd()
now actually returns an empty tibble (instead of a tibble with one row of NAs) given empty file inputexamples for running read_ibd() %>% ibd2kin()
... plots not shown ... for simulated .seg
:
library(tidyverse)
library(skater)
plinkmap <- read_map("/data/projects/skate/data/pedsim_files/averaged.simmap")
## read in segment file from ped sim
seglikeibd <- read_ibd(system.file("extdata", "GBR.sim.seg.gz", package="skater"), source = "pedsim")
degree_resolution <- 3
truth <- read_fam(system.file("extdata", "GBR.sim.fam", package="skater")) %>%
fam2ped() %>%
mutate(x=map(ped, ped2kinpair)) %>%
select(x) %>%
unnest(x) %>%
mutate(d.truth=kin2degree(k, max_degree=degree_resolution)) %>%
rename(k.truth=k) %>%
print()
ibd1_dat <- ibd2kin(.ibd_data = seglikeibd$IBD1, .map = plinkmap, type = "IBD1")
ibd2_dat <- ibd2kin(.ibd_data = seglikeibd$IBD2, .map = plinkmap, type = "IBD2")
res <-
bind_rows(ibd1_dat,ibd2_dat) %>%
group_by(id1,id2) %>%
summarise(kinship = sum(kinship)) %>%
mutate(degree = kin2degree(kinship)) %>%
left_join(truth) %>%
mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated"))))
res %>%
ggplot(aes(kinship)) +
geom_density(aes(fill = d.truth), alpha = 0.5) +
geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) +
geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=rev(x), col=k.theoretical), lty=1, alpha=.6) +
scale_color_manual(values=rev(scales::hue_pal()(degree_resolution+1)[1:degree_resolution])) +
theme_classic() +
scale_x_continuous(breaks=.5/2^(1:degree_resolution)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme_classic() +
labs(x = "k")
and for hapibd
segments:
## read in segment file from hapibd
hapibd <- read_ibd(system.file("extdata", "GBR.sim.ibd.gz", package="skater"), source = "hapibd")
ibd_dat <- ibd2kin(.ibd_data = hapibd, .map = plinkmap)
res <-
ibd_dat %>%
mutate(degree = kin2degree(kinship)) %>%
left_join(truth) %>%
mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated"))))
res %>%
ggplot(aes(kinship)) +
geom_density(aes(fill = d.truth), alpha = 0.5) +
geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) +
geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=rev(x), col=k.theoretical), lty=1, alpha=.6) +
scale_color_manual(values=rev(scales::hue_pal()(degree_resolution+1)[1:degree_resolution])) +
theme_classic() +
scale_x_continuous(breaks=.5/2^(1:degree_resolution)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme_classic() +
labs(x = "k")
@vpnagraj this is GTG by me. Take a look at one change I made in https://github.com/signaturescience/skater/commit/c874b406966514c66b3b7718ba358dc1a4ba90cb, see if this looks good to you.
Those hap-ibd tables can be hundreds of thousands of lines long. Reading in the whole table (twice), the first time just to check that nrow>0 isn't necessary. I changed this to read in a single line, and if the file is empty return a FALSE. I think the end result is the same.
some independent sanity checking
# Create an empty file and read it in
emptyfile <- tempfile()
file.create(emptyfile)
# Create a 500k row by 10 column file
bigfile <- tempfile()
readr::write_tsv(as.data.frame(matrix(rnorm(5e6), ncol=10)), file=bigfile)
# An empty file
# The new way
length(readr::read_lines(emptyfile, n_max=1L))>0
system.time(length(readr::read_lines(emptyfile, n_max=1L))>0)
# The old way
nrow(readr::read_tsv(emptyfile, col_names = FALSE))>0
system.time(nrow(readr::read_tsv(emptyfile, col_names = FALSE))>0)
# A big file
# The new way
length(readr::read_lines(bigfile, n_max=1L))>0
system.time(length(readr::read_lines(bigfile, n_max=1L))>0)
# The old way
nrow(readr::read_tsv(bigfile, col_names = FALSE))>0
system.time(nrow(readr::read_tsv(bigfile, col_names = FALSE))>0)
this PR brings in some critical changes to
read_ibd()
,ibd2kin()
, andread_map()
:read_ibd()
will now return an "empty" tibble if the IBD.gz input file from inferred hap-ibd segments is empty.read_ibd()
now conditionally returns a list of IBD1 or IBD2 if "split" isTRUE
(only applies to ped-sim simulated segment input)ibd2kin
can conditionally handle IBD1 vs IBD2 differently ... NOTE at the moment you have to compute IBD1 and IBD2 separately then sum theibd2kin
output from eachread_map()
now accepts a sex-specific "bherer" genetic map (i.e. the one recommended by ped-sim devs) as map "source" ... internally averages the sex-specific mapfixes #37 and fixes #44