edits to read_ibd and ibd2kin and read_map

vpnagraj commented 3 years ago

this PR brings in some critical changes to read_ibd(), ibd2kin(), and read_map():

read_ibd() will now return an "empty" tibble if the IBD.gz input file from inferred hap-ibd segments is empty.
read_ibd() now conditionally returns a list of IBD1 or IBD2 if "split" is TRUE (only applies to ped-sim simulated segment input)
ibd2kin can conditionally handle IBD1 vs IBD2 differently ... NOTE at the moment you have to compute IBD1 and IBD2 separately then sum the ibd2kin output from each
read_map() now accepts a sex-specific "bherer" genetic map (i.e. the one recommended by ped-sim devs) as map "source" ... internally averages the sex-specific map

fixes #37 and fixes #44

vpnagraj commented 3 years ago

code below demonstrates usage with simulated .seg data included in the package. this example also compares the impact of different genetic maps ("hapmap" vs "bherer") on the kinship coefficient retrieved from simulated segments:

library(tidyverse)
library(skater)

plinkmaps <- list(hapmap = read_map("/data/projects/skate/data/hapmap/plink.allchr.GRCh37.map", source = "hapmap"),
                  bherer = read_map("/data/projects/skate/data/pedsim_files/refined_mf.simmap", source = "bherer"))

## read in segment file from ped sim
## split so that returned value is a list
seglikeibd <- read_ibd(system.file("extdata", "GBR.sim.seg.gz", package="skater"),
                       source = "pedsim",
                       split = TRUE)

degree_resolution <- 3

truth <- read_fam(system.file("extdata", "GBR.sim.fam", package="skater")) %>%
  fam2ped() %>%
  mutate(x=map(ped, ped2kinpair)) %>%
  select(x) %>%
  unnest(x) %>%
  mutate(d.truth=kin2degree(k, max_degree=degree_resolution)) %>%
  rename(k.truth=k) %>%
  print()

res <- list()

for(map_name in names(plinkmaps)) {

  plinkmap <- plinkmaps[[map_name]]

  ibd1_dat <- ibd2kin(.ibd_data = seglikeibd$IBD1, .map = plinkmap, type = "IBD1") 
  ibd2_dat <- ibd2kin(.ibd_data = seglikeibd$IBD2, .map = plinkmap, type = "IBD2")

  res[[map_name]] <-
    bind_rows(ibd1_dat,ibd2_dat) %>%
    group_by(id1,id2) %>%
    summarise(kinship = sum(kinship)) %>%
    mutate(degree = kin2degree(kinship)) %>%
    left_join(truth) %>%
    mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated")))) %>%
    mutate(map = map_name)

}

res <- do.call("rbind", res)

res %>%
  ggplot(aes(kinship)) +
  geom_density(aes(fill = d.truth), alpha = 0.5) +
  geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) + 
  geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=rev(x), col=k.theoretical), lty=1, alpha=.6) +
  scale_color_manual(values=rev(scales::hue_pal()(degree_resolution+1)[1:degree_resolution])) + 
  theme_classic() + 
  scale_x_continuous(breaks=.5/2^(1:degree_resolution)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme_classic() +
  labs(x = "k") +
  facet_wrap(~ map, ncol = 1)

bherer_v_hapmap

vpnagraj commented 3 years ago

@stephenturner over to you for review on this one.

curious to hear your thoughts on the read_ibd() %>% ibd2kin() workflow for handling IBD1 vs IBD2.

it works, but feels a little clunky. only way i can think to improve beyond where its at now is to do some significant refactoring of internal code in ibd2kin(). would prefer to not do that ... but if the usage is too annoying then im up for it.

stephenturner commented 3 years ago

@stephenturner over to you for review on this one.

curious to hear your thoughts on the read_ibd() %>% ibd2kin() workflow for handling IBD1 vs IBD2.

it works, but feels a little clunky. only way i can think to improve beyond where its at now is to do some significant refactoring of internal code in ibd2kin(). would prefer to not do that ... but if the usage is too annoying then im up for it.

Let's walk through this tomorrow. I see what it's doing, and we don't need this code to be super elegant. Clunky? Maybe. "Works" and I'm satisfied for now.

vpnagraj commented 3 years ago

think ive addressed everything discussed in recent commits:

read_map() no longer has source argument. expects "Plink" map input format.
read_ibd() will always split if source == "pedsim"
read_ibd() now actually returns an empty tibble (instead of a tibble with one row of NAs) given empty file input

examples for running read_ibd() %>% ibd2kin() ... plots not shown ... for simulated .seg:

library(tidyverse)
library(skater)

plinkmap <- read_map("/data/projects/skate/data/pedsim_files/averaged.simmap")

## read in segment file from ped sim
seglikeibd <- read_ibd(system.file("extdata", "GBR.sim.seg.gz", package="skater"), source = "pedsim")

degree_resolution <- 3

truth <- read_fam(system.file("extdata", "GBR.sim.fam", package="skater")) %>%
  fam2ped() %>%
  mutate(x=map(ped, ped2kinpair)) %>%
  select(x) %>%
  unnest(x) %>%
  mutate(d.truth=kin2degree(k, max_degree=degree_resolution)) %>%
  rename(k.truth=k) %>%
  print()

ibd1_dat <- ibd2kin(.ibd_data = seglikeibd$IBD1, .map = plinkmap, type = "IBD1") 
ibd2_dat <- ibd2kin(.ibd_data = seglikeibd$IBD2, .map = plinkmap, type = "IBD2")

res <-
  bind_rows(ibd1_dat,ibd2_dat) %>%
  group_by(id1,id2) %>%
  summarise(kinship = sum(kinship)) %>%
  mutate(degree = kin2degree(kinship)) %>%
  left_join(truth) %>%
  mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated"))))

res %>%
  ggplot(aes(kinship)) +
  geom_density(aes(fill = d.truth), alpha = 0.5) +
  geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) + 
  geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=rev(x), col=k.theoretical), lty=1, alpha=.6) +
  scale_color_manual(values=rev(scales::hue_pal()(degree_resolution+1)[1:degree_resolution])) + 
  theme_classic() + 
  scale_x_continuous(breaks=.5/2^(1:degree_resolution)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme_classic() +
  labs(x = "k")

and for hapibd segments:

## read in segment file from hapibd
hapibd <- read_ibd(system.file("extdata", "GBR.sim.ibd.gz", package="skater"), source = "hapibd")

ibd_dat <- ibd2kin(.ibd_data = hapibd, .map = plinkmap) 

res <-
  ibd_dat %>%
  mutate(degree = kin2degree(kinship)) %>%
  left_join(truth) %>%
  mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated"))))

res %>%
  ggplot(aes(kinship)) +
  geom_density(aes(fill = d.truth), alpha = 0.5) +
  geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) + 
  geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=rev(x), col=k.theoretical), lty=1, alpha=.6) +
  scale_color_manual(values=rev(scales::hue_pal()(degree_resolution+1)[1:degree_resolution])) + 
  theme_classic() + 
  scale_x_continuous(breaks=.5/2^(1:degree_resolution)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme_classic() +
  labs(x = "k")

stephenturner commented 3 years ago

@vpnagraj this is GTG by me. Take a look at one change I made in https://github.com/signaturescience/skater/commit/c874b406966514c66b3b7718ba358dc1a4ba90cb, see if this looks good to you.

Those hap-ibd tables can be hundreds of thousands of lines long. Reading in the whole table (twice), the first time just to check that nrow>0 isn't necessary. I changed this to read in a single line, and if the file is empty return a FALSE. I think the end result is the same.

some independent sanity checking

# Create an empty file and read it in
emptyfile <- tempfile()
file.create(emptyfile)

# Create a 500k row by 10 column file
bigfile <- tempfile()
readr::write_tsv(as.data.frame(matrix(rnorm(5e6), ncol=10)), file=bigfile)

# An empty file
# The new way
            length(readr::read_lines(emptyfile, n_max=1L))>0
system.time(length(readr::read_lines(emptyfile, n_max=1L))>0)
# The old way
            nrow(readr::read_tsv(emptyfile, col_names = FALSE))>0
system.time(nrow(readr::read_tsv(emptyfile, col_names = FALSE))>0)

# A big file
# The new way
            length(readr::read_lines(bigfile, n_max=1L))>0
system.time(length(readr::read_lines(bigfile, n_max=1L))>0)
# The old way
            nrow(readr::read_tsv(bigfile, col_names = FALSE))>0
system.time(nrow(readr::read_tsv(bigfile, col_names = FALSE))>0)

signaturescience / skater

edits to read_ibd and ibd2kin and read_map #45