signaturescience / skater

SKATE R Utilities
https://signaturescience.github.io/skater/
Other
9 stars 5 forks source link

edits to read_ibd and ibd2kin and read_map #45

Closed vpnagraj closed 3 years ago

vpnagraj commented 3 years ago

this PR brings in some critical changes to read_ibd(), ibd2kin(), and read_map():

fixes #37 and fixes #44

vpnagraj commented 3 years ago

code below demonstrates usage with simulated .seg data included in the package. this example also compares the impact of different genetic maps ("hapmap" vs "bherer") on the kinship coefficient retrieved from simulated segments:

library(tidyverse)
library(skater)

plinkmaps <- list(hapmap = read_map("/data/projects/skate/data/hapmap/plink.allchr.GRCh37.map", source = "hapmap"),
                  bherer = read_map("/data/projects/skate/data/pedsim_files/refined_mf.simmap", source = "bherer"))

## read in segment file from ped sim
## split so that returned value is a list
seglikeibd <- read_ibd(system.file("extdata", "GBR.sim.seg.gz", package="skater"),
                       source = "pedsim",
                       split = TRUE)

degree_resolution <- 3

truth <- read_fam(system.file("extdata", "GBR.sim.fam", package="skater")) %>%
  fam2ped() %>%
  mutate(x=map(ped, ped2kinpair)) %>%
  select(x) %>%
  unnest(x) %>%
  mutate(d.truth=kin2degree(k, max_degree=degree_resolution)) %>%
  rename(k.truth=k) %>%
  print()

res <- list()

for(map_name in names(plinkmaps)) {

  plinkmap <- plinkmaps[[map_name]]

  ibd1_dat <- ibd2kin(.ibd_data = seglikeibd$IBD1, .map = plinkmap, type = "IBD1") 
  ibd2_dat <- ibd2kin(.ibd_data = seglikeibd$IBD2, .map = plinkmap, type = "IBD2")

  res[[map_name]] <-
    bind_rows(ibd1_dat,ibd2_dat) %>%
    group_by(id1,id2) %>%
    summarise(kinship = sum(kinship)) %>%
    mutate(degree = kin2degree(kinship)) %>%
    left_join(truth) %>%
    mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated")))) %>%
    mutate(map = map_name)

}

res <- do.call("rbind", res)

res %>%
  ggplot(aes(kinship)) +
  geom_density(aes(fill = d.truth), alpha = 0.5) +
  geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) + 
  geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=rev(x), col=k.theoretical), lty=1, alpha=.6) +
  scale_color_manual(values=rev(scales::hue_pal()(degree_resolution+1)[1:degree_resolution])) + 
  theme_classic() + 
  scale_x_continuous(breaks=.5/2^(1:degree_resolution)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme_classic() +
  labs(x = "k") +
  facet_wrap(~ map, ncol = 1)

bherer_v_hapmap

vpnagraj commented 3 years ago

@stephenturner over to you for review on this one.

curious to hear your thoughts on the read_ibd() %>% ibd2kin() workflow for handling IBD1 vs IBD2.

it works, but feels a little clunky. only way i can think to improve beyond where its at now is to do some significant refactoring of internal code in ibd2kin(). would prefer to not do that ... but if the usage is too annoying then im up for it.

stephenturner commented 3 years ago

@stephenturner over to you for review on this one.

curious to hear your thoughts on the read_ibd() %>% ibd2kin() workflow for handling IBD1 vs IBD2.

it works, but feels a little clunky. only way i can think to improve beyond where its at now is to do some significant refactoring of internal code in ibd2kin(). would prefer to not do that ... but if the usage is too annoying then im up for it.

Let's walk through this tomorrow. I see what it's doing, and we don't need this code to be super elegant. Clunky? Maybe. "Works" and I'm satisfied for now.

vpnagraj commented 3 years ago

think ive addressed everything discussed in recent commits:

  1. read_map() no longer has source argument. expects "Plink" map input format.
  2. read_ibd() will always split if source == "pedsim"
  3. read_ibd() now actually returns an empty tibble (instead of a tibble with one row of NAs) given empty file input

examples for running read_ibd() %>% ibd2kin() ... plots not shown ... for simulated .seg:

library(tidyverse)
library(skater)

plinkmap <- read_map("/data/projects/skate/data/pedsim_files/averaged.simmap")

## read in segment file from ped sim
seglikeibd <- read_ibd(system.file("extdata", "GBR.sim.seg.gz", package="skater"), source = "pedsim")

degree_resolution <- 3

truth <- read_fam(system.file("extdata", "GBR.sim.fam", package="skater")) %>%
  fam2ped() %>%
  mutate(x=map(ped, ped2kinpair)) %>%
  select(x) %>%
  unnest(x) %>%
  mutate(d.truth=kin2degree(k, max_degree=degree_resolution)) %>%
  rename(k.truth=k) %>%
  print()

ibd1_dat <- ibd2kin(.ibd_data = seglikeibd$IBD1, .map = plinkmap, type = "IBD1") 
ibd2_dat <- ibd2kin(.ibd_data = seglikeibd$IBD2, .map = plinkmap, type = "IBD2")

res <-
  bind_rows(ibd1_dat,ibd2_dat) %>%
  group_by(id1,id2) %>%
  summarise(kinship = sum(kinship)) %>%
  mutate(degree = kin2degree(kinship)) %>%
  left_join(truth) %>%
  mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated"))))

res %>%
  ggplot(aes(kinship)) +
  geom_density(aes(fill = d.truth), alpha = 0.5) +
  geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) + 
  geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=rev(x), col=k.theoretical), lty=1, alpha=.6) +
  scale_color_manual(values=rev(scales::hue_pal()(degree_resolution+1)[1:degree_resolution])) + 
  theme_classic() + 
  scale_x_continuous(breaks=.5/2^(1:degree_resolution)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme_classic() +
  labs(x = "k")

and for hapibd segments:

## read in segment file from hapibd
hapibd <- read_ibd(system.file("extdata", "GBR.sim.ibd.gz", package="skater"), source = "hapibd")

ibd_dat <- ibd2kin(.ibd_data = hapibd, .map = plinkmap) 

res <-
  ibd_dat %>%
  mutate(degree = kin2degree(kinship)) %>%
  left_join(truth) %>%
  mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated"))))

res %>%
  ggplot(aes(kinship)) +
  geom_density(aes(fill = d.truth), alpha = 0.5) +
  geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) + 
  geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=rev(x), col=k.theoretical), lty=1, alpha=.6) +
  scale_color_manual(values=rev(scales::hue_pal()(degree_resolution+1)[1:degree_resolution])) + 
  theme_classic() + 
  scale_x_continuous(breaks=.5/2^(1:degree_resolution)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme_classic() +
  labs(x = "k")
stephenturner commented 3 years ago

@vpnagraj this is GTG by me. Take a look at one change I made in https://github.com/signaturescience/skater/commit/c874b406966514c66b3b7718ba358dc1a4ba90cb, see if this looks good to you.

Those hap-ibd tables can be hundreds of thousands of lines long. Reading in the whole table (twice), the first time just to check that nrow>0 isn't necessary. I changed this to read in a single line, and if the file is empty return a FALSE. I think the end result is the same.

some independent sanity checking

# Create an empty file and read it in
emptyfile <- tempfile()
file.create(emptyfile)

# Create a 500k row by 10 column file
bigfile <- tempfile()
readr::write_tsv(as.data.frame(matrix(rnorm(5e6), ncol=10)), file=bigfile)

# An empty file
# The new way
            length(readr::read_lines(emptyfile, n_max=1L))>0
system.time(length(readr::read_lines(emptyfile, n_max=1L))>0)
# The old way
            nrow(readr::read_tsv(emptyfile, col_names = FALSE))>0
system.time(nrow(readr::read_tsv(emptyfile, col_names = FALSE))>0)

# A big file
# The new way
            length(readr::read_lines(bigfile, n_max=1L))>0
system.time(length(readr::read_lines(bigfile, n_max=1L))>0)
# The old way
            nrow(readr::read_tsv(bigfile, col_names = FALSE))>0
system.time(nrow(readr::read_tsv(bigfile, col_names = FALSE))>0)