signaturescience / skater

SKATE R Utilities
https://signaturescience.github.io/skater/
Other
9 stars 5 forks source link

estimate kinship from IBD segments (both simulated and detected) #37

Closed stephenturner closed 3 years ago

stephenturner commented 3 years ago

During the call A.W. brought up an idea that sounds reasonable -- comparing the calculated relatedness (kinship) against the actual simulated kinship based on the ibd segments ped-sim simulates, not just the expected kinship from the fam file. I think this can be done with the .seg file output by ped-sim. This is documented here: https://github.com/williamslab/ped-sim#output-ibd-segments-file.

I'm attaching a .fam file and a .seg file (compressed) from a simple 5 generation pedigree (not simulating any genetic data) using a pedigree definition that looks like this:

def 5gens 1 5
1 1
2 2
3 2
4 2
5 2

With a simple ped-sim simulation (that map and XO interference file are in /data/skate/skatesim):

ped-sim --fam -d 5gens.def -m pedsim/refined_mf.simmap -o 5gens --intf pedsim/sex_av_nu_p_campbell.tsv

This is the pedigree it produces:

image

You can make that plot and then count the number of degree relationships with a little skater code:

library(tidyverse)
library(skater)

ped <- read_fam("5gens-everyone.fam.gz") %>%
  fam2ped()

# plot pedigree
ped %>%
  pull(ped) %>%
  pluck(1) %>%
  kinship2::plot.pedigree(id=rep(NA, length(.$id)))

# Count types of relationship pairs
ped %>% 
  mutate(x=map(ped, ped2kinpair)) %>%
  unnest(x) %>%
  mutate(degree=kin2degree(k, max_degree=8L)) %>%
  count(k, degree) %>%
  arrange(degree)
# A tibble: 9 x 3
        k degree     n
    <dbl>  <int> <int>
1 0.5          0    24
2 0.25         1    44
3 0.125        2    44
4 0.0625       3    36
5 0.0312       4    28
6 0.0156       5    12
7 0.00781      6     8
8 0.00391      7     4
9 0           NA   100

You can also read in that .seg file pretty easily as well. I started summing up the IBD1/2 segments, but not sure this makes sense.

seg <- read_tsv("5gens.seg.gz", 
                col_types="ccciicddd",
                col_names=c("id1", "id2", "chr", "pstart", "pend", "type", "gstart", "gend", "glength"))
seg %>%
  group_by(chr, id1, id2, type) %>%
  summarize(s=sum(glength), .groups="drop") %>%
  spread(type, s, fill=0)

@vpnagraj I'd like to assign this to you to at least take a look at. I think you did something very similar for getting k from hap-ibd, over at https://github.com/signaturescience/skatebench/blob/a6d6c6dc64c05639bfbd42fe9b9cadcccf529d3b/hap-ibd-test/hap-ibd-relatedness.R. If you have any ideas about how to get something working push up to a new branch, and maybe include the .seg and .fam in inst/extdata to use as examples. cc @genignored

vpnagraj commented 3 years ago

@stephenturner some progress here. the helpers ive been working with seem to do the trick. long thread with all the code ...

first define the two helper functions (eventually to live in skater?):

## helper function to interpolate over segments
## inspired by py code http://faculty.washington.edu/sguy/ibd_relatedness.html
interpolate <- function(ibd_bp, chromgpos) {

  tmp_set1 <- 
    chromgpos %>%
    dplyr::filter(bp < ibd_bp)
  tmp_set2 <- 
    chromgpos %>%
    dplyr::filter(bp > ibd_bp)

  if(nrow(tmp_set1) > 0 && nrow(tmp_set2) > 0) {
    tmp_bp1 <- max(tmp_set1$bp)
    tmp_bp2 <- min(tmp_set2$bp)
  } else if (nrow(tmp_set1) > 0) {
    tmp_bp2 <- max(tmp_set1$bp) 
    tmp_set1 <- 
      tmp_set1 %>%
      dplyr::filter(bp != tmp_bp2)
    tmp_bp1 <- max(tmp_set1$bp)
  } else {
    tmp_bp1 <- min(tmp_set2$bp)
    tmp_set2 <-
      tmp_set2 %>%
      dplyr::filter(bp != tmp_bp1)
    tmp_bp2 <- min(tmp_set2$bp)
  }

  if (tmp_bp1 == tmp_bp2) {
    res <- 
      chromgpos %>%
      filter(bp == tmp_bp1) %>%
      pull(value)
    return(res)
  } else {

    len1 <- 
      chromgpos %>%
      filter(bp == tmp_bp1) %>%
      pull(value)

    len2 <-
      chromgpos %>%
      filter(bp == tmp_bp2) %>%
      pull(value)

    res <- len1+(len2-len1)*(ibd_bp - tmp_bp1)/(tmp_bp2-tmp_bp1)
    return(res)
  }
}

## function to loop over each chromosome and compute total length
## use 4 times that total length as denominator for pairwise shared length to get kinship coefficient
ibd2kin <- function(ibd_data, plinkmap) {

  ## get the min base pair for each chr
  ibd_min_bp <-
    ibd_data %>%
    group_by(chr) %>%
    summarise(bp = min(start), .groups = "drop")

  ## get the max base pair for each chromsome
  ibd_max_bp <-
    ibd_data  %>%
    group_by(chr) %>%
    summarise(bp = max(end), .groups = "drop")

  ## loop over each chromosome to get length
  ## this will be summed up and serve as denominator for kinship coefficient calculations
  totalchromlength <- vector()
  for(i in 1:length(unique(ibd_min_bp$chr))) {

    chrom <- unique(ibd_min_bp$chr)[i]

    ## these steps just get the min/max bp and map data for the given chromosome
    tmp_ibd_min_bp <-
      ibd_min_bp %>%
      filter(chr == chrom)

    tmp_ibd_max_bp <-
      ibd_max_bp %>%
      filter(chr == chrom)

    tmp_chromgpos <-
      plinkmap %>%
      filter(chr == chrom)

    ## use interpolate to get centimorgan lengths
    mincm <- interpolate(ibd_bp = tmp_ibd_min_bp$bp, chromgpos = tmp_chromgpos)
    maxcm <- interpolate(ibd_bp = tmp_ibd_max_bp$bp, chromgpos = tmp_chromgpos)

    ## create chrom length and append to vector
    tmp_totalchromlength <- maxcm - mincm
    totalchromlength <- c(totalchromlength, tmp_totalchromlength)
  }

  ## use ibd data and total chromlength as denominator above to get the kinship value
  ibd_data %>% 
    group_by(id1,id2) %>% 
    summarise(totallength = sum(length), .groups = "drop") %>%
    mutate(kinship = totallength/(4*sum(totalchromlength)))
}

now using the same data you shared above calculate the kinship coefficient from the IBD segments and compare to theoretical kinship from "truth" data:

## read in fam
truth <- read_fam("5gens-everyone.fam.gz") %>%
  fam2ped() %>%
  mutate(x=map(ped, ped2kinpair)) %>%
  select(x) %>%
  unnest(x) %>%
  mutate(d.truth=kin2degree(k, max_degree=8)) %>%
  rename(k.truth=k) %>%
  print()

## read in seg and convert to "ibd" output (i.e. similar to hap-ibd output)
seg <- read_tsv("5gens.seg.gz", 
                col_types="ccciicddd",
                col_names=c("id1", "id2", "chr", "pstart", "pend", "type", "gstart", "gend", "glength"))

seglikeibd <-
  seg %>%
  filter(type == "IBD1" | type == "IBD2") %>%
  select(id1, id2, chr, start = pstart, end = pend, length = glength)

## read in map file
plinkmap <- 
  read_delim("/data/projects/skate/data/hapmap/plink.allchr.GRCh37.map", delim = "\t", col_names = FALSE) %>% 
  select(chr = 1, value = 3, bp = 4)

## join up with truth data and see how we did
joined <-
  ibd2kin(ibd_data = seglikeibd, plinkmap = plinkmap) %>%
  mutate(degree = kin2degree(kinship)) %>%
  left_join(truth) %>%
  mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated"))))

## look at first 10 obs
joined %>%
  head(10) %>%
  knitr::kable()
id1 id2 totallength kinship degree k.truth d.truth
5gens1_g1-b1-i1 5gens1_g2-b1-i1 3346.2989 0.2363204 1 0.2500 1
5gens1_g1-b1-i1 5gens1_g2-b1-i2 3346.2989 0.2363204 1 0.2500 1
5gens1_g1-b1-i1 5gens1_g2-b2-i1 3346.2989 0.2363204 1 0.2500 1
5gens1_g1-b1-i1 5gens1_g2-b2-i2 3346.2989 0.2363204 1 0.2500 1
5gens1_g1-b1-i1 5gens1_g3-b1-i1 1823.3637 0.1287685 2 0.1250 2
5gens1_g1-b1-i1 5gens1_g3-b1-i2 1846.1699 0.1303791 2 0.1250 2
5gens1_g1-b1-i1 5gens1_g3-b2-i1 1815.5866 0.1282193 2 0.1250 2
5gens1_g1-b1-i1 5gens1_g3-b2-i2 1744.5802 0.1232047 2 0.1250 2
5gens1_g1-b1-i1 5gens1_g4-b1-i1 887.1510 0.0626519 3 0.0625 3
5gens1_g1-b1-i1 5gens1_g4-b1-i2 876.6621 0.0619111 3 0.0625 3

questions

  1. what else do we need to do to confirm that this is "working"?
  2. is the "IBD1 or IBD2" logic appropriate to add up segments from the ped-sim .seg file (cc @genignored ) ? what about the HBD segments that are in the .seg file ... ignore those?
  3. when/how do we want to pull the helpers into skater?
stephenturner commented 3 years ago

Great progress!

Suggested As to Qs also @genignored please take a look.

  1. Let's get a density plot similar to https://github.com/signaturescience/skatebench/blob/c5748f7d594f6734c708608bc281bf9a5e24459e/scratch/eda-akt-ibis/eda-akt-ibis.Rmd#L308-L315, plotting kinship, or a boxplot with kinship~factor(k.truth) or similar. I would expect kinship to be somewhat normally distributed and centered around its respective k.truth.
  2. @genignored
  3. Now/PR - put the code in R/kinfromseg.R or whatever. Thing I don't know here is what to do about the map. I don't want this package to go from <1MB to >75MB. Compressed it's still 23mb. Could we put the compressed concatenated genetic map into a different repo, make it public, then add a function that'll download the genetic map from a specified location, defaulting to raw.github.whatever/master/plink.allchr.GRCh37.map?
vpnagraj commented 3 years ago
  1. Let's get a density plot ...

good idea. i'll put that together.

  1. ... Thing I don't know here is what to do about the map ...

how about this. we put together a read_map function (can read in the hapmap concatenated map) to live in skater and then document how to retrieve / prep the map file. im a little hesitant to redistribute the map file in a public repo. but maybe ok? would need to read fine print on license

stephenturner commented 3 years ago

Good call on latter. Further thoughts:

We could easily provide instructions for downloading/concatenating/formatting, but having it already formatted/concatenated would be nice.

stephenturner commented 3 years ago

closed via #43

stephenturner commented 3 years ago

Reopening based on discussion in #43, especially in https://github.com/signaturescience/skater/pull/43#issuecomment-834485968.

@genignored's screenshot from the Ramstetter paper:

image

So I think this is a fairly important distinction here, and I think this could explain several of the issues we're seeing.

First, the bimodal distribution of kinship coefficients we see simulated in 1st degree relatives. Parent offspring (PO) and Full Sibling (FS) should both have kinship coefficient of 0.25. That's because PO share 1 allele IBD at 100 % of the genome (so 1/4=.25), while FS share on average 25% IBD2, 50% IBD1, and 25% IBD0 (so, .25/2+.5/4 = .25).

If the IBD2 proportion is being divided by 4 the same way the IBD1 proportion (correctly) is being divided by 4, you'd get .25/4+.5/4=.1875, which looks exactly where the FS pairs are (erroneously) falling below:

image

This also explains why we see this pattern comparing detected kinship coefficient using KING (Y axis) against simulated kinship by running the function in #43 on the pedsim outputs. KING correctly sees that the kinship coefficient is .25 for both PO and FS relationships, but the FS relationship has a lower simulated kinship:

image

Looking at IBIS kinship (y) versus simulated kinship (x) the pattern isn't as apparent at zero error, but it's there. The "dot" at the upper right are the PO pairs, while the red diagonal leading up to it are the FSs? When you start to introduce more error you see the same pattern that you saw above with KING - the separation in detected FS versus simulated FS kinship coefficient.

image

Finally, this explains the left-shift of the distribution to where the modes aught to be. Besides full sibling relationships, most other relationships share vanishingly little IBD2, if any. It's my hunch that it's this this tiny little bit of IBD2 proportion that's being divided by 4 instead of being divided by 2 that's very slightly left-shifting the distributions we see.

If I were to guess where this is happening in the code, it's here:

https://github.com/signaturescience/skater/blob/ad89f9a6506c63343f119a20c766d69b1f07d3d8/R/ibd.R#L96

From the description in Ramstetter copied below once more, k1ij and k2ij are the sum of the genetic lengths shared IBD1 and IBD2, respectively, divided by the total genetic lengths. Kinship coefficient is k1ij/4 + k2ij/2, but the line of code above is suggesting that there is no distinction being made between ibd1 and ibd2 proportions, which should be handled differently. That is, I think IBD2 proportions need to be summed separately then replace the 4 with a 2, above, then the IBD 1 proportions should use the /4 as above. Then you add those together, which should give you a correct k.

image

@vpnagraj can you take a crack at this?

vpnagraj commented 3 years ago

agreed. this is an important thread to disentangle. ive started looking into this but will need help from you @stephenturner and @genignored to square exactly what is going on here.

first thought is to return to the skater vs IBDKin comparison. the example above is using the simulated .seg from ped-sim. clearly something wrong there ... both in terms of the left shift AND the bimodal first degree. but when we look at how skater::ibd2kin() works with hap-ibd segments, i am not seeing the same problem (code at https://github.com/signaturescience/skater/pull/43#issuecomment-829614886):

skater_v_ibdkin

why?

that makes me a little hesitant to crack into ibd2kin() and start changing the algorithm. that said, im pretty sure i follow your logic and the note from the ramstetter paper and agree we could handle IBD1 and IBD2 differently.

that will require edits to read_ibd() and ibd2kin(). could get messy. so how do we do that in an exploratory way? well, one thought is to revisit the density plot (code at https://github.com/signaturescience/skater/pull/43#issuecomment-831468766):

density_plot1

what if we tried to recreate that but by running the algorithm separately on IBD1 and IBD2 ... then summing the kinship coefficients together. requires some redundant, ugly code (see below the image):

desnity_plot2

is this working?!! the bimodal distribution shifts to the right (a few with k > 0.25) ... but overall first degree is still left-shifted.

other thoughts?

ibd1_seg <-
  readr::read_tsv(system.file("extdata", "GBR.sim.seg.gz", package="skater"),
                  col_types="ccciicddd",
                  col_names=c("id1", "id2", "chr", "pstart", "pend", "type", "gstart", "gend", "glength")) %>%
  dplyr::filter(type == "IBD1") %>%
  dplyr::select(id1, id2, chr, start = pstart, end = pend, length = glength) %>%
  ## make sure ids are ordered
  arrange_ids(id1,id2)

ibd2_seg <-
  readr::read_tsv(system.file("extdata", "GBR.sim.seg.gz", package="skater"),
                  col_types="ccciicddd",
                  col_names=c("id1", "id2", "chr", "pstart", "pend", "type", "gstart", "gend", "glength")) %>%
  dplyr::filter(type == "IBD2") %>%
  dplyr::select(id1, id2, chr, start = pstart, end = pend, length = glength) %>%
  ## make sure ids are ordered
  arrange_ids(id1,id2)

## JUST EXPLORING HERE
## IN PRACTICE THIS LOGIC WOULD LIVE IN ibd2kin AS OPPOSED TO ITS OWN FUN
ibd2kin_ibd2 <- function(.ibd_data, .map) {
  ## get the min base pair for each chr
  ibd_min_bp <-
    .ibd_data %>%
    dplyr::group_by(chr) %>%
    dplyr::summarise(bp = min(start), .groups = "drop")

  ## get the max base pair for each chromsome
  ibd_max_bp <-
    .ibd_data  %>%
    dplyr::group_by(chr) %>%
    dplyr::summarise(bp = max(end), .groups = "drop")

  ## loop over each chromosome to get length
  ## this will be summed up and serve as denominator for kinship coefficient calculations

  ## some preprocessing on map for computational efficiency ...
  ## group_by and group_split on chr
  ## do that separately so you can get the chr names for indexing
  tmp <-
    .map %>%
    dplyr::group_by(chr)

  tmp_l <-
    tmp %>%
    dplyr::group_split() %>%
    ## get name for each list element from corresponding group key
    ## NOTE: group_key returns a tibble so you need to get chr column out as vec
    purrr::set_names(dplyr::group_keys(tmp)$chr)

  ## cleanup
  rm(tmp)

  totalchromlength <- vector()
  for(i in 1:length(unique(ibd_min_bp$chr))) {

    chrom <- unique(ibd_min_bp$chr)[i]

    ## these steps just get the min/max bp and map data for the given chromosome
    tmp_ibd_min_bp <-
      ibd_min_bp %>%
      dplyr::filter(chr == chrom)

    tmp_ibd_max_bp <-
      ibd_max_bp %>%
      dplyr::filter(chr == chrom)

    tmp_chromgpos <- tmp_l[[chrom]]

    ## use interpolate to get centimorgan lengths
    mincm <- skater:::interpolate(ibd_bp = tmp_ibd_min_bp$bp, chromgpos = tmp_chromgpos)
    maxcm <- skater:::interpolate(ibd_bp = tmp_ibd_max_bp$bp, chromgpos = tmp_chromgpos)

    ## create chrom length and append to vector
    tmp_totalchromlength <- maxcm - mincm
    totalchromlength <- c(totalchromlength, tmp_totalchromlength)
  }

  ## use ibd data and total chromlength as denominator above to get the kinship value
  ## NOTE: the total length of genome is multiplied by 4 to get the kinship coefficient "units"
  ## ... i.e. the probability that a randomly selected allele will be shared between two individuals
  .ibd_data %>%
    dplyr::group_by(id1,id2) %>%
    dplyr::summarise(totallength = sum(length), .groups = "drop") %>%
    dplyr::mutate(kinship = totallength/(2*sum(totalchromlength))) %>%
    dplyr::select(id1,id2,kinship) %>%
    arrange_ids(id1, id2)
}

ibd1_dat <- ibd2kin(.ibd_data = ibd1_seg, .map = plinkmap) 
ibd2_dat <- ibd2kin_ibd2(.ibd_data = ibd2_seg, .map = plinkmap)

res <-
  bind_rows(ibd1_dat,ibd2_dat) %>%
  group_by(id1,id2) %>%
  summarise(kinship = sum(kinship)) %>%
  mutate(degree = kin2degree(kinship)) %>%
  left_join(truth) %>%
  mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated")))) 

res %>%
  ggplot(aes(kinship)) +
  geom_density(aes(fill = d.truth), alpha = 0.5) +
  geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) + 
  geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=x, col=k.theoretical), lty=1, alpha=.6) +
  scale_color_manual(values=scales::hue_pal()(degree_resolution+1)[1:degree_resolution]) + 
  theme_classic() + 
  scale_x_continuous(breaks=.5/2^(1:degree_resolution)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme_classic() +
  labs(x = "k")
stephenturner commented 3 years ago

Ok, if I see what you're doing here, you're separately processing ibd2 and ibd1 segments, where the new function is just a copy/paste of the original but using a 2 in the denominator instead of a 4. You're calculating the sum of ibd{1,2} over the total genomic length, separately, then adding them together. This sound correct?

If so, I think first, it seems MUCH better that we've got that second FS mode gone, where all its density is shifted into the primary mode that's just to the left of .25. This is good.

Second, I still really believe that part of the shift might be due to the difference in simulating with SS versus detection with SS. In the vein of JBSH's "an ounce of algebra is worth a ton of verbal argument" let's do the experiment here. @genignored - can you generate a single zero-error zero missing simulation, outputting the fam file and simulated IBD segments (VCF optional), using the same sex-averaged genetic map we're using for detection in IBIS/hap-IBD? NOT the average of the SS map, but the plink/hapmap SA map we're using everywhere else, including in the ibd2kin function?

Let's see if this fixes the left shift discrepancy between simulation and detection.

vpnagraj commented 3 years ago

Ok, if I see what you're doing here, you're separately processing ibd2 and ibd1 segments, where the new function is just a copy/paste of the original but using a 2 in the denominator instead of a 4. You're calculating the sum of ibd{1,2} over the total genomic length, separately, then adding them together. This sound correct?

yep. exactly that. i was trying to avoid refactoring the read_ibd AND ibd2kin code until we can prove that this will fix the problem. hence all the copy/paste for now.

genignored commented 3 years ago

@vpnagraj, in response to Stephen's request, where is a copy of the SA map file, and is it in the same format as the sex-specific map used by ped-sim?

vpnagraj commented 3 years ago

@genignored its the same one you are pointing to in the snakefile:

https://github.com/signaturescience/skatebench/blob/mscholz/snakefiles/parameter_sweep.config#L10

not in the same format as the SS map!

FWIW i think the average of the SS map will get us close to what we need (see full code at bottom of comment):

density_plot3

im going to start updating logic in read_ibd and ibd2kin to conditionally treat IBD1 and IBD2 with different denominators when calculating kinship coefficient.

plinkmap <- 
  read_delim("/data/projects/skate/data/pedsim_files/refined_mf.simmap", delim = "\t") %>%
  mutate(cm = (male_cM + female_cM) / 2) %>%
  select(chr = 1, value = cm, bp = pos)

ibd1_seg <-
  readr::read_tsv(system.file("extdata", "GBR.sim.seg.gz", package="skater"),
                  col_types="ccciicddd",
                  col_names=c("id1", "id2", "chr", "pstart", "pend", "type", "gstart", "gend", "glength")) %>%
  dplyr::filter(type == "IBD1") %>%
  dplyr::select(id1, id2, chr, start = pstart, end = pend, length = glength) %>%
  ## make sure ids are ordered
  arrange_ids(id1,id2)

ibd2_seg <-
  readr::read_tsv(system.file("extdata", "GBR.sim.seg.gz", package="skater"),
                  col_types="ccciicddd",
                  col_names=c("id1", "id2", "chr", "pstart", "pend", "type", "gstart", "gend", "glength")) %>%
  dplyr::filter(type == "IBD2") %>%
  dplyr::select(id1, id2, chr, start = pstart, end = pend, length = glength) %>%
  ## make sure ids are ordered
  arrange_ids(id1,id2)

## JUST EXPLORING HERE
## IN PRACTICE THIS LOGIC WOULD LIVE IN ibd2kin AS OPPOSED TO ITS OWN FUN
ibd2kin_ibd2 <- function(.ibd_data, .map) {
  ## get the min base pair for each chr
  ibd_min_bp <-
    .ibd_data %>%
    dplyr::group_by(chr) %>%
    dplyr::summarise(bp = min(start), .groups = "drop")

  ## get the max base pair for each chromsome
  ibd_max_bp <-
    .ibd_data  %>%
    dplyr::group_by(chr) %>%
    dplyr::summarise(bp = max(end), .groups = "drop")

  ## loop over each chromosome to get length
  ## this will be summed up and serve as denominator for kinship coefficient calculations

  ## some preprocessing on map for computational efficiency ...
  ## group_by and group_split on chr
  ## do that separately so you can get the chr names for indexing
  tmp <-
    .map %>%
    dplyr::group_by(chr)

  tmp_l <-
    tmp %>%
    dplyr::group_split() %>%
    ## get name for each list element from corresponding group key
    ## NOTE: group_key returns a tibble so you need to get chr column out as vec
    purrr::set_names(dplyr::group_keys(tmp)$chr)

  ## cleanup
  rm(tmp)

  totalchromlength <- vector()
  for(i in 1:length(unique(ibd_min_bp$chr))) {

    chrom <- unique(ibd_min_bp$chr)[i]

    ## these steps just get the min/max bp and map data for the given chromosome
    tmp_ibd_min_bp <-
      ibd_min_bp %>%
      dplyr::filter(chr == chrom)

    tmp_ibd_max_bp <-
      ibd_max_bp %>%
      dplyr::filter(chr == chrom)

    tmp_chromgpos <- tmp_l[[chrom]]

    ## use interpolate to get centimorgan lengths
    mincm <- skater:::interpolate(ibd_bp = tmp_ibd_min_bp$bp, chromgpos = tmp_chromgpos)
    maxcm <- skater:::interpolate(ibd_bp = tmp_ibd_max_bp$bp, chromgpos = tmp_chromgpos)

    ## create chrom length and append to vector
    tmp_totalchromlength <- maxcm - mincm
    totalchromlength <- c(totalchromlength, tmp_totalchromlength)
  }

  ## use ibd data and total chromlength as denominator above to get the kinship value
  ## NOTE: the total length of genome is multiplied by 4 to get the kinship coefficient "units"
  ## ... i.e. the probability that a randomly selected allele will be shared between two individuals
  .ibd_data %>%
    dplyr::group_by(id1,id2) %>%
    dplyr::summarise(totallength = sum(length), .groups = "drop") %>%
    dplyr::mutate(kinship = totallength/(2*sum(totalchromlength))) %>%
    dplyr::select(id1,id2,kinship) %>%
    arrange_ids(id1, id2)
}

ibd1_dat <- ibd2kin(.ibd_data = ibd1_seg, .map = plinkmap) 
ibd2_dat <- ibd2kin_ibd2(.ibd_data = ibd2_seg, .map = plinkmap)

res <-
  bind_rows(ibd1_dat,ibd2_dat) %>%
  group_by(id1,id2) %>%
  summarise(kinship = sum(kinship)) %>%
  mutate(degree = kin2degree(kinship)) %>%
  left_join(truth) %>%
  mutate(across(starts_with("d"), ~factor(replace_na(., "Unrelated")))) 

res %>%
  ggplot(aes(kinship)) +
  geom_density(aes(fill = d.truth), alpha = 0.5) +
  geom_vline(xintercept=2^(-seq(3,(degree_resolution*2+1+2),2)/2), lty=3) + 
  geom_vline(data=tibble(x=.5/2^(1:degree_resolution), k.theoretical = factor(rev(x))), aes(xintercept=x, col=k.theoretical), lty=1, alpha=.6) +
  scale_color_manual(values=scales::hue_pal()(degree_resolution+1)[1:degree_resolution]) + 
  theme_classic() + 
  scale_x_continuous(breaks=.5/2^(1:degree_resolution)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme_classic() +
  labs(x = "k")
genignored commented 3 years ago

For simulation with hapmap map, there are 2 changes: 1) Reformatted plink.allchromosomes.GRCh37.map to correct format (chromosome\tposition\tcM) 2) Maps w/o sex specific rates cannot use interference model (ped-sim), so using --pois instead

Simulation complete:

$ pwd
/data/projects/skate/data/testing/hapmap_simulation

$ped-sim --err_rate 0 --miss_rate=0 --fam --bp -i ASW-specific.vcf.gz -m pedsim.plink.reformatted.GRCh37.map -o ASW-hapmapMap-5GHS-er_0-ms_0-hom_0 -d 5GHS.def --pois

$ ls ASW-hapmapMap-5GHS-er_0-ms_0-hom_0*
ASW-hapmapMap-5GHS-er_0-ms_0-hom_0.bp            ASW-hapmapMap-5GHS-er_0-ms_0-hom_0.log  ASW-hapmapMap-5GHS-er_0-ms_0-hom_0.vcf.gz
ASW-hapmapMap-5GHS-er_0-ms_0-hom_0-everyone.fam  ASW-hapmapMap-5GHS-er_0-ms_0-hom_0.seg
vpnagraj commented 3 years ago

fix for this issue is proposed in #45 .

comment at https://github.com/signaturescience/skater/pull/45#issuecomment-839140316 walks through an example of the proposed API