signaturescience / skater

SKATE R Utilities
https://signaturescience.github.io/skater/
Other
9 stars 4 forks source link

read_ibd dies on empty file #44

Closed genignored closed 3 years ago

genignored commented 3 years ago

when hap-ibd does not detect any ibd segments, it outputs an empty .ibd.gz file.

Currently seg2kin/read_ibd() dies on empty files:

hap-ibd/GBR-gsa37-5GHS-er_0.2-ms_0-hom_0.out.ibd.gz
Error: Can't subset columns that don't exist.
✖ Locations 1, 3, 5, 6, 7, etc. don't exist.
ℹ There are only 0 columns.
Timing stopped at: 4.321 4.421 66.47
Execution halted

So, 1) The exception needs to be caught (if no lines, or no data on line, skip line?) 2) does .relkin need to have any lines for other analysis to work, if so, do we need to start also including fam file and outputting all relationships?

vpnagraj commented 3 years ago

ack. good catch @genignored

yes, the long-term fix here will be to edit read_ibd to check for an empty tibble after the call to read_delim internally (see https://github.com/signaturescience/skater/blob/master/R/read.R#L178-L181). if empty, then dont proceed to select %>% arrange_ids etc.

in the short term we can get around this with the following (based on code at https://github.com/signaturescience/skatesim/blob/mscholz/snakefiles/calculate_kin_skate.R):

library(tidyverse)
library(skater)
library(furrr)

fps <- snakemake@input[['sims']]

plinkmap <- read_map(snakemake@input[['map']])

seg2kin <- function(fp, .map) {
  message(fp)
  ## check for content in ibd file
  ## is there at least 1 row data in the file ?!
  ind <- read_delim(fp, delim = "\t") %>% nrow() > 0
  if(ind) {
    outpath <- paste0(tools::file_path_sans_ext(fp), snake@params[['suffix']])
    seglikeibd <- read_ibd(fp, source = snakemake@params[['source']])
    ## just id1,id2,kin
    ibd2kin(.ibd_data = seglikeibd, .map = .map) %>%
      write_csv(., outpath)
  } else {
    message("IBD segment file appears empty ... skipping.")
  }
}
plan(multisession, workers = 24)
system.time({
  future_map(fps, seg2kin, .map = plinkmap)
})
stephenturner commented 3 years ago

would a mod to read to check for nrow>0 work? As for results I think I'd prefer to have all (eg 432) relkin results files, even if there's a header with zero rows. I can modify this, but around here I'm ensuring that if I read 432 fam files, that I also read 432 akt/king results, and I read 432 ibis results.

https://github.com/signaturescience/skatebench/blob/e34ad54d71a0d89055af76e5f5ae9143ff0e5ec1/scratch/eda-akt-ibis/eda-akt-ibis2.Rmd#L146-L150

It's okay if those files are empty (header only), but not okay if they're not there at all.

vpnagraj commented 3 years ago

45 will bring in an update to read_ibd() to issue a message if the inferred segment IBD.gz file is empty and return a tibble filled with NA at each column:

The hapibd input appears empty. Creating empty IBD tibble.
# A tibble: 1 x 6
  id1   id2   chr   start end   length
  <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> 
1 NA    NA    NA    NA    NA    NA    

let me know if that behavior works or if you all would prefer a different behavior (i.e. returning a truly empty tibble with NULL values)

vpnagraj commented 3 years ago

as of 41ffc7a the above now returns:

The hapibd input appears empty. Creating empty IBD tibble.
# A tibble: 0 x 6
# … with 6 variables: id1 <chr>, id2 <chr>, chr <chr>, start <chr>, end <chr>, length <chr>