signaturescience / skater

SKATE R Utilities
https://signaturescience.github.io/skater/

Alterations to read_fam #35

Closed: genignored closed this issue 3 years ago

genignored commented 3 years ago

We now have access to several dbGaP datasets. For the Utah Families, the pedigree file is structured as:

# Study accession: phs001872.v1.p1
# Table accession: pht009364.v1.p1
# Consent group: All
# Citation instructions: The study accession (phs001872.v1.p1) is used to cite the study and its data tables and documents. The data in this file should be cited using the accession pht009364.v1.p1.
# To cite columns of data within this file, please use the variable (phv#) accessions below:
#
# 1) the table name and the variable (phv#) accessions below; or
# 2) you may cite a variable as phv#.v1.p1

##  phv00417391.v1.p1   phv00417392.v1.p1   phv00417393.v1.p1   phv00417394.v1.p1   phv00417395.v1.p1   phv00417396.v1.p1
dbGaP_Subject_ID    FAMILY_ID   SUBJECT_ID  MOTHER  FATHER  SEX MZ_TWIN_ID
3005310 1   1   4   2   2   
90572   1   2   9   10  1   
[...]

which looks compatible with PLINK format. However, read_fam() reports problems when reading this file:

problems(famUtah)
# A tibble: 621 x 5
     row col      expected   actual     file                                                            
   <int> <chr>    <chr>      <chr>      <chr>                                                           
 1     1 NA       6 columns  4 columns  'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'
 2     2 NA       6 columns  4 columns  'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'
 3     3 NA       6 columns  4 columns  'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'
 4     4 sex      an integer study      'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'
 5     4 affected an integer accession  'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'
 6     4 NA       6 columns  31 columns 'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'
 7     5 sex      an integer of         'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'
 8     5 affected an integer data       'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'
 9     5 NA       6 columns  16 columns 'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'
10     6 NA       6 columns  1 columns  'Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt'

I'm not sure whether skipping the commented lines would solve all of these problems; it may also be a delimiter issue.
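One quick way to test both hypotheses is to count fields per line directly, skipping the commented header. A sketch, using base R; the relative path is the one shown in the problems() output above:

# Hypothetical local path, per the problems() output above
file <- "Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt"

# Tabulate tab-separated field counts per line, ignoring "#" header lines;
# a single dominant count would confirm a tab delimiter and reveal the
# true number of columns
table(count.fields(file, sep = "\t", comment.char = "#"))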

genignored commented 3 years ago

The gzipped version of the pedigree file is at:

/data/projects/skate/data/phs001872/PhenotypeFiles/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt.gz
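Note that readr decompresses .gz files transparently based on the extension, so this path can be passed straight to any of the readers discussed below. A quick peek, as a sketch:

library(readr)

# readr handles the gzip compression automatically
file <- "/data/projects/skate/data/phs001872/PhenotypeFiles/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt.gz"
read_lines(file, n_max = 5)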

stephenturner commented 3 years ago

The failure is no surprise here: this file isn't a PLINK-formatted .fam file. There are a few ways to handle this:

  1. One option would be to write a new function in skater, something like read_gapfam(), that reads this type of file into the same structure that read_fam() creates (enabling further processing with other skater functions). If this were a standard format, or something we'll be reading routinely (e.g., over hundreds of simulations), I might advocate harder for this function. But my guess is we'd read this in once, do the post-processing with fam2ped etc., and be done with it.
  2. Use some unix grep and awk to strip out the header and print the needed columns, in the needed order, so that read_fam() works unmodified (an R rendering is sketched after this list).
  3. (Preferred, I think): don't add to the skater package just yet; just write a regular read_tsv() call and a pipe or two to handle this one-off case.
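For option 2, here is a sketch done in R rather than grep/awk, assuming the file is tab-delimited (as the later discussion suggests); utah.fam is a hypothetical output path:

library(readr)
library(dplyr)

file <- "Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt"

read_tsv(file, comment = "#") %>%
  # Reorder into the fid/id/dadid/momid/sex/affected layout read_fam() expects,
  # dropping dbGaP_Subject_ID and MZ_TWIN_ID
  select(FAMILY_ID, SUBJECT_ID, FATHER, MOTHER, SEX) %>%
  mutate(affected = 1L) %>%
  # Write a headerless, space-delimited file that read_fam() can read unmodified
  write_delim("utah.fam", delim = " ", col_names = FALSE)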

If you agree with #3 above, please close this issue @genignored. If you need a hand with this, please let me know.

genignored commented 3 years ago

This is an interesting discussion. The field count and format are the same as a .fam file's, but the comments are not part of the official spec for PLINK outputs. In other words, if we can strip off the leading lines, we should be able to use read_fam(). I'd very much like to have a read function that works for all of the possible inputs, though I understand the problem.

In attempting option 2, I found that after manually stripping off the header information, the delimiter for pedigree files from dbGaP appears to be tabs, not spaces. Would you be opposed to updating read_fam() to use readr::read_table2() instead of readr::read_delim()? If you don't see an issue with it, I could test and make that pull request easily, and we shouldn't have any other issues.
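For context, the relevant difference between the two readers, as a sketch; utah_pedigree.txt is a hypothetical headerless pedigree file, and the single-space delimiter is an assumption about read_fam()'s current internals:

library(readr)

cols <- c("fid", "id", "dadid", "momid", "sex", "affected")

# Hypothetical headerless, tab-delimited pedigree file
file <- "utah_pedigree.txt"

# With a fixed single-space delimiter, every tab-separated row collapses
# into a single field
read_delim(file, delim = " ", col_names = cols)

# read_table2() treats any run of whitespace (spaces or tabs) as one
# separator, so both layouts parse
read_table2(file, col_names = cols)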

genignored commented 3 years ago

This must be a Monday. I could never get a hang of Mondays. Pedigree file has an extra 7th field. Somehow I miscounted. Closing, will either write separate function, or manually parse.

stephenturner commented 3 years ago

I think it's more than what you describe above. The leading lines are easy to handle by passing ... to the function, with comment="#" in the dots, which gets passed through to read_delim(). And read_table2() should probably work too. But the first column here is an additional column (not the family ID), and you'd also have to check how the last column (MZ_TWIN_ID) is handled.
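A sketch of that pass-through, assuming read_fam() forwards its ... to readr::read_delim() as described above:

library(skater)

# Sketch only: comment = "#" rides along in the dots down to read_delim()
# and strips the dbGaP header lines, but the extra leading dbGaP_Subject_ID
# column still shifts every field one position right, so the values land
# under the wrong names
famUtah <- read_fam("Utah/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt", comment = "#")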

stephenturner commented 3 years ago

Easy to do as a one-off:

library(readr)
library(dplyr)
file <- "/data/projects/skate/data/phs001872/PhenotypeFiles/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt.gz"
# comment="#" drops the dbGaP header lines; read_table2() splits on any whitespace
read_table2(file, comment="#") %>% 
  # Keep the read_fam() columns; this file has no affection status, so set affected=1
  transmute(fid=FAMILY_ID, id=SUBJECT_ID, dadid=FATHER, momid=MOTHER, sex=SEX, affected=1L)

Result

# A tibble: 603 x 6
     fid    id dadid momid   sex affected
   <dbl> <dbl> <dbl> <dbl> <dbl>    <int>
 1     1     1     2     4     2        1
 2     1     2    10     9     1        1
 3     1     3     2     4     1        1
 4     1     4     8     7     2        1
 5     1     5     2     4     1        1
 6     1     6     2     4     1        1
 7     1     7     0     0     2        1
 8     1     8     0     0     1        1
 9     1     9     0     0     2        1
10     1    10     0     0     1        1
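From here the tibble should drop into the rest of the pipeline. A sketch, assuming fam2ped() (the post-processing step mentioned above) accepts this six-column structure:

library(readr)
library(dplyr)
library(skater)

file <- "/data/projects/skate/data/phs001872/PhenotypeFiles/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt.gz"

# Nest the fam-structured tibble into per-family pedigree objects for
# downstream skater functions
peds <- read_table2(file, comment = "#") %>%
  transmute(fid = FAMILY_ID, id = SUBJECT_ID, dadid = FATHER, momid = MOTHER, sex = SEX, affected = 1L) %>%
  fam2ped()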