Closed by genignored 3 years ago
gzipped version of pedigree file is:
/data/projects/skate/data/phs001872/PhenotypeFiles/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt.gz
Failure is no surprise here: this file isn't a PLINK-formatted .fam file. There are a few ways to handle this:
1. Write a new function, read_gapfam() or something, that would specifically read this type of file into the same structure that read_fam() creates (enabling further processing with other skater functions). If this were a standard kind of file, or something we'd be reading in routinely (e.g., over hundreds of simulations), I might advocate harder for this function. But my guess is we'd read this in once, do the post-processing with fam2ped etc., and be done with it.
2. Use grep and awk to strip out the header and print the columns needed, in the order needed, for read_fam() to work unmodified.
3. Use a read_tsv() function and a pipe or two to handle this one-off case.

If you agree with option 3 above, please close this issue @genignored. If you need a hand with this, please let me know.
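For reference, option 1 could look roughly like the sketch below. This is not a skater function: read_gapfam() is a hypothetical name, and the sketch assumes the file has leading "#" metadata lines followed by a tab-delimited header row containing FAMILY_ID, SUBJECT_ID, FATHER, MOTHER and SEX. The toy input, including the EXTRA_ID and EXTRA_LAST column names, is made up purely for illustration.

```r
library(readr)

# Hypothetical helper (not part of skater): read a dbGaP-style pedigree file
# and return six columns in PLINK .fam order.
read_gapfam <- function(file) {
  # Assumption: '#' metadata lines first, then a tab-delimited header row.
  raw <- read_delim(file, delim = "\t", comment = "#",
                    col_types = cols(.default = col_character()))
  tibble::tibble(
    fid      = raw$FAMILY_ID,
    id       = raw$SUBJECT_ID,
    dadid    = raw$FATHER,
    momid    = raw$MOTHER,
    sex      = as.integer(raw$SEX),
    affected = 1L  # placeholder: the file carries no phenotype column here
  )
}

# Toy input with made-up values and made-up extra column names.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "# made-up metadata line",
  "EXTRA_ID\tFAMILY_ID\tSUBJECT_ID\tFATHER\tMOTHER\tSEX\tEXTRA_LAST",
  "101\t1\t1\t2\t4\t2\tNA",
  "102\t1\t2\t0\t0\t1\tNA"
), tmp)

ped <- read_gapfam(tmp)
```

Selecting columns by name rather than by position means any extra leading or trailing columns are simply ignored, at the cost of hard-coding this one file layout.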
This is an interesting discussion. The field count and format are the same as in a .fam file, but the comments are not part of the official spec for PLINK outputs. In other words, if we can strip off the leading lines, we should be able to use read_fam. I'd very much like to have a read function that works for all of the possible inputs. I understand the problem though.
In attempting option 2, I found that after I manually stripped off the header information, pedigree files from dbGaP appear to be delimited by tabs, not spaces. Would you be opposed to updating read_fam to use readr::read_table2() instead of readr::read_delim()? If you don't see an issue with it, I could test and make that pull request easily, and we shouldn't have any other issues.
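A small illustration of the delimiter problem, as a sketch on a made-up two-line file rather than the real dbGaP data. It only uses readr::read_delim() with two different delim values; whether read_fam actually passes a single space as its delimiter is an assumption here.

```r
library(readr)

# Made-up, tab-delimited, header-less pedigree rows.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1\t1\t2\t4\t2\t1", "1\t2\t0\t0\t1\t1"), tmp)

# With a space delimiter nothing is split, so each line lands in a single column.
as_space <- read_delim(tmp, delim = " ", col_names = FALSE,
                       col_types = cols(.default = col_character()))

# With a tab delimiter the six fields are separated as intended.
as_tab <- read_delim(tmp, delim = "\t", col_names = FALSE,
                     col_types = cols(.default = col_character()))

ncol(as_space)
ncol(as_tab)
```

A reader that splits on any run of whitespace, as proposed above, would avoid having to know in advance which of the two delimiters a given file uses.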
This must be a Monday. I could never get the hang of Mondays. The pedigree file has an extra 7th field. Somehow I miscounted. Closing; I will either write a separate function or parse the file manually.
I think it's more than what you describe above. Leading lines are easy to handle by passing ... to the function, where you use comment="#" in the dots, which gets passed to read_delim. Oh, and read_table2 should probably work. But the first column is an additional column (not the fam ID). Also, you'd have to check to see how the last column is handled.
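To make the dots mechanism concrete, here is a minimal sketch. read_fam_sketch() is a hypothetical stand-in, not skater's read_fam(), whose real signature may differ; the input is a made-up, space-delimited, header-less file with one leading comment line.

```r
library(readr)

# Hypothetical stand-in for a .fam reader that forwards extra arguments
# to readr::read_delim() through `...`.
read_fam_sketch <- function(file, ...) {
  read_delim(file, delim = " ",
             col_names = c("fid", "id", "dadid", "momid", "sex", "affected"),
             col_types = "ccccii",
             ...)
}

# Made-up input: one comment line, then two .fam-style rows.
tmp <- tempfile(fileext = ".fam")
writeLines(c("# made-up comment line", "1 1 2 4 2 1", "1 2 0 0 1 1"), tmp)

# comment = "#" travels through the dots and makes read_delim() skip the first line.
fam <- read_fam_sketch(tmp, comment = "#")
```

This only deals with the leading comment lines. The extra first column, the header row and the trailing field in the dbGaP file would still need separate handling.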
This is easy to do as a one-off:
library(readr)
library(dplyr)
file <- "/data/projects/skate/data/phs001872/PhenotypeFiles/phs001872.v1.pht009364.v1.p1.CEPH_Utah_Pedigree.MULTI.txt.gz"
# Skip the leading "#" lines, then keep and rename the columns in .fam order.
# affected is set to a constant 1 for every row.
read_table2(file, comment="#") %>%
  transmute(fid=FAMILY_ID, id=SUBJECT_ID, dadid=FATHER, momid=MOTHER, sex=SEX, affected=1L)
Result:
# A tibble: 603 x 6
     fid    id dadid momid   sex affected
   <dbl> <dbl> <dbl> <dbl> <dbl>    <int>
 1     1     1     2     4     2        1
 2     1     2    10     9     1        1
 3     1     3     2     4     1        1
 4     1     4     8     7     2        1
 5     1     5     2     4     1        1
 6     1     6     2     4     1        1
 7     1     7     0     0     2        1
 8     1     8     0     0     1        1
 9     1     9     0     0     2        1
10     1    10     0     0     1        1
We now have access to several dbGaP datasets. For the Utah Families, the pedigree file is structured as:
which looks to be compatible with plink format. However, read_fam returns problems from reading this structure:
I am unsure whether skipping the commented lines would solve every problem. It may also be a delimiter issue.