rmgpanw / ukbwranglr

R package for UK Biobank data wrangling.
https://rmgpanw.github.io/ukbwranglr/

Filter according to eid in read_ukb() #7

Closed dwuab closed 7 months ago

dwuab commented 8 months ago

Suppose we already have a small list of eids of interest. It would be inefficient to first read all data fields for all eids and only then retain the eids in the list. Could an option be added to read_ukb() so that it returns only the data belonging to a given list of eids?

rmgpanw commented 7 months ago

Hi @dwuab, many thanks for reaching out and apologies for my delayed response.

Unfortunately I don't see a way to achieve this. read_ukb() is essentially a wrapper around data.table::fread(), which does include a cmd argument that could be used to pre-process the file with a command line tool like awk. For example, awk 'BEGIN {FS = "\t"} NR == FNR {values[$1] = 1; next} $1 in values' values.txt data.txt filters data.txt for rows whose first column matches any of the values in values.txt.
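To make the awk approach concrete, here is a minimal sketch on toy files. The file contents, the temporary directory, and the output name filtered.txt are all made up for illustration. The `FNR == 1 ||` condition is an addition to the command above, on the assumption that the data file has a header row you want to keep.

```shell
#!/bin/sh
# Toy stand-ins for a UK Biobank export; file names and contents are made up.
tmp=$(mktemp -d)

# values.txt: one eid of interest per line.
printf '1000002\n1000004\n' > "$tmp/values.txt"

# data.txt: tab-separated, header row first, eid in column 1.
printf 'eid\t31-0.0\n1000001\t0\n1000002\t1\n1000003\t0\n1000004\t1\n' > "$tmp/data.txt"

# While reading the first file (NR == FNR), remember each eid.
# While reading the second file, print rows whose first column was remembered;
# FNR == 1 additionally keeps the header row (an assumption, not in the original command).
awk 'BEGIN {FS = "\t"} NR == FNR {values[$1] = 1; next} FNR == 1 || $1 in values' \
  "$tmp/values.txt" "$tmp/data.txt" > "$tmp/filtered.txt"

cat "$tmp/filtered.txt"
```

On the toy data this should leave the header plus the rows for eids 1000002 and 1000004. The filtered file could then be read into R as a second step, or the same awk command could be supplied through the cmd argument of data.table::fread() mentioned above.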

However, (i) this would still be a two-step process when reading into R (filter rows with a command line tool, then read the result into R), and (ii) it would not be portable across operating systems (it wouldn't work on Windows).

Personally, for each UK Biobank project I create a processed dataset containing only the columns and rows (eids) specific to that project, then save it in a compressed file format (e.g. .rds) for subsequent analyses. I find the targets package very helpful for this.

Hope this is helpful. Thanks for your interest in ukbwranglr!