Extend documentation to specify file formats of required input files

erenelci commented 2 years ago

Dear Philip, thanks for the great paper and providing the code. I'm having difficulties running the code on our infrastructure. I could solve that if you could document precisely the required filenames (csvs and txts) and the column names and formats to be contained therein. This would allow me to download the fields using our infrastructure and then place them in the right format into the directories required.

E.g. which files precisely need to be stored in the directory specified by visit_path_raw if one is unable to run visit_subset (but is able to download the fields and the GP data separately)?

Thank you.

philipdarke commented 2 years ago

Thanks for getting in touch.

The process for most users will be:

1. download the bulk data extract (ukbXXXXX.enc) from the Showcase
2. decrypt it using the ukbunpack utility, generating ukbXXXXX.enc_ukb
3. convert it to ukbXXXXX.csv using the ukbconv utility

The visit_raw_path variable in file_paths.R should point to the location of ukbXXXXX.csv. This file can be very large, so 01_subset_visit_data.R extracts only the required columns and writes them to data/visit_data.csv (the visit_path variable in file_paths.R).

The above is almost certainly the easiest way forward. You can re-download the data for your UK Biobank application and you may wish to explore doing so.

However, if you need to prepare data/visit_data.csv manually, extract all instances and arrays for the fields in the README from your database and write them to a .csv file in "wide" format. You need to replicate how UK Biobank stores the data. The column headings should be in the format [field]-[instance].[array] where instance is the visit 0-3 (some participants have repeated measurements from multiple visits) and array is the index for the observation (some fields have multiple values recorded at each visit). You also need the eid field. Each row should correspond to a participant.

For example, field 48 (waist circumference) corresponds to columns "48-0.0", "48-1.0", "48-2.0" and "48-3.0", as a single value (array 0) was recorded at each of visits 0, 1, 2 and 3. Self-reported medical histories have multiple conditions reported at multiple visits, so field 20002 has over 100 columns (something like "20002-0.0" to "20002-3.33").
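
If your database hands you the data in "long" form, the wide layout described above can be built along the following lines. This is only a minimal sketch using the Python standard library: the (eid, field, instance, array, value) record shape and the function names column_name, long_to_wide and write_wide_csv are assumptions made for illustration and are not part of the repository.

```python
import csv
import io


def column_name(field, instance, array):
    """Format a heading as [field]-[instance].[array]."""
    return f"{field}-{instance}.{array}"


def long_to_wide(records):
    """Pivot (eid, field, instance, array, value) records to one row per participant.

    Returns (header, rows). Measurements a participant does not have
    are left as empty strings.
    """
    keys = set()
    by_eid = {}
    for eid, field, instance, array, value in records:
        key = (int(field), int(instance), int(array))
        keys.add(key)
        by_eid.setdefault(eid, {})[key] = value
    # Sort numerically by field, then instance, then array.
    ordered = sorted(keys)
    header = ["eid"] + [column_name(*key) for key in ordered]
    rows = [
        [eid] + [values.get(key, "") for key in ordered]
        for eid, values in sorted(by_eid.items())
    ]
    return header, rows


def write_wide_csv(records, handle):
    """Write the pivoted table as CSV to an open text handle."""
    header, rows = long_to_wide(records)
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)
```

To produce the file, open data/visit_data.csv with newline="" and pass the handle to write_wide_csv together with the records from your own export.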

As a sense check, I would expect data/visit_data.csv to have up to around 500,000 rows and 1,470 columns, but the exact number will depend on whether you have requested all instances of each field in your application. The software should work regardless.
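
A quick way to run that sense check on a manually prepared file is sketched below. It counts rows and columns and lists any heading that is neither eid nor in the [field]-[instance].[array] format. The function name check_visit_data and the returned dictionary keys are illustrative assumptions, not something the repository provides.

```python
import csv
import re

# A heading such as "48-0.0" or "20002-3.33".
HEADING = re.compile(r"\d+-\d+\.\d+")


def check_visit_data(lines):
    """Summarise a wide-format CSV given as an open file or iterable of lines."""
    reader = csv.reader(lines)
    header = next(reader)
    n_rows = sum(1 for _ in reader)  # streams the file; nothing is held in memory
    bad = [h for h in header if h != "eid" and not HEADING.fullmatch(h)]
    return {
        "rows": n_rows,
        "columns": len(header),
        "has_eid": "eid" in header,
        "bad_headings": bad,
    }
```

For example, open data/visit_data.csv with newline="" and pass the handle to check_visit_data. An empty bad_headings list and has_eid set to True suggest the headings follow the expected layout.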

I will aim to expand the documentation in due course. Feedback is always welcome so please let me know how you get on.

philipdarke commented 2 years ago

A follow-up point.

I'm unsure what your infrastructure limitations are, but if you are unable to run 01_subset_visit_data.R, I have added a more memory-efficient implementation in Python. See 01_prepare_data/01_subset_visit_data.py. This uses pandas to read ukbXXXXX.csv in chunks.
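
For readers unfamiliar with the approach, the general idea of chunked column subsetting with pandas looks roughly like this. It is a sketch of the technique only, not the contents of 01_subset_visit_data.py: the function names select_columns and subset_csv, the chunk size and the way fields are passed in are all assumptions.

```python
import pandas as pd


def select_columns(all_columns, fields):
    """Keep 'eid' plus every column whose [field] prefix is in `fields`."""
    wanted = {str(f) for f in fields}
    return [
        c for c in all_columns
        if c == "eid" or c.split("-", 1)[0] in wanted
    ]


def subset_csv(src, dest, fields, chunksize=10_000):
    """Copy only the required columns from src to dest, one chunk at a time."""
    # Read just the header row to find out which columns exist.
    header = pd.read_csv(src, nrows=0).columns
    keep = select_columns(header, fields)
    # dtype=str avoids type inference, so values are written back unchanged.
    chunks = pd.read_csv(src, usecols=keep, dtype=str, chunksize=chunksize)
    for i, chunk in enumerate(chunks):
        chunk.to_csv(
            dest,
            mode="w" if i == 0 else "a",  # overwrite once, then append
            header=(i == 0),              # write the headings only once
            index=False,
        )
    return keep
```

Because only chunksize rows of the selected columns are held in memory at any time, peak memory use depends on the chunk size rather than on the size of ukbXXXXX.csv.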

philipdarke / ukbb-ehr-data

Extend documentation to specify file formats of required input files #2