Title: ukbREST: efficient and streamlined data access for reproducible research of large biobanks
Authors: Milton Pividori and Hae Kyung Im
DOI: https://doi.org/10.1093/bioinformatics/bty925
Im-Lab (http://hakyimlab.org/), Section of Genetic Medicine, Department of Medicine, The University of Chicago.
Center for Translational Data Science (https://ctds.uchicago.edu/), The University of Chicago.
Join our mailing list here: https://groups.google.com/d/forum/ukbrest
Large biobanks, such as UK Biobank with half a million participants, are changing the scale and availability of genotypic and phenotypic data for researchers to ask fundamental questions about the biology of health and disease. The breadth of the UK Biobank data is enabling discoveries at an unprecedented pace. However, this size and complexity pose new challenges to investigators who need to keep the accruing data up to date, comply with potential consent changes, and efficiently and reproducibly extract subsets of the data to answer specific scientific questions. Here we propose a tool called ukbREST designed for the UK Biobank study (easily extensible to other biobanks), which allows authorized users to efficiently retrieve phenotypic and genetic data. It exposes a REST API that makes data highly accessible inside a private and secure network, allowing the data specification in a human readable text format easily shareable with other researchers. These characteristics make ukbREST an important tool to make biobank’s valuable data more readily accessible to the research community and facilitate reproducibility of the analysis, a key aspect of science.
You only need to install ukbREST on a server/computer; clients can connect to it and
make queries using standard tools like curl. The quickest way to get ukbREST is to use
our Docker image, so install Docker and follow the steps below. Once you have installed Docker,
make sure that you have enough disk space (in macOS go to Preferences/Disk and increase the
value). Take a look at the wiki for the general specifications expected for a computer/server.
If you just want to give ukbREST a try, and you are not a UK Biobank user, you can follow the guide in the wiki and use our simulated data.
If you are an approved UK Biobank researcher you are probably already familiar with this.
Once you have downloaded your encrypted application files, decrypt them and convert them
to CSV and HTML formats using ukbconv. Check out the Data Showcase documentation.
Copy all CSV and HTML files to a particular folder (for example, called phenotype).
You will have one CSV and one HTML file per dataset, each one with a specific Basket ID, like
the ones shown below for four different datasets with Basket IDs 1111, 2222, 3333, 4444:
$ ls -lh phenotype/*
-rw-rw-r-- 1 6.6G Jul 2 23:22 phenotype/ukb1111.csv
-rw-rw-r-- 1 6.4M Jul 2 23:19 phenotype/ukb1111.html
-rw-rw-r-- 1 2.7G Jul 2 23:20 phenotype/ukb2222.csv
-rw-rw-r-- 1 4.5M Jul 2 23:19 phenotype/ukb2222.html
-rw-rw-r-- 1 1012M Jul 2 23:22 phenotype/ukb3333.csv
-rw-rw-r-- 1 192K Jul 2 23:19 phenotype/ukb3333.html
-rw-rw-r-- 1 22G Jul 2 23:24 phenotype/ukb4444.csv
-rw-rw-r-- 1 4.1M Jul 2 23:19 phenotype/ukb4444.html
Make sure your phenotype CSV files do not have overlapping data-fields (use the latest data refresh for each basket).
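One way to check for overlapping data-fields before loading is to compare the header rows of the CSV files. The sketch below is not part of ukbREST; it assumes that the first row of each CSV holds the column names and that the participant ID column is called eid. The function name find_overlapping_columns is made up for this example.

```python
import csv
from collections import defaultdict
from pathlib import Path


def find_overlapping_columns(csv_paths):
    """Map each column name found in more than one CSV header to the files that contain it."""
    files_by_column = defaultdict(list)
    for path in csv_paths:
        # Only the header row is read, so multi-gigabyte files are not loaded in memory.
        with open(path, newline="", encoding="utf-8", errors="replace") as handle:
            header = next(csv.reader(handle), [])
        for column in set(header):
            if column != "eid":  # assumption: the participant ID appears in every file
                files_by_column[column].append(Path(path).name)
    return {col: names for col, names in files_by_column.items() if len(names) > 1}


if __name__ == "__main__":
    overlaps = find_overlapping_columns(sorted(Path("phenotype").glob("ukb*.csv")))
    for column, names in sorted(overlaps.items()):
        print(f"{column}: {', '.join(names)}")
```

If the script prints nothing, no column name is shared between files.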
For the genotype data you'll also have a specific folder, for instance, called genotype.
Here you have to copy your bgen, bgi (BGEN index) and sample (BGEN sample) files:
$ ls -lh genotype/*
-rw-rw-r-- 1 114G Mar 16 09:51 genotype/ukb_imp_chr10_v3.bgen
-rw-rw-r-- 1 198M Mar 16 10:12 genotype/ukb_imp_chr10_v3.bgen.bgi
-rw-rw-r-- 1 109G Mar 16 09:52 genotype/ukb_imp_chr11_v3.bgen
-rw-rw-r-- 1 201M Mar 16 10:12 genotype/ukb_imp_chr11_v3.bgen.bgi
-rw-rw-r-- 1 109G Mar 16 09:54 genotype/ukb_imp_chr12_v3.bgen
[...]
-rw-rw-r-- 1 9.3M Apr 6 09:41 genotype/ukb12345_imp_chr1_v3_s487395.sample
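A quick sanity check of the genotype folder can catch a missing index file early. This is a minimal sketch, not something ukbREST provides; it assumes the layout shown in the listing above, where each index is named after its BGEN file plus a .bgi suffix. The function name check_genotype_folder is hypothetical.

```python
from pathlib import Path


def check_genotype_folder(folder):
    """List the BGEN files, the BGEN files lacking a .bgi index, and the .sample files."""
    folder = Path(folder)
    bgen_files = sorted(folder.glob("*.bgen"))
    # Assumption: the index of "x.bgen" is stored next to it as "x.bgen.bgi".
    missing_index = [p.name for p in bgen_files if not Path(str(p) + ".bgi").exists()]
    sample_files = sorted(p.name for p in folder.glob("*.sample"))
    return {
        "bgen": [p.name for p in bgen_files],
        "missing_index": missing_index,
        "sample": sample_files,
    }


if __name__ == "__main__":
    report = check_genotype_folder("genotype")
    print("BGEN files:", len(report["bgen"]))
    print("Missing index:", report["missing_index"] or "none")
    print("Sample files:", report["sample"] or "none")
```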
Here we are going to start PostgreSQL and load the phenotype data into it. Start Docker in your server/computer and pull the PostgreSQL and ukbREST images:
$ docker pull postgres:11
$ docker pull hakyimlab/ukbrest
Create a network in Docker that we'll use to connect ukbREST with PostgreSQL:
$ docker network create ukb
Start the PostgreSQL container (here we are using user test with password test; you should
choose a stronger one):
$ docker run -d --name pg --net ukb -p 127.0.0.1:5432:5432 \
-e POSTGRES_USER=test -e POSTGRES_PASSWORD=test \
-e POSTGRES_DB=ukb \
postgres:11
Keep in mind that the above command runs PostgreSQL with the default settings, which can make queries sent to ukbREST very slow. See the installation instructions in the wiki for more details.
Then use the ukbREST Docker image to load your phenotype data into the PostgreSQL database. Here we are only loading your CSV/HTML main datasets, but keep in mind that you can also load Sample-QC or relatedness data, which is provided separately in UK Biobank. This is covered in the wiki.
In the command below, replace /full/path/to/genotype/folder/ and /full/path/to/phenotype/folder/ with the full
paths of your genotype and phenotype folders, and set UKBREST_GENOTYPE_BGEN_SAMPLE_FILE to the name of
your .sample file.
$ docker run --rm --net ukb \
-v /full/path/to/genotype/folder/:/var/lib/genotype \
-v /full/path/to/phenotype/folder/:/var/lib/phenotype \
-e UKBREST_GENOTYPE_BGEN_SAMPLE_FILE="ukb12345_imp_chr1_v3_s487395.sample" \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
-e UKBREST_LOADING_N_JOBS=2 \
hakyimlab/ukbrest --load
[...]
2018-07-20 22:50:34,962 - ukbrest - INFO - Loading finished!
We have sometimes found that a CSV file has the wrong encoding, which makes Python fail when reading it. If ukbREST runs into this, you'll see an error message about a Unicode decoding error. Check out the documentation to learn how to fix it.
You can also adjust the number of cores used when loading the data with the
variable UKBREST_LOADING_N_JOBS (set to 2 cores in the example above).
The documentation also explains the SQL schema, so you can take full advantage of it.
Once your main datasets are loaded, you only need to complete two more steps: 1) load the data-field codings and 2) create some useful SQL functions. Each step is a single command.
To load the data-field codings, run this:
$ docker run --rm --net ukb \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
hakyimlab/ukbrest --load-codings
This will load most of the data-field codings from the UK Biobank Data Showcase (they are in .tsv
format in the codings folder). This includes, for instance, data coding 19, which is used for
data-field 41202 (Diagnoses - main ICD10).
For your application, however, you might need to download a few more if you have specific data-fields.
This is covered in the documentation.
Finally, run this command to create some useful SQL functions you will likely use in your queries:
$ docker run --rm --net ukb \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
hakyimlab/ukbrest --load-sql
Now you only need to start the ukbREST server:
$ docker run --rm --net ukb -p 127.0.0.1:5000:5000 \
-e UKBREST_SQL_CHUNKSIZE="10000" \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
hakyimlab/ukbrest
For security reasons, note that with these commands both the ukbREST server and PostgreSQL are only reachable from your own computer/server: queries can only be made from the computer where ukbREST is running, not from the rest of the network.
Check out the documentation to set up ukbREST in a private and secure network and to add user authentication and SSL encryption.
Once ukbREST is up and running, you can request any data-field using
different query methods.
Column names for data-fields have this format: c{DATA_FIELD_ID}_{INSTANCE}_{ARRAY}.
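The column naming format can be captured in two small helpers, one that builds a column name and one that splits it back into its parts. This is an illustrative sketch based only on the format stated above; the helper names column_name and parse_column_name are made up and are not part of ukbREST.

```python
import re

# c{DATA_FIELD_ID}_{INSTANCE}_{ARRAY}, with all three parts written as integers
COLUMN_PATTERN = re.compile(r"^c(\d+)_(\d+)_(\d+)$")


def column_name(data_field_id, instance=0, array=0):
    """Build the column name for a data-field, instance and array index."""
    return f"c{data_field_id}_{instance}_{array}"


def parse_column_name(name):
    """Split a column name into (data_field_id, instance, array)."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a data-field column name: {name!r}")
    return tuple(int(part) for part in match.groups())


print(column_name(50))                  # c50_0_0
print(parse_column_name("c21002_1_0"))  # (21002, 1, 0)
```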
ukbREST lets you make queries in different ways. If you only need to access some data-fields,
you can use standard tools like curl to make your query. You can also use a YAML file to write
your data specification in one place and easily share it (for instance, when submitting your manuscript),
improving reproducibility of results for others working on UK Biobank. You can also specify the output file format (for example, CSV or the format used by plink or BGENIE).
You can request one or multiple data-fields using standard tools like curl.
Here we request two data-fields:
c50_0_0. We rename this data-field to height.
c21002_1_0. We rename it to weight.
$ curl -G \
-HAccept:text/csv \
"http://127.0.0.1:5000/ukbrest/api/v1.0/phenotype" \
--data-urlencode "columns=c50_0_0 as height" \
--data-urlencode "columns=c21002_1_0 as weight" \
> my_data.csv
Your data will be saved in file my_data.csv.
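The same request can be made from a script. The sketch below uses only the Python standard library and mirrors the curl command above: it repeats the columns parameter once per data-field and asks for CSV through the Accept header. The function names are made up for this example, and the base URL assumes ukbREST is listening on 127.0.0.1:5000 as in the commands above.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:5000/ukbrest/api/v1.0"


def build_phenotype_url(columns, base_url=BASE_URL):
    """Build the phenotype query URL; columns maps output name -> ukbREST column."""
    params = [("columns", f"{column} as {alias}") for alias, column in columns.items()]
    return f"{base_url}/phenotype?{urlencode(params)}"


def fetch_phenotype_csv(columns, output_path, base_url=BASE_URL):
    """Download the requested columns as CSV and write them to output_path."""
    request = Request(build_phenotype_url(columns, base_url), headers={"Accept": "text/csv"})
    with urlopen(request) as response, open(output_path, "wb") as out:
        shutil.copyfileobj(response, out)  # stream to disk instead of buffering in memory


if __name__ == "__main__":
    fetch_phenotype_csv({"height": "c50_0_0", "weight": "c21002_1_0"}, "my_data.csv")
```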
You can write your data specification in a YAML file. Take a look at this real example (we don't show results, of course, but you can try it with your UK Biobank data):
$ cat my_query.yaml
samples_filters:
  - c22006_0_0 = '1'
  - eid > 0
data:
  sex: c31_0_0
  smoking_status: >
    coalesce(
      nullifneg(c20116_2_0), nullifneg(c20116_1_0), nullifneg(c20116_0_0)
    )
  asthma:
    case_control:
      20002:
        coding: 1111
      41202:
        coding: [J45, J450, J451, J458, J459]
  hypertension:
    sql:
      1: >
        eid in (
          select eid from events
          where field_id in (values(20002)) and event in (
            select * from get_children_codings('20002', array[1081])
          )
        )
      0: >
        eid not in (
          select eid from events
          where field_id in (values(20002)) and event in (
            select * from get_children_codings('20002', array[1081, 1085])
          )
        )
$ curl -X POST \
-H "Accept: text/csv" \
-F file=@my_query.yaml \
-F section=data \
http://127.0.0.1:5000/ukbrest/api/v1.0/query \
> my_data.csv
The YAML file above has two sections: samples_filters, which is a set of filters applied to all samples
(in the example above we keep only samples flagged as Caucasian in data-field 22006), and data, which defines
a data specification that will be translated to a CSV file later. You can have as many data
specifications in one file as you want (you choose the one you want with the section parameter when calling curl).
The samples_filters will be applied to all of them.
The data section has four columns:
sex: it just selects data-field 31, instance 0, array 0.
smoking_status: picks the first non-empty value from all instances of data-field 20116, giving priority to the
latest data, from instance 2 down to instance 0. Since this data-field has a coding in which negative values
mean Prefer not to answer, we treat these values as empty using the function nullifneg (null if negative).
asthma: this one uses a feature for binary columns called case_control. Cases (with value 1 for this column)
will include all samples that have self-reported asthma (data-field 20002 with value 1111, which means asthma)
or that have an ICD10 code (hospital level data) that indicates asthma (J45, J450, J451, J458, J459). All the
remaining samples, which don't meet these criteria, are controls (with value 0 for this column).
hypertension: here we use a more advanced feature called sql, better suited for complex real scenarios, and also
employ another feature to select children of a hierarchically organized data-field (like self-reported diseases
or ICD10 codes). First, with sql, you can specify a column with several categorical values: 1 and 0 in this case;
for each of them you write the SQL code with the conditions. The SQL code for category 1 selects all samples
that have self-reported (data-field 20002) any disease in the tree of cardiovascular/hypertension: this includes
hypertension itself but also essential hypertension and gestational hypertension/pre-eclampsia. For this you use
the get_children_codings SQL function, indicating the data-field (20002) and the node id of the disease of
interest (1081 for hypertension; take a look at the codings for data-field 20002). In this case we are including
all instances of data-field 20002. Something similar is done for category 0, but in this case we are excluding
(eid not in...) all individuals with any disease under the parents hypertension (node id 1081) and venous
thromboembolic disease (node id 1085). Keep in mind that the function get_children_codings works recursively, so
all children further down the tree will be selected. If you would like, for example, to choose all individuals
with any self-reported cardiovascular disease, you would use get_children_codings('20002', array[1071]).
The wiki contains a page with real examples of YAML files. We encourage you to share yours!
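To make the logic of the smoking_status and hypertension columns concrete, here is a plain Python sketch of the behaviour described above. It is not how ukbREST implements these SQL functions. The toy tree is an assumption for illustration: node ids 1071, 1081 and 1085 come from the text, their parent/child arrangement is assumed, and 9001 and 9002 are placeholder ids standing in for the children of hypertension.

```python
def nullifneg(value):
    """Return None for negative values (e.g. 'Prefer not to answer'), else the value."""
    return None if value is not None and value < 0 else value


def coalesce(*values):
    """Return the first value that is not None, like SQL's coalesce."""
    return next((v for v in values if v is not None), None)


def children_codings(tree, node_ids):
    """Collect each node in node_ids followed by all of its descendants, recursively."""
    selected = []
    for node_id in node_ids:
        selected.append(node_id)
        selected.extend(children_codings(tree, tree.get(node_id, [])))
    return selected


# Toy hierarchy (assumption): parent node id -> child node ids.
TOY_TREE = {
    1071: [1081, 1085],  # cardiovascular -> hypertension, venous thromboembolic disease
    1081: [9001, 9002],  # hypertension -> placeholder ids for its children
}

# smoking_status: instance 2 is negative, instance 1 is missing, so instance 0 is used.
print(coalesce(nullifneg(-3), nullifneg(None), nullifneg(2)))  # 2
# hypertension, category 1: the node itself plus everything below it.
print(children_codings(TOY_TREE, [1081]))  # [1081, 9001, 9002]
```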
When you started ukbREST before, you didn't specify the genotype directory. This is fine if you are planning to just query data-fields. If you do want to get BGEN subsets, you need to add two parameters when starting ukbREST:
$ docker run --rm --net ukb -p 127.0.0.1:5000:5000 \
-v /full/path/to/genotype/folder/:/var/lib/genotype \
-e UKBREST_GENOTYPE_BGEN_FILE_NAMING="ukb_imp_chr{:d}_v3.bgen" \
-e UKBREST_SQL_CHUNKSIZE="10000" \
-e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
hakyimlab/ukbrest
In the command above, you need to put the full path to your genotype folder (where both the bgen and
bgi index files reside) in the -v option, and also specify the bgen file name template with the environment
variable UKBREST_GENOTYPE_BGEN_FILE_NAMING. The substring {:d} will be replaced by the chromosome number.
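The {:d} placeholder follows Python's format syntax for integers, so you can preview which file names a template will produce before starting the server. This sketch assumes the template is expanded the way Python's str.format does it, which matches the syntax but is not stated in this text; the function name bgen_file_for is made up.

```python
BGEN_FILE_NAMING = "ukb_imp_chr{:d}_v3.bgen"


def bgen_file_for(chromosome, template=BGEN_FILE_NAMING):
    """Expand the file name template for one chromosome (must be an int because of {:d})."""
    return template.format(chromosome)


for chromosome in (10, 11, 22):
    print(bgen_file_for(chromosome))
# ukb_imp_chr10_v3.bgen
# ukb_imp_chr11_v3.bgen
# ukb_imp_chr22_v3.bgen
```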
So if you want to get a subset of chromosome 22, let's say positions 0 to 1000, you run something like this:
$ curl http://localhost:5000/ukbrest/api/v1.0/genotype/22/positions/0/1000 \
> chr22_subset.bgen
With the query below, you can get a subset of the BGEN using a file specifying rsids:
$ cat rsids.txt
rs367896724
rs540431307
rs555500075
rs548419688
rs568405545
rs534229142
rs537182016
rs376342519
rs558604819
$ curl -X POST \
-F file=@rsids.txt \
http://localhost:5000/ukbrest/api/v1.0/genotype/1/rsids \
> chr1_subset.bgen
Note that in these two examples you get a bgen (binary) file. If you want to read it from your scripts in Python,
for instance, you can use a package like this one: https://github.com/limix/bgen-reader-py
Check out the wiki pages for more information.