Using dbgap2x, an R package to explore, download and decrypt phenotypic and genomic data from dbGaP

You can test this software:

Using the dockerized version on your local device by running

docker run -p 80:8888 -v /var/run/docker.sock:/var/run/docker.sock  gversmee/dbgap2x

and then open your web browser at http://localhost, and use the password dbgap2x

Using your local R by installing the package with

install.packages("devtools")
devtools::install_github("gversmee/dbgap2x")

To use the package with a fresh R installation, make sure your system has the following libraries: libcurl4-openssl-dev, libssl-dev and libxml2-dev. Example for a Debian system:

sudo apt-get update
sudo apt-get install libcurl4-openssl-dev libssl-dev libxml2-dev -y

Introduction

Load the package

#devtools::install_github("gversmee/dbgap2x", force = TRUE)
library(dbgap2x)

List the functions provided by the package

lsf.str("package:dbgap2x")

browse.dbgap : function (phs, no.browser = FALSE)  
browse.study : function (phs, no.browser = FALSE)  
consent.groups : function (phs)  
datatables.dict : function (phs)  
dbgap.data_dict : function (xml, dest)  
dbgap.decrypt : function (files, key = FALSE)  
dbgap.download : function (krt, key = FALSE)  
is.parent : function (phs)  
n.pop : function (phs, consentgroups = TRUE, gender = TRUE)  
n.tables : function (phs)  
n.variables : function (...)  
parent.study : function (phs)  
phs.version : function (phs)  
search.dbgap : function (term, no.browser = FALSE)  
study.name : function (phs)  
sub.study : function (phs)  
variables.dict : function (phs)

Search for dbGaP studies

Let's try to explore the "Jackson Heart Study" cohort that exists on dbGaP.

We created the function "search.dbgap", which helps you find the studies related to the term that you search for.

search.dbgap("Jackson")

https://www.ncbi.nlm.nih.gov/gap/?term=Jackson%5BStudy+Name%5D

Study ID	Study Name	Release Date	Nb Participants	Study Design	Project	Diseases	Ancestor ID	Ancestor Name	Molecular Data Type	Tumor Type	UID
phs001356.v1.p2	Exome Chip Genotyping: The Jackson Heart Study	2019-05-10	2788	Prospective Longitudinal Cohort	National Heart, Lung, Blood Institute	Cardiovascular Diseases, Hypertension, Diabetes Mellitus	phs000286.v6.p2	The Jackson Heart Study (JHS)	SNP Genotypes (Array)	germline	1692088
phs001098.v2.p2	T2D-GENES Multi-Ethnic Exome Sequencing Study: Jackson Heart Study	2019-05-10	1029	Case-Control	NHLBI GO-ESP	Diabetes Mellitus, Type 2	phs000286.v6.p2	The Jackson Heart Study (JHS)	SNP/CNV Genotypes (NGS), WXS	germline	1597258
phs000499.v4.p2	NHLBI Jackson Heart Study Candidate Gene Association Resource (CARe)	2019-05-10	3352	Prospective Longitudinal Cohort	NHLBI CARe	Longitudinal Studies	phs000286.v6.p2	The Jackson Heart Study (JHS)	SNP Genotypes (Array)	germline, unspecified	1597257
phs000498.v4.p2	Jackson Heart Study Allelic Spectrum Project	2019-05-10	1983	Prospective Longitudinal Cohort	National Heart, Lung, Blood Institute	Cardiovascular Diseases	phs000286.v6.p2	The Jackson Heart Study (JHS)	SNP Genotypes (NGS), WXS	germline	1597256
phs000286.v6.p2	The Jackson Heart Study (JHS)	2019-05-10	3889	Prospective Longitudinal Cohort	National Heart, Lung, Blood Institute, NHLBI GO-ESP, NHLBI CARe	Cardiovascular Diseases, Coronary Artery Disease, Diabetes Mellitus, Type 2, Obesity, Hypertension, Kidney Failure, Chronic, Stroke, Heart Failure, Peripheral Vascular Diseases, Arrhythmias, Cardiac				germline, unspecified	1597254
phs000964.v3.p1	NHLBI TOPMed: The Jackson Heart Study	2018-05-18	3596	Prospective Longitudinal Cohort	National Human Genome Research Institute	Cardiovascular Diseases, Hypertension, Diabetes Mellitus			SNP/CNV Genotypes (NGS), WGS	germline	1768620

dbGaP returns the list of the studies related to your term. As you can see, there are 6 studies associated with the "Jackson Heart Study" (JHS). One of these studies is the main one, a.k.a. the "parent study", whereas the others are substudies. In this case, phs000286 is the parent study. First, we can use the phs.version() function to make sure that we are looking at the latest version of the study. We can abbreviate the phs name by giving just the digits, or we can use the full dbGaP ID.

phs.version("286")

'phs000286.v6.p2'
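The abbreviated forms accepted above ("286", "000286", "phs499") all point to a six-digit phs accession. The sketch below shows one way such a normalization could be written in base R. The helper name normalize_phs is hypothetical and is not part of dbgap2x; the package may resolve abbreviations differently.

```r
# Hypothetical helper (NOT part of dbgap2x): turn an abbreviated study id
# into the canonical "phsNNNNNN" form. Works on a single string.
normalize_phs <- function(x) {
  # Drop an optional "phs" prefix, then any ".vN.pN" version suffix.
  digits <- sub("^phs", "", x, ignore.case = TRUE)
  digits <- sub("\\..*$", "", digits)
  if (!grepl("^[0-9]{1,6}$", digits)) {
    stop("not a recognisable dbGaP study id: ", x)
  }
  # Left-pad the study number with zeros to six digits.
  sprintf("phs%06d", as.integer(digits))
}

normalize_phs("286")
normalize_phs("phs499")
normalize_phs("phs000286.v6.p2")
```

Note that this only rebuilds the accession string; finding the current version and participant set (the ".v6.p2" part) still requires a call such as phs.version().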

The is.parent() function is useful to test whether a study is a parent study or a substudy

is.parent("000286") # JHS main cohort
is.parent("phs499") # substudy "CARe" for JHS

TRUE

FALSE

If you don't know the parent study of a substudy, try parent.study()

parent.study("phs000499")

'phs000286.v6.p2'
'Jackson Heart Study (JHS) Cohort'

Conversely, use sub.study() to get the names and IDs of the substudies of a parent study

sub.study("286")

phs	name
phs001356.v1.p2	Exome Chip Genotyping: The Jackson Heart Study
phs000498.v4.p2	Jackson Heart Study Allelic Spectrum Project
phs001069.v1.p2	MIGen_ExS: JHS
phs000402.v4.p2	NHLBI GO-ESP: Heart Cohorts Exome Sequencing Project (JHS)
phs000499.v4.p2	NHLBI Jackson Heart Study Candidate Gene Association Resource (CARe)
phs001098.v2.p2	T2D-GENES Multi-Ethnic Exome Sequencing Study: Jackson Heart Study

If you want to get the name of a study from its dbGaP id, use study.name()

study.name("286")

'Jackson Heart Study (JHS) Cohort'

Finally, you can view your study on dbGaP with browse.dbgap().

If a website exists for this study, you can browse it using browse.study()

browse.dbgap("286", no.browser = TRUE)
browse.study("286", no.browser = TRUE)

'https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000286.v6.p2'

'https://www.jacksonheartstudy.org'

Explore the characteristics of your study

Each dbGaP study can have multiple consent groups, each with its own specificities. Use consent.groups() to get the number and the names of the consent groups in the study that you are exploring. Let's keep focusing on JHS.

JHS <- "phs000286"
consent.groups(JHS)

	shortName	longName
0	NRUP	Subjects did not participate in the study, did not complete a consent document and are included only for the pedigree structure and/or genotype controls, such as HapMap subjects
1	HMB-IRB-NPU	Health/Medical/Biomedical (IRB, NPU)
2	DS-FDO-IRB-NPU	Disease-Specific (Focused Disease Only, IRB, NPU)
3	HMB-IRB	Health/Medical/Biomedical (IRB)
4	DS-FDO-IRB	Disease-Specific (Focused Disease Only, IRB)

Use n.pop() to get the number of participants included in each consent group

n.pop(JHS)
n.pop(JHS, consentgroups = FALSE)

consent_group	male	female	total
HMB-IRB	2409	3046	5885
HMB-IRB-NPU	265	511	883
DS-FDO-IRB-NPU	63	109	201
HMB-IRB	793	1249	2289
DS-FDO-IRB	174	295	516
TOTAL	3704	5210	9774

9774
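As a quick sanity check, the per-group counts can be summed and compared with the TOTAL row. The sketch below rebuilds the table printed above by hand (group labels copied verbatim), assuming the result can be treated as a plain data frame with these column names. It does not call dbgap2x.

```r
# Values copied from the n.pop(JHS) output above (TOTAL row left out).
pop <- data.frame(
  consent_group = c("HMB-IRB", "HMB-IRB-NPU", "DS-FDO-IRB-NPU",
                    "HMB-IRB", "DS-FDO-IRB"),
  male   = c(2409, 265, 63, 793, 174),
  female = c(3046, 511, 109, 1249, 295),
  total  = c(5885, 883, 201, 2289, 516),
  stringsAsFactors = FALSE
)

# Column sums should reproduce the TOTAL row: 3704, 5210, 9774.
colSums(pop[, c("male", "female", "total")])

# Share of each consent group in the whole study, in percent.
pop$share <- round(100 * pop$total / sum(pop$total), 1)
pop[, c("consent_group", "total", "share")]
```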

Use n.tables() and n.variables() to get the number of datatables in your study and the total number of variables

n.tables(JHS)
n.variables(JHS)

112

4856

datatables.dict() returns a data frame with the datatable IDs (phtxxxxxx) and descriptions for your study

tablesdict <- datatables.dict(JHS)
head(tablesdict)

pht	dt_study_name	dt_label
pht008811.v1	MIGen_JHS_AA_Subject_Phenotypes	Subject ID, age, sex, cohort, consortium, T2D affection status, weight, BMI, waist circumference, height, LDL, HDL, total cholesterol, blood pressure, adiponectin, debates age, creatinine, fasting glucose, fasting insulin, HbA1C, leptin, triglycerides, and medication of participants involved in the "Myocardial Infarction Genetics Exome Sequencing Consortium: Jackson Heart Study" project.
pht008783.v1	sbpc	sbpc
pht008727.v1	allevthf	allevthf
pht001959.v2	loca	loca
pht001945.v2	cena	cena
pht001957.v2	hcaa	hcaa

variables.dict() returns a data frame with the variable IDs (phvxxxxxx), their names in the study, the datatable they come from and their descriptions

vardict <- variables.dict(JHS)
head(vardict)

dt_study_name	phv	var_name	var_desc
MIGen_JHS_AA_Subject_Phenotypes	phv00404354.v1	SUBJECT_ID	De-identified Subject ID
MIGen_JHS_AA_Subject_Phenotypes	phv00404355.v1	sex	Gender of participant
sbpc	phv00403830.v1	SUBJECT_ID	PARTICIPANT ID [Visit 1] [Sitting Blood Pressure Form, SBP]
sbpc	phv00403831.v1	VISIT	CONTACT OCCASION [Visit 1] [Sitting Blood Pressure Form, SBP]
sbpc	phv00403832.v1	SBPC1	Q1. A. Temperature. Room temperature (degrees centigrade) [Visit 1] [Sitting Blood Pressure Form, SBP]
sbpc	phv00403833.v1	SBPC2	Q2. B. Tobacco and caffeine use, physical activity, and medication. Have you smoked or used chewing tobacco, nicotine gum or snuff today or do you wear a nicotine patch? [Visit 1] [Sitting Blood Pressure Form, SBP]
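With several thousand variables, the dictionary is easier to use once it is filtered. The sketch below searches the descriptions by keyword with base R. It uses a small hand-built stand-in for vardict, copied from a few of the rows printed above, and assumes the real data frame has the same four column names.

```r
# Stand-in for a few rows of variables.dict(JHS), copied from the output above.
vardict <- data.frame(
  dt_study_name = c("MIGen_JHS_AA_Subject_Phenotypes",
                    "MIGen_JHS_AA_Subject_Phenotypes", "sbpc", "sbpc"),
  phv = c("phv00404354.v1", "phv00404355.v1",
          "phv00403830.v1", "phv00403832.v1"),
  var_name = c("SUBJECT_ID", "sex", "SUBJECT_ID", "SBPC1"),
  var_desc = c(
    "De-identified Subject ID",
    "Gender of participant",
    "PARTICIPANT ID [Visit 1] [Sitting Blood Pressure Form, SBP]",
    "Q1. A. Temperature. Room temperature (degrees centigrade) [Visit 1] [Sitting Blood Pressure Form, SBP]"
  ),
  stringsAsFactors = FALSE
)

# Case-insensitive keyword search over the variable descriptions.
hits <- vardict[grepl("blood pressure", vardict$var_desc, ignore.case = TRUE), ]
hits[, c("phv", "var_name")]

# Number of variables per datatable.
table(vardict$dt_study_name)
```

The same grepl() filter can be applied to the dt_label column of the datatables dictionary to find a table of interest first.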

Extract your study

Get your dbGaP repository key

To download or decrypt your data from dbGaP, you will need to request access and get a decryption key. Follow these steps to get your dbGaP repository key:

- Go to https://www.ncbi.nlm.nih.gov/gap and click on `controlled access data`

- Click on Log in to dbGaP

- Identify yourself with your eRA Commons ID and password

- Get a PI dbGaP repository key:

To download the files and decrypt them, you will need a decryption key. This key can be found in a PI's dbGaP account. Go to the Authorized Access tab, then to My Projects. Then, in the Actions column on the right of your screen, find Get no password dbGaP repository key.

Decrypt the .ncbi_enc files

On dbGaP, the phenotypic files are encrypted. We created a decryption function that uses a dockerized version of sratoolkit. To use that function, you need to have Docker installed on your device (www.docker.com). If you are using the dockerized version of this software (available at hub.docker.com/r/gversmee/dbgap2x), Docker is already installed, but you'll need to upload your key to the Jupyter working directory.

key <- "path/to/your/key.ngc"
files <- "path/to/directory/of_encrypted_files"
dbgap.decrypt(files, key)

You should see a "decrypted_files" directory in the directory where your encrypted files are located
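Before starting a decryption run, it can help to confirm that the key and the encrypted files are where you expect them. The helper below is a hypothetical sketch in base R, not part of dbgap2x: it checks both paths and lists the .ncbi_enc files it finds. The demonstration runs on empty throwaway files in a temporary directory, so it needs neither Docker nor real dbGaP data.

```r
# Hypothetical pre-flight check (NOT part of dbgap2x).
check_decrypt_inputs <- function(files, key) {
  if (!file.exists(key)) stop("key file not found: ", key)
  if (!dir.exists(files)) stop("directory not found: ", files)
  # Keep only files carrying the dbGaP encryption extension.
  enc <- list.files(files, pattern = "\\.ncbi_enc$", full.names = TRUE)
  if (length(enc) == 0) warning("no .ncbi_enc files in ", files)
  enc
}

# Demonstration on empty placeholder files.
demo_dir <- file.path(tempdir(), "encrypted_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir,
                      c("a.txt.ncbi_enc", "b.txt.ncbi_enc", "notes.txt")))
demo_key <- file.path(demo_dir, "demo.ngc")
file.create(demo_key)

enc <- check_decrypt_inputs(demo_dir, demo_key)
basename(enc)

# With real data, the next step would be:
# dbgap.decrypt(files, key)
```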

Download dbGaP files

- Click on "file selector"

This gives you access to the dbGaP file selector where you can find all the files available for the selected project. To find it, go to the Authorized Access and then My Projects tabs. Then, in the column Actions on the right of your screen, find file selector.

- Filter by study accession

Here, we want to get the phenotypic data for the study "Early onset COPD", so after checking Study accession, we select "phs000946".

- Filter again

Since we are only interested in getting the phenotypic data, let's filter by Content type and select phenotype individual-auxiliary and phenotype individual-traits.

- Select the files

Click on "+" to select all the files.

- Click on "Cart file"

This will download a .krt file to your download folder.

Download and decrypt the files

key <- "path/to/your/key.ngc"
cart <- "path/to/your/cartfile.krt"
dbgap.download(cart, key)

You should see in your working directory a new folder named dbGaP-*** that contains your files
