edobrijevic commented 2 years ago

Hi, I'm trying to work with the imputed data from UK Biobank. I'm using the RStudio workbench through dnanexus (as data can't be downloaded). I keep running into errors with space to create the .bk file. I am trying to run one chromosome at at time. Below is my code. Any help would be appreciated.

the .bgen bgi.bgen .mfi.txt and .sample files are uploaded to the server with dx download in terminal

chr1_all <- read.table("ukb22828_c1_b0_v3.mfi.txt") #generate list of snp ids chr1_snplist <- list(chr1_all[,1]) #make snp list chr1Data <- snp_readBGEN(bgenfiles = "ukb22828_c1_b0_v3.bgen", backingfile = "/home/rstudio-server", list_snp_id = chr1_snplist, ncores = 1)

The error is the following Error: Not enough disk space to create '/home/rstudio-server.bk'.

Furthermore - I'm assuming from the documentation I have seen that the UKBiobank eids are read with the imputed data through snp_readBGEN so it can be linked to the phenotypic data?

privefl commented 2 years ago

The UKBB data contains around 90M variants across 500K individuals, so if storing element with one byte (in the .bk file), if would require 90e6 * 500e3 / 2^40= 41 TB of disk storage. Usually I read only a subset of variants and/or individuals.
No, you should get the eids from the .sample file.

edobrijevic commented 2 years ago

Hi, thanks for the answer. Would you be able to assist with 2 follow up questions?

I do only want a subset of variants e.g. only 10 SNPs on chromosome 1. How would I only read a subset? snplist has to be the same length as the bgen file from UKBB?
How do I link the variants with the eids (form the .sample file)?

Apologies, I am new to genetic analyses.

privefl commented 2 years ago

You need to provide snplist only for these 10 variants.
I use match() to get the indices that match the eids in the .sample file with the ones in the .csv file.

edobrijevic commented 2 years ago

Thanks Florian! I have fixed the snplist issue - in the .bgen.bgi file some snps have rsids in columns 1 and 2 (as opposed to chr_pos_a1_a2). I see that we can input snps as rsids as addressed in #196?

With regards to getting the eids - I see this is noted in #217. And I tried following the linked code but it didn't work. Below is my code;

snp list for chromosome 1

chr1 <- c('1_15914545_C_T', '1_82957871_G_C', '1_94050911_G_C', '1_33760743_T_C', '1_47965130_A_G', '1_155131394_G_T', '1_200271408_C_T') chr1_snplist <- list(chr1)

getting eids

sample <- bigreadr::fread2("ukb22828_c1_b0_v3.sample")[-1, ] #reads in sample file ind.indiv <- sample$ID_2 #vector of eids

chr1Data <- snp_readBGEN(bgenfiles = "ukb22828_c1_b0_v3.bgen", backingfile = "/home/rstudio-server/chr1file_3", list_snp_id = chr1_snplist, ncores = 1, ind_row = ind.indiv[sub])

Error = Error in ind.indiv[sub] : invalid subscript type 'closure'

The above code works for snp_readBGEN if I don't include 'ind_row = ind.indiv[sub]' and reads in the bgen files. I eventually want to end up with a matrix of eids and the dosages of the SNPs. Which I see I should be able to then generate with the following code.

chr1_imp <- snp_attach(chr1Data) chr1_gen <- chr1_imp$genotypes chr1_dosage <- chr1_gen[]

privefl commented 2 years ago

Basically, ind_row needs to be matched(eid, sample$ID_2), where eid are the ones you're interesting in.

privefl commented 2 years ago

There are three linked examples in the documentation, please have a look at them.

edobrijevic commented 2 years ago

Hi Florian, I'm sorry to keep asking questions. I've been through the documentation and through some other issues (e.g. #301, #184) Can I please check that my understanding is correct

i have used ind_row = ind.indiv[sub] to read in the begn files where ind.indiv = ind.indiv <- match(cohort$eid, sample$ID_2)
the above then only reads in individuals who are in my desired cohort (true - my object is 407,293 rows, which is my cohort)
my bigsnpr object does not have a $fam
but the order of the individuals are the same as the ind.indiv vector?
so if I do the following I have a matrix of the eids by the dosages of each snp I've selected

chr1Data <- bigsnpr::snp_readBGEN( bgenfiles = "ukb22828_c1_b0_v3.bgen", list_snp_id = chr1_snplist, backingfile = "/home/rstudio-server/chr1file", ncores = 1, ind_row = ind.indiv[sub] )

chr1_imp <- snp_attach(chr1Data) chr1_gen <- chr1_imp$genotypes chr1_dosage <- chr1_gen[]

eids <- data.frame(ind.indiv) chr1_matched <- cbind(eids, chr1_dosage)

This outputs

head(chr1_matched) ind.indiv 1 2 3 4 5 6 7 1 349096 2 1.00 0 1.97 2 0 1.00 2 302763 1 1.00 0 0.00 2 0 2.00 3 36399 1 0.75 0 2.00 2 0 1.96 4 339075 0 1.00 0 2.00 1 1 2.00 5 473830 0 1.00 1 2.00 2 1 2.00 6 59902 1 1.00 0 2.00 2 1 2.00

privefl commented 2 years ago

what is sub here?
you need to read the information on individuals from the separate field with all the fields (and match the order with the eids)
yes, the eids are read in the order you provide the indices.
cohort$eid is the same as eids?

edobrijevic commented 2 years ago

sub is defined here length(sub <- which(!is.na(cohort$eid)))
my cohort file as [eid, sex, age, pc1:15...]. I made it in juptyerlab. This is where I have identified what eids I want. Is that what you mean?
That is good to know - so I think my matrix is valid
yes cohort$eid is the same as eids (I've just made it as a stand alone vector because that is sometimes how my brain works!)

privefl commented 2 years ago

This should be okay, but you need to use sub when defining eids, I think it should be cohort$eid[sub], or equivalently sample$ID_2[ind.indiv[sub]].

privefl / bigsnpr

Using snp_readBGEN on dnanexus RStudio server #357

the .bgen bgi.bgen .mfi.txt and .sample files are uploaded to the server with dx download in terminal

snp list for chromosome 1

getting eids