Closed edobrijevic closed 2 years ago
The UKBB data contains around 90M variants across 500K individuals, so if storing element with one byte (in the .bk file), if would require 90e6 * 500e3 / 2^40
= 41 TB of disk storage. Usually I read only a subset of variants and/or individuals.
No, you should get the eids from the .sample file.
Hi, thanks for the answer. Would you be able to assist with 2 follow up questions?
Apologies, I am new to genetic analyses.
You need to provide snplist
only for these 10 variants.
I use match()
to get the indices that match the eids in the .sample file with the ones in the .csv file.
Thanks Florian! I have fixed the snplist issue - in the .bgen.bgi file some snps have rsids in columns 1 and 2 (as opposed to chr_pos_a1_a2). I see that we can input snps as rsids as addressed in #196?
With regards to getting the eids - I see this is noted in #217. And I tried following the linked code but it didn't work. Below is my code;
chr1 <- c('1_15914545_C_T', '1_82957871_G_C', '1_94050911_G_C', '1_33760743_T_C', '1_47965130_A_G', '1_155131394_G_T', '1_200271408_C_T') chr1_snplist <- list(chr1)
sample <- bigreadr::fread2("ukb22828_c1_b0_v3.sample")[-1, ] #reads in sample file ind.indiv <- sample$ID_2 #vector of eids
chr1Data <- snp_readBGEN(bgenfiles = "ukb22828_c1_b0_v3.bgen", backingfile = "/home/rstudio-server/chr1file_3", list_snp_id = chr1_snplist, ncores = 1, ind_row = ind.indiv[sub])
Error = Error in ind.indiv[sub] : invalid subscript type 'closure'
The above code works for snp_readBGEN if I don't include 'ind_row = ind.indiv[sub]' and reads in the bgen files. I eventually want to end up with a matrix of eids and the dosages of the SNPs. Which I see I should be able to then generate with the following code.
chr1_imp <- snp_attach(chr1Data) chr1_gen <- chr1_imp$genotypes chr1_dosage <- chr1_gen[]
Basically, ind_row
needs to be matched(eid, sample$ID_2)
, where eid
are the ones you're interesting in.
There are three linked examples in the documentation, please have a look at them.
Hi Florian, I'm sorry to keep asking questions. I've been through the documentation and through some other issues (e.g. #301, #184) Can I please check that my understanding is correct
chr1Data <- bigsnpr::snp_readBGEN( bgenfiles = "ukb22828_c1_b0_v3.bgen", list_snp_id = chr1_snplist, backingfile = "/home/rstudio-server/chr1file", ncores = 1, ind_row = ind.indiv[sub] )
chr1_imp <- snp_attach(chr1Data) chr1_gen <- chr1_imp$genotypes chr1_dosage <- chr1_gen[]
eids <- data.frame(ind.indiv) chr1_matched <- cbind(eids, chr1_dosage)
This outputs
head(chr1_matched) ind.indiv 1 2 3 4 5 6 7 1 349096 2 1.00 0 1.97 2 0 1.00 2 302763 1 1.00 0 0.00 2 0 2.00 3 36399 1 0.75 0 2.00 2 0 1.96 4 339075 0 1.00 0 2.00 1 1 2.00 5 473830 0 1.00 1 2.00 2 1 2.00 6 59902 1 1.00 0 2.00 2 1 2.00
sub
here?cohort$eid
is the same as eids
?This should be okay, but you need to use sub
when defining eids
, I think it should be cohort$eid[sub]
, or equivalently sample$ID_2[ind.indiv[sub]]
.
Hi, I'm trying to work with the imputed data from UK Biobank. I'm using the RStudio workbench through dnanexus (as data can't be downloaded). I keep running into errors with space to create the .bk file. I am trying to run one chromosome at at time. Below is my code. Any help would be appreciated.
the .bgen bgi.bgen .mfi.txt and .sample files are uploaded to the server with dx download in terminal
chr1_all <- read.table("ukb22828_c1_b0_v3.mfi.txt") #generate list of snp ids chr1_snplist <- list(chr1_all[,1]) #make snp list chr1Data <- snp_readBGEN(bgenfiles = "ukb22828_c1_b0_v3.bgen", backingfile = "/home/rstudio-server", list_snp_id = chr1_snplist, ncores = 1)
The error is the following Error: Not enough disk space to create '/home/rstudio-server.bk'.
Furthermore - I'm assuming from the documentation I have seen that the UKBiobank eids are read with the imputed data through snp_readBGEN so it can be linked to the phenotypic data?