privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
186 stars 44 forks source link

Using snp_readBGEN on dnanexus RStudio server #357

Closed edobrijevic closed 2 years ago

edobrijevic commented 2 years ago

Hi, I'm trying to work with the imputed data from UK Biobank. I'm using the RStudio workbench through dnanexus (as data can't be downloaded). I keep running into errors with space to create the .bk file. I am trying to run one chromosome at at time. Below is my code. Any help would be appreciated.

the .bgen bgi.bgen .mfi.txt and .sample files are uploaded to the server with dx download in terminal

chr1_all <- read.table("ukb22828_c1_b0_v3.mfi.txt") #generate list of snp ids chr1_snplist <- list(chr1_all[,1]) #make snp list chr1Data <- snp_readBGEN(bgenfiles = "ukb22828_c1_b0_v3.bgen", backingfile = "/home/rstudio-server", list_snp_id = chr1_snplist, ncores = 1)

The error is the following Error: Not enough disk space to create '/home/rstudio-server.bk'.

Furthermore - I'm assuming from the documentation I have seen that the UKBiobank eids are read with the imputed data through snp_readBGEN so it can be linked to the phenotypic data?

privefl commented 2 years ago
edobrijevic commented 2 years ago

Hi, thanks for the answer. Would you be able to assist with 2 follow up questions?

  1. I do only want a subset of variants e.g. only 10 SNPs on chromosome 1. How would I only read a subset? snplist has to be the same length as the bgen file from UKBB?
  2. How do I link the variants with the eids (form the .sample file)?

Apologies, I am new to genetic analyses.

privefl commented 2 years ago
edobrijevic commented 2 years ago

Thanks Florian! I have fixed the snplist issue - in the .bgen.bgi file some snps have rsids in columns 1 and 2 (as opposed to chr_pos_a1_a2). I see that we can input snps as rsids as addressed in #196?

With regards to getting the eids - I see this is noted in #217. And I tried following the linked code but it didn't work. Below is my code;

snp list for chromosome 1

chr1 <- c('1_15914545_C_T', '1_82957871_G_C', '1_94050911_G_C', '1_33760743_T_C', '1_47965130_A_G', '1_155131394_G_T', '1_200271408_C_T') chr1_snplist <- list(chr1)

getting eids

sample <- bigreadr::fread2("ukb22828_c1_b0_v3.sample")[-1, ] #reads in sample file ind.indiv <- sample$ID_2 #vector of eids

chr1Data <- snp_readBGEN(bgenfiles = "ukb22828_c1_b0_v3.bgen", backingfile = "/home/rstudio-server/chr1file_3", list_snp_id = chr1_snplist, ncores = 1, ind_row = ind.indiv[sub])

Error = Error in ind.indiv[sub] : invalid subscript type 'closure'

The above code works for snp_readBGEN if I don't include 'ind_row = ind.indiv[sub]' and reads in the bgen files. I eventually want to end up with a matrix of eids and the dosages of the SNPs. Which I see I should be able to then generate with the following code.

chr1_imp <- snp_attach(chr1Data) chr1_gen <- chr1_imp$genotypes chr1_dosage <- chr1_gen[]

privefl commented 2 years ago

Basically, ind_row needs to be matched(eid, sample$ID_2), where eid are the ones you're interesting in.

privefl commented 2 years ago

There are three linked examples in the documentation, please have a look at them.

edobrijevic commented 2 years ago

Hi Florian, I'm sorry to keep asking questions. I've been through the documentation and through some other issues (e.g. #301, #184) Can I please check that my understanding is correct

chr1Data <- bigsnpr::snp_readBGEN( bgenfiles = "ukb22828_c1_b0_v3.bgen", list_snp_id = chr1_snplist, backingfile = "/home/rstudio-server/chr1file", ncores = 1, ind_row = ind.indiv[sub] )

chr1_imp <- snp_attach(chr1Data) chr1_gen <- chr1_imp$genotypes chr1_dosage <- chr1_gen[]

eids <- data.frame(ind.indiv) chr1_matched <- cbind(eids, chr1_dosage)

This outputs

head(chr1_matched) ind.indiv 1 2 3 4 5 6 7 1 349096 2 1.00 0 1.97 2 0 1.00 2 302763 1 1.00 0 0.00 2 0 2.00 3 36399 1 0.75 0 2.00 2 0 1.96 4 339075 0 1.00 0 2.00 1 1 2.00 5 473830 0 1.00 1 2.00 2 1 2.00 6 59902 1 1.00 0 2.00 2 1 2.00

privefl commented 2 years ago
edobrijevic commented 2 years ago
privefl commented 2 years ago

This should be okay, but you need to use sub when defining eids, I think it should be cohort$eid[sub], or equivalently sample$ID_2[ind.indiv[sub]].