privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
183 stars 43 forks source link

question: how to get backingfile for bgen format? #138

Closed complexgenome closed 3 years ago

complexgenome commented 3 years ago

hi devs,

Thanks much for the amazing tool, it is too complex for me thus opening an issue.

I started with plink files with loads of NA https://github.com/privefl/bigsnpr/issues/124 Then I removed any SNP until with any missingness. This led me to h2 error that everything is negative. Error: 'h2' should have only positive values.

I'd like to work with the dosage/imputed data frim IMPUT2 tool. I created .bgen files, their index per CHR. However, I do not know how to make a backingfile for it.

  rds <- bigsnpr::snp_readBGEN(
    bgenfiles="/mnt/mfs/hgrcgrid/CHR22_subset_edited_qctool.bgen",
    bgi_dir = "/mnt/mfs/hgrcgrid/",
    ncores = 8  )

Error:

Error in path.expand(backingfile) :
  argument "backingfile" is missing, with no default

https://github.com/privefl/bigsnpr/issues/34

I'd like to work with all the SNPs in the bgen. They follow format: CHR:POS:A1:A2 that is 1:1000:C:T where A2 is effect allele. This brings me to the paramterlist_snp_id. What does it need exactly?

How do I proceed?

R version 4.0.0 (2020-04-24)
data.table_1.13.0 bigsnpr_1.4.11    bigstatsr_1.2.3
privefl commented 3 years ago

You can look at some code I've used for my papers, e.g. https://github.com/privefl/paper-ldpred2/blob/master/code/prepare-genotypes.R. But you should start by the looking at the documentation ?bigsnpr::snp_readBGEN.

I've been planning on making a tutorial on how to import various formats into my bigSNP format, but there are things preventing me from doing this right now.

complexgenome commented 3 years ago

Thank you for the link, but I was about to update this

bim_file=data.table::fread("/mnt/mfs/hgrcgrid/CHR22_subset_plink.bim")
SNP_list<-bim_file$V2

file_snpbed<-paste0("/mnt/mfs/hgrcgrid/CHR22_subset_plink.bed")
snp_readBed(file_snpbed)
  rds <- bigsnpr::snp_readBGEN( bgenfiles="/mnt/mfs/hgrcgrid/CHR22_subset_edited_qctool.bgen",
    bgi_dir = "/mnt/mfs/hgrcgrid/",
    backingfile="/mnt/mfs/hgrcgrid/CHR22_subset_plink.rds",
    ncores = 8, list_snp_id=list(SNP_list)   )

Error: Error in { : task 1 failed - "Wrong format of some variants."

I looked into the .bgen file using

bgenix -g CHR22_subset_edited_qctool.bgen -vcf | head | cut -f 1-10

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  some_IID
        16360794        22:16360794:T:C;22      T       C       .       .       .       GT:GP   0/0:1,0,0
        16362643        22:16362643:C:T;22      C       T       .       .       .       GT:GP   0/0:1,0,0
        16414031        22:16414031:G:A;22      G       A       .       .       .       GT:GP   0/0:1,0,0
        16441015        22:16441015:G:A;22      G       A       .       .       .       GT:GP   0/0:1,0,0
        16442761        22:16442761:A:G;22      A       G       .       .       .       GT:GP   0/0:1,0,0

That is there is no CHR. I was more interested in the 0 prefix https://github.com/privefl/bigsnpr/issues/97#issuecomment-660943596 which isn't case

I don't know how to transform data further. Best,

privefl commented 3 years ago

How do you convert from IMPUTE2 format to BGEN? You could look into why the chromosome information is missing, and whether you could add it with some option/parameter.

complexgenome commented 3 years ago

I was able to fix CHR and SNP IDs from the IMPUTE2 output format. The VCF files look convincing to me but the error persists when loading with snp_readBGEN

bgenix -g CHR22_subset_edited_qctool.bgen -vcf | head | cut -f 1-10

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  IID_G32868
22      16360794        22:16360794:T:C T       C       .       .       .       GT:GP   0/0:1,0,0
22      16362643        22:16362643:C:T C       T       .       .       .       GT:GP   0/0:1,0,0
22      16414031        22:16414031:G:A G       A       .       .       .       GT:GP   0/0:1,0,0
22      16441015        22:16441015:G:A G       A       .       .       .       GT:GP   0/0:1,0,0
22      16442761        22:16442761:A:G A       G       .       .       .       GT:GP   0/0:1,0,0
22      16514102        22:16514102:C:A C       A       .       .       .       GT:GP   0/0:1,0,0
22      16514234        22:16514234:A:C A       C       .       .       .       GT:GP   0/0:1,0,0
22      16518108        22:16518108:G:A G       A       .       .       .       GT:GP   0/0:1,0,0

Code to run bgen file:


 rds <- bigsnpr::snp_readBGEN(
     bgenfiles="/mnt/mfs/hgrcgrid/BGEN_conversion/CHR22_subset_edited_qctool.bgen",
     bgi_dir = "/mnt/mfs/hgrcgrid/BGEN_conversion/",
 backingfile="/mnt/mfs/hgrcgrid/CHR22_subset_plink.rds",
     ncores = 8, list_snp_id=list(SNP_list)
   )

Error: Error in { : task 1 failed - "Wrong format of some variants." Please let me know if any other details are required. Thanks,

privefl commented 3 years ago

This error is not with the bgi file, but with what you input as list_snp_id. You can e.g. show the output of str(list_snp_id).

complexgenome commented 3 years ago

Please see below output:

> str(SNP_list) chr [1:108618] "22:16360794:T:C" "22:16362643:C:T" "22:16414031:G:A" ...

> str(list(SNP_list))

List of 1 $ : chr [1:108618] "22:16360794:T:C" "22:16362643:C:T" "22:16414031:G:A" "22:16441015:G:A" ...

I'm working with only one CHR. I think data are messing them up chr [1:108618]?

privefl commented 3 years ago

The separator is _, not :. Please see the doc.

complexgenome commented 3 years ago

OK, edited completely all SNPs, bgen, plink files, still error

task 1 failed - "Some variants have not been found (stored in /mnt/mfs/hgrcgrid//CHR22_subset_edited_qctool_not_found.rds')."

the file name definitely isn't there /mnt/mfs/hgrcgrid//CHR22_subset_editedqctoolnot_found.rds' Probably it mixed pasting error message.

head(str(SNP_list))
 chr [1:108618] "22_16360794_T_C" "22_16362643_C_T" "22_16414031_G_A" ...
NULL

The number of input SNPs in the plink bim files are same as in the .gen file.

  rds <- bigsnpr::snp_readBGEN(
    bgenfiles="/mnt/mfs/hgrcgrid/CHR22_subset_edited_qctool.bgen",
    bgi_dir = "/mnt/mfs/hgrcgrid/sharedBGEN_conversion/",
    backingfile="/mnt/mfs/hgrcgrid/CHR22_subset_plink_SNPsupdated.rds",
    ncores = 8, list_snp_id=list(SNP_list)
  )

"/mnt/mfs/hgrcgrid/CHR22_subset_plink_SNPsupdated.rds" exists. I created it using:

file_snpbed<-paste0("/mnt/mfs/hgrcgrid/CHR22_subset_plink_SNPsupdated.bed")
snp_readBed(file_snpbed)
privefl commented 3 years ago

Maybe try switching alleles in the ids.

complexgenome commented 3 years ago

I don't know. The alleles are set right (A1 - minor in PLINK); also, in the VCF reference allele is shown correctly. Is there a pattern followed in SNP ids, for example: CHR_POS_REF_MINOR or CHR_POS_MINOR_REF?

I can read .rds data fine: obj.bigSNP <- snp_attach("/mnt/mfs/hgrcgrid/CHR22_subset_plink_SNPsupdated.rds")

head(obj.bigSNP$map)

chromosome       marker.ID genetic.dist physical.pos allele1 allele2
        22 22_16360794_T_C            0     16360794       C       T
        22 22_16362643_C_T            0     16362643       T       C

Data are loaded correctly with right number of SNPs dim(obj.bigSNP$map) [1] 108618 6

privefl commented 3 years ago

I use allele1 and allele2 after apparently, cf. https://github.com/privefl/bigsnpr/blob/master/R/read-bgen.R#L35.

privefl commented 3 years ago

Which is also what is specified in the doc, cf. https://github.com/privefl/bigsnpr/blob/master/R/read-bgen.R#L65.

complexgenome commented 3 years ago

Yes, https://privefl.github.io/bigsnpr/reference/snp_readBGEN.html Allele1 and allele2 are sort of ambiguous; there are certain tools that follow allele0 and allele1

I'll flip the allele and see what comes.

Thanks.

privefl commented 3 years ago

Often a0 is the same as a2. Normally, a1 is consistent.

complexgenome commented 3 years ago

I flipped alleles, when reading bgen file it throws error

Error in { :
  task 1 failed - "Some variants have not been found (stored in '/mnt/mfs/hgrcgrid/CHR22_subset_edited_qctool_not_found.rds')."

I looked into this file with failedsnps<-readRDS("/mnt/mfs/hgrcgrid/CHR22_subset_edited_qctool_not_found.rds")

Of 10K snps more than 9K SNPs failed for reason I am unable to understand at the moment.

After flipping .rds file looks like:

chromosome       marker.ID genetic.dist physical.pos allele1 allele2
         22 22_16360794_C_T            0     16360794       C       T
         22 22_16362643_T_C            0     16362643       T       C

I'm exhausted troubleshooting it. The data (plink) are loaded extremely fast, cranked really quick. But .bgen file format reading seems elusive, sorry. Thank you very much for your exhaustive assistance and support :)

privefl commented 3 years ago

The code to read the BGI file is very short, you can run the following if you don't know how to use debugonce().

db_con <- RSQLite::dbConnect(RSQLite::SQLite(), bgifile)
infos <- dplyr::collect(dplyr::tbl(db_con, "Variant"))
myid <- bigsnpr:::format_snp_id(with(infos, paste(chromosome, position, allele1, allele2, sep = "_")))
RSQLite::dbDisconnect(db_con)

myid is what the function tries to match with the vectors in list_snp_id. It should be easy to identify variants not matched and why.

privefl commented 3 years ago

Otherwise, you can send me the bgi file and the list of ids to pass. The bgi file should contain only information on the variants, no individual data.

complexgenome commented 3 years ago

I was able to get through this finally :). I've follow-up question on the bgen file read (genotypes). Should I open a new one here on this repository or put the question here? There's no comprehensive wiki or forum for the .bgen from the developers. :-| Thank you

privefl commented 3 years ago

You can open a new issue.

I'll see when I have time to make a tuto on this.

privefl commented 3 years ago

I've look into it, and I don't think there is much I could say more in a tutorial than what is in the README + documentation + examples code (I've just linked to more examples though). If you disagree, please me tell what information about reading BGEN files you think should be added.

complexgenome commented 3 years ago

I'll see when I have time to make a tuto on this.

Let me know if I can help on this as I've learned in this process so...

If you disagree, please me tell what information about reading BGEN files you think should be added.

There are few things I'd like to flesh out. I'll open appropriate issues. Cheers,