Closed complexgenome closed 3 years ago
You can look at some code I've used for my papers, e.g. https://github.com/privefl/paper-ldpred2/blob/master/code/prepare-genotypes.R.
But you should start by the looking at the documentation ?bigsnpr::snp_readBGEN
.
I've been planning on making a tutorial on how to import various formats into my bigSNP format, but there are things preventing me from doing this right now.
Thank you for the link, but I was about to update this
bim_file=data.table::fread("/mnt/mfs/hgrcgrid/CHR22_subset_plink.bim")
SNP_list<-bim_file$V2
file_snpbed<-paste0("/mnt/mfs/hgrcgrid/CHR22_subset_plink.bed")
snp_readBed(file_snpbed)
rds <- bigsnpr::snp_readBGEN( bgenfiles="/mnt/mfs/hgrcgrid/CHR22_subset_edited_qctool.bgen",
bgi_dir = "/mnt/mfs/hgrcgrid/",
backingfile="/mnt/mfs/hgrcgrid/CHR22_subset_plink.rds",
ncores = 8, list_snp_id=list(SNP_list) )
Error:
Error in { : task 1 failed - "Wrong format of some variants."
I looked into the .bgen file using
bgenix -g CHR22_subset_edited_qctool.bgen -vcf | head | cut -f 1-10
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT some_IID
16360794 22:16360794:T:C;22 T C . . . GT:GP 0/0:1,0,0
16362643 22:16362643:C:T;22 C T . . . GT:GP 0/0:1,0,0
16414031 22:16414031:G:A;22 G A . . . GT:GP 0/0:1,0,0
16441015 22:16441015:G:A;22 G A . . . GT:GP 0/0:1,0,0
16442761 22:16442761:A:G;22 A G . . . GT:GP 0/0:1,0,0
That is there is no CHR. I was more interested in the 0
prefix https://github.com/privefl/bigsnpr/issues/97#issuecomment-660943596 which isn't case
I don't know how to transform data further. Best,
How do you convert from IMPUTE2 format to BGEN? You could look into why the chromosome information is missing, and whether you could add it with some option/parameter.
I was able to fix CHR and SNP IDs from the IMPUTE2
output format. The VCF
files look convincing to me but the error persists when loading with snp_readBGEN
bgenix -g CHR22_subset_edited_qctool.bgen -vcf | head | cut -f 1-10
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT IID_G32868
22 16360794 22:16360794:T:C T C . . . GT:GP 0/0:1,0,0
22 16362643 22:16362643:C:T C T . . . GT:GP 0/0:1,0,0
22 16414031 22:16414031:G:A G A . . . GT:GP 0/0:1,0,0
22 16441015 22:16441015:G:A G A . . . GT:GP 0/0:1,0,0
22 16442761 22:16442761:A:G A G . . . GT:GP 0/0:1,0,0
22 16514102 22:16514102:C:A C A . . . GT:GP 0/0:1,0,0
22 16514234 22:16514234:A:C A C . . . GT:GP 0/0:1,0,0
22 16518108 22:16518108:G:A G A . . . GT:GP 0/0:1,0,0
Code to run bgen file:
rds <- bigsnpr::snp_readBGEN(
bgenfiles="/mnt/mfs/hgrcgrid/BGEN_conversion/CHR22_subset_edited_qctool.bgen",
bgi_dir = "/mnt/mfs/hgrcgrid/BGEN_conversion/",
backingfile="/mnt/mfs/hgrcgrid/CHR22_subset_plink.rds",
ncores = 8, list_snp_id=list(SNP_list)
)
Error: Error in { : task 1 failed - "Wrong format of some variants."
Please let me know if any other details are required.
Thanks,
This error is not with the bgi file, but with what you input as list_snp_id
.
You can e.g. show the output of str(list_snp_id)
.
Please see below output:
> str(SNP_list)
chr [1:108618] "22:16360794:T:C" "22:16362643:C:T" "22:16414031:G:A" ...
> str(list(SNP_list))
List of 1
$ : chr [1:108618] "22:16360794:T:C" "22:16362643:C:T" "22:16414031:G:A" "22:16441015:G:A" ...
I'm working with only one CHR. I think data are messing them up chr [1:108618]
?
The separator is _
, not :
.
Please see the doc.
OK, edited completely all SNPs, bgen, plink files, still error
task 1 failed - "Some variants have not been found (stored in /mnt/mfs/hgrcgrid//CHR22_subset_edited_qctool_not_found.rds')."
the file name definitely isn't there /mnt/mfs/hgrcgrid//CHR22_subset_editedqctoolnot_found.rds' Probably it mixed pasting error message.
head(str(SNP_list))
chr [1:108618] "22_16360794_T_C" "22_16362643_C_T" "22_16414031_G_A" ...
NULL
The number of input SNPs in the plink bim files are same as in the .gen file.
rds <- bigsnpr::snp_readBGEN(
bgenfiles="/mnt/mfs/hgrcgrid/CHR22_subset_edited_qctool.bgen",
bgi_dir = "/mnt/mfs/hgrcgrid/sharedBGEN_conversion/",
backingfile="/mnt/mfs/hgrcgrid/CHR22_subset_plink_SNPsupdated.rds",
ncores = 8, list_snp_id=list(SNP_list)
)
"/mnt/mfs/hgrcgrid/CHR22_subset_plink_SNPsupdated.rds"
exists. I created it using:
file_snpbed<-paste0("/mnt/mfs/hgrcgrid/CHR22_subset_plink_SNPsupdated.bed")
snp_readBed(file_snpbed)
Maybe try switching alleles in the ids.
I don't know. The alleles are set right (A1 - minor in PLINK); also, in the VCF reference allele is shown correctly. Is there a pattern followed in SNP ids, for example: CHR_POS_REF_MINOR
or CHR_POS_MINOR_REF
?
I can read .rds
data fine:
obj.bigSNP <- snp_attach("/mnt/mfs/hgrcgrid/CHR22_subset_plink_SNPsupdated.rds")
head(obj.bigSNP$map)
chromosome marker.ID genetic.dist physical.pos allele1 allele2
22 22_16360794_T_C 0 16360794 C T
22 22_16362643_C_T 0 16362643 T C
Data are loaded correctly with right number of SNPs dim(obj.bigSNP$map)
[1] 108618 6
I use allele1
and allele2
after apparently, cf. https://github.com/privefl/bigsnpr/blob/master/R/read-bgen.R#L35.
Which is also what is specified in the doc, cf. https://github.com/privefl/bigsnpr/blob/master/R/read-bgen.R#L65.
Yes, https://privefl.github.io/bigsnpr/reference/snp_readBGEN.html
Allele1
and allele2
are sort of ambiguous; there are certain tools that follow allele0
and allele1
I'll flip the allele and see what comes.
Thanks.
Often a0
is the same as a2
. Normally, a1
is consistent.
I flipped alleles, when reading bgen file it throws error
Error in { :
task 1 failed - "Some variants have not been found (stored in '/mnt/mfs/hgrcgrid/CHR22_subset_edited_qctool_not_found.rds')."
I looked into this file with
failedsnps<-readRDS("/mnt/mfs/hgrcgrid/CHR22_subset_edited_qctool_not_found.rds")
Of 10K snps more than 9K SNPs failed for reason I am unable to understand at the moment.
After flipping .rds
file looks like:
chromosome marker.ID genetic.dist physical.pos allele1 allele2
22 22_16360794_C_T 0 16360794 C T
22 22_16362643_T_C 0 16362643 T C
I'm exhausted troubleshooting it. The data (plink
) are loaded extremely fast, cranked really quick. But .bgen
file format reading seems elusive, sorry.
Thank you very much for your exhaustive assistance and support :)
The code to read the BGI file is very short, you can run the following if you don't know how to use debugonce()
.
db_con <- RSQLite::dbConnect(RSQLite::SQLite(), bgifile)
infos <- dplyr::collect(dplyr::tbl(db_con, "Variant"))
myid <- bigsnpr:::format_snp_id(with(infos, paste(chromosome, position, allele1, allele2, sep = "_")))
RSQLite::dbDisconnect(db_con)
myid
is what the function tries to match with the vectors in list_snp_id
.
It should be easy to identify variants not matched and why.
Otherwise, you can send me the bgi file and the list of ids to pass. The bgi file should contain only information on the variants, no individual data.
I was able to get through this finally :). I've follow-up question on the bgen file read (genotypes). Should I open a new one here on this repository or put the question here? There's no comprehensive wiki or forum for the .bgen from the developers. :-| Thank you
You can open a new issue.
I'll see when I have time to make a tuto on this.
I've look into it, and I don't think there is much I could say more in a tutorial than what is in the README + documentation + examples code (I've just linked to more examples though). If you disagree, please me tell what information about reading BGEN files you think should be added.
I'll see when I have time to make a tuto on this.
Let me know if I can help on this as I've learned in this process so...
If you disagree, please me tell what information about reading BGEN files you think should be added.
There are few things I'd like to flesh out. I'll open appropriate issues. Cheers,
hi devs,
Thanks much for the amazing tool, it is too complex for me thus opening an issue.
I started with plink files with loads of NA https://github.com/privefl/bigsnpr/issues/124 Then I removed any SNP until with any missingness. This led me to h2 error that everything is negative.
Error: 'h2' should have only positive values.
I'd like to work with the
dosage/imputed data
frim IMPUT2 tool. I created.bgen
files, their index per CHR. However, I do not know how to make a backingfile for it.Error:
https://github.com/privefl/bigsnpr/issues/34
I'd like to work with all the SNPs in the bgen. They follow format:
CHR:POS:A1:A2
that is1:1000:C:T
where A2 is effect allele. This brings me to the paramterlist_snp_id
. What does it need exactly?How do I proceed?