privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Loading Chromosome X data (nonPAR) using bigsnpr #479

Closed: JasperHof closed this issue 3 months ago

JasperHof commented 4 months ago

Hi Florian,

I am trying to use bigsnpr for ChrX data (non-PAR region). However, when I try to load the data, the snp_readBGEN function fails with an error that I have not seen reported here before: "terminate called after throwing an instance of 'Rcpp::exception' what(): Only 2 alleles allowed."

I am able to load other information on the genotype data (see picture). The 'infos' dataframe suggests that all SNPs are biallelic, but still I get the error that only 2 alleles are allowed.

Does this error ring a bell, or do you have suggestions for possible solutions? Thanks in advance! Best,

Jasper

privefl commented 4 months ago

I've never seen this error before, but I do remember putting this assertion in the code. From what I remember, the BGEN format allows for storing multi-allelic variants, which my code does not handle. Is it the UKBB data? If so, please send a few lines of code for me to try to reproduce this. Or simply tell me which of these 6 variants causes the issue.
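As a side check that does not depend on bigsnpr, the allele count that a BGEN file declares for each variant can be read directly from the file. The following is a minimal Python sketch, assuming a layout-2 BGEN file as described in the BGEN v1.2 specification. The function name `bgen_allele_counts` is hypothetical, and the sketch reads only the variant identifying data; it does not decode genotype probabilities.

```python
import struct


def _read_prefixed(f, fmt):
    """Read a length-prefixed string; fmt is the struct format of the length field."""
    (length,) = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    return f.read(length).decode("utf-8", errors="replace")


def bgen_allele_counts(path):
    """Return (variant_id, rsid, chrom, pos, alleles) for each variant.

    Assumes a layout-2 BGEN file; raises ValueError for anything else.
    """
    variants = []
    with open(path, "rb") as f:
        # First 4 bytes: offset of the first variant block, counted from byte 4.
        (offset,) = struct.unpack("<I", f.read(4))
        # Header block: its own length, number of variants, number of samples.
        header_len, n_variants, _n_samples = struct.unpack("<III", f.read(12))
        magic = f.read(4)
        if magic not in (b"bgen", b"\x00\x00\x00\x00"):
            raise ValueError("not a BGEN file (unexpected magic bytes)")
        # The flags are the last 4 bytes of the header block.
        f.seek(4 + header_len - 4)
        (flags,) = struct.unpack("<I", f.read(4))
        layout = (flags >> 2) & 0xF
        if layout != 2:
            raise ValueError(f"this sketch only handles layout 2, got {layout}")

        f.seek(4 + offset)
        for _ in range(n_variants):
            variant_id = _read_prefixed(f, "<H")
            rsid = _read_prefixed(f, "<H")
            chrom = _read_prefixed(f, "<H")
            # Position (4 bytes) followed by the number of alleles K (2 bytes).
            pos, n_alleles = struct.unpack("<IH", f.read(6))
            alleles = [_read_prefixed(f, "<I") for _ in range(n_alleles)]
            # Skip the genotype data block: its length in bytes comes first.
            (block_len,) = struct.unpack("<I", f.read(4))
            f.seek(block_len, 1)
            variants.append((variant_id, rsid, chrom, pos, alleles))
    return variants
```

Filtering the result with `len(alleles) != 2` would list the variants whose declared allele count differs from 2, which may help narrow down which variant triggers the assertion.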

JasperHof commented 4 months ago

Hi Florian, Thanks for the fast reply. This is not UKBB data, but data from our own cohort. I am able to load the first SNP, but none of the other SNPs. Best, Jasper

privefl commented 4 months ago

Not sure I can help if I cannot reproduce the issue. Do you have the same problem on autosomes or is it just a problem with chrX?

JasperHof commented 4 months ago

Hi Florian,

It is only a problem with the chrX data, specifically the nonPAR region (the PAR region works fine).

My genotype data was originally imputed to VCF format, which I have converted to .bgen using qctool. I have already tried different qctool settings to see whether that would solve the problem, but so far that has not helped. It could be that the problem lies in my genotype data, not in the bigsnpr software.

Best,

Jasper

privefl commented 4 months ago

Do you also have the UKBB data, so that you can check whether this happens there as well?

JasperHof commented 4 months ago

No, unfortunately I do not; I am only working with our own data. I do not know whether this problem also occurs with UKBB data.

privefl commented 4 months ago

I do not have the X chromosome at the moment; I'll ask to download it and try quickly on it. Are your genome positions in the GRCh37 / hg19 genome build?

JasperHof commented 4 months ago

Hi Florian, Thanks for your efforts! My genotype data is hg38. Best, Jasper

privefl commented 3 months ago

I've finally managed to download the chrX UKBB data. I've tried both https://www.ncbi.nlm.nih.gov/snp/rs28579419 (the 2nd one) and https://www.ncbi.nlm.nih.gov/snp/rs60075487 (the 3rd one), and I got no error when reading from the UKBB data.

Maybe there is an option in qctool to split multi-allelic variants?
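Since the BGEN file in this thread was converted from VCF, one way to see whether multi-allelic records are present before conversion is to scan the ALT column of the source VCF, where alternate alleles are comma-separated. The following is a minimal Python sketch under that assumption. The function name `multiallelic_records` is hypothetical, and it only reports such records; it does not split them.

```python
import gzip


def multiallelic_records(vcf_path):
    """Yield (chrom, pos, ref, alts) for VCF records with more than one ALT allele."""
    # Plain-text and gzip-compressed VCF files are both opened in text mode.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header line
            fields = line.rstrip("\n").split("\t", 5)
            if len(fields) < 5:
                continue  # skip malformed or empty lines
            chrom, pos, _variant_id, ref, alt = fields[:5]
            alts = alt.split(",")
            if len(alts) > 1:
                yield chrom, int(pos), ref, alts
```

An empty result would suggest that the source VCF contains only biallelic records, in which case the cause of the error would have to lie elsewhere in the conversion.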

JasperHof commented 3 months ago

Hi Florian,

Thanks a lot for trying on UKBB data, and good to hear it works there. I do not think the problem lies in multi-allelic variants, because I was also unable to load some single biallelic variants. I think something else may have gone wrong in my conversion process to the .bgen format, which is causing the problems.

For now I have found another way to get around this problem. Thanks again!

Best,

Jasper

privefl commented 3 months ago

Should we close this issue or keep it open?

JasperHof commented 3 months ago

For me it's good to close!