privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
183 stars 43 forks source link

Can I use the external hg38 genetic_map file directly instead of using snp_asGeneticPos for interpolation? #447

Closed zbjbiubiubiu closed 8 months ago

zbjbiubiubiu commented 9 months ago

Hello, My GWAS and plink files are both in hg38 format. Even if I use the rsid in snp_asGeneticPos for matching and interpolation to get the genetic distance, I still cannot use this function and will report the error task 1 failed - "'y' must be increasing or decreasing". I found from the issue of https://github.com/joepickrell/1000-genomes-genetic-maps/ that someone has provided the SNP genetic distance of hg38 (https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz), so I download this file and match my plink file with this file based on chr:postion, and directly use the Genetic_Map (cM) in the file as the genetic distance of the corresponding SNP in my plink file. Is this method feasible? In fact, I also ran through the LDpred2 process using this method, but I am not sure whether it is feasible and reliable to add genetic distance in this way?

privefl commented 9 months ago
info <- bigreadr::fread2("tmp-data/plinkfile_match_OMNI/plinkfile_match_OMNI.txt")

If you install the latest GitHub version (v1.12.5), you can do

library(bigsnpr)
genetic_map <- download_genetic_map("hg38_price", dir = "tmp-data")
snp_asGeneticPos2(info$chr, info$pos, genetic_map)

You can also directly download a genetic map at https://alkesgroup.broadinstitute.org/Eagle/downloads/tables/

genetic_map <- bigreadr::fread2(
  "https://storage.googleapis.com/broad-alkesgroup-public/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz", 
  select = c(1, 2, 4), col.names = c("chr", "pos", "pos_cM"))

snp_asGeneticPos2(info$chr, info$pos, genetic_map)
privefl commented 8 months ago

Any update on this?

zbjbiubiubiu commented 8 months ago

Yes, using your newly provided method, you can add genetic distance to each SNP, instead of only taking the intersection like I did before, thus discarding some SNPs. But after I use 'snp_asGeneticPos2(info$chr, info$pos, genetic_map)' to add the position to each SNP, when calculating the LD score later, an error 'address 0x7eecc4000000, cause 'memory not mapped' will be reported. I'm not sure if it's because there are too many SNPs (about 6.9 million SNPs, resulting in insufficient running memory), because I only use the SNPs and genetic distances obtained by taking the intersection and it can run normally. Anyway, this method is feasible, thank you for your timely update

zbjbiubiubiu commented 8 months ago

By the way, my error message is as follows:

caught segfault address 0x7eecc4000000, cause 'memory not mapped'

Traceback: 1: (if (symmetric) range_col_sym else range_col)(spmat@p, spmat@i) 2: .Object$initialize(...) 3: initialize(value, ...) 4: initialize(value, ...) 5: methods::new("SFBM_compact", spmat = spmat, backingfile = backingfile) 6: as_SFBM(corr0, tmp, compact = TRUE) An irrecoverable exception occurred. R is aborting now ... ./run.sh: line 1: 27865 Segmentation fault (core dumped) Rscript ./LDpred2_add_CM_by_linear_interpolation.r

This should not be a code problem. I guess it may be that my data is too large or a certain software package has a calculation overflow and needs to be updated. Because I use the intersection of my SNPs with the external hg38 genetic_map file, it works fine.

privefl commented 8 months ago

Could you open another issue to report this other problem please?