hpages / SNPlocsForge

Tools for building SNPlocs data packages
2 stars 0 forks source link

High proportion of non-biallelic SNPs in dbSNP 155 vs 144 #1

Closed Al-Murphy closed 1 year ago

Al-Murphy commented 2 years ago

We have noted a substantial increase the proportion of non-biallelic SNPs across dbSNP 144 and 155 (increased from ~1% to 24%). This is explained in detail here and summarised in the code below (fuicntion written by @hpages).

Note that this is based on the dbSNP versions in Bioconductor (SNPlocs.Hsapiens.dbSNP) so we would like to confirm this aligns with dbSNP versions outside of Biocnductor but there doesn't seem to be any numbers available online about it. The only piece of work about this is quite old now and doesn't give specific numbers SNPs that come in threes. It would be great if anyone has any insight on this.

#wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144.txt.gz"
#gunzip snp144.txt.gz
snps <- limma::read.columns("~/Downloads/snp144.txt", 
                            required.col=c("rs775809821"))$rs775809821
#missing header so add in SNP
snps <- c("rs775809821",snps)

library(BSgenome)
extractAlleles <- function(rsids, dbSNP, ref_genome="GRCh38")
{
  pkgname <- paste0("SNPlocs.Hsapiens.dbSNP", dbSNP, ".", ref_genome)
  pkgenv <- loadNamespace(pkgname)
  snplocs <- get(pkgname, envir=pkgenv, inherits=FALSE)
  snps <- snpsById(snplocs, rsids, ifnotfound="drop")
  unname(setNames(IUPAC_CODE_MAP[mcols(snps)$alleles_as_ambig],
                  mcols(snps)$RefSNP_id)[rsids])
}

dbSNP144_alleles <- extractAlleles(snps, 144)
dbSNP155_alleles <- extractAlleles(snps, 155)
alleles <- data.frame(rsid=snps, dbSNP144_alleles, dbSNP155_alleles)
alleles <- data.table::setDT(alleles)
#remove where alleles missing for either
alleles <- alleles[!(is.na(dbSNP144_alleles) | is.na(dbSNP155_alleles))]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)==2,
        match:="both biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)>2,
        match:="both non-biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)==2,
        match:="dbSNP 144 non-biallelic"]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)>2,
        match:="dbSNP 155 non-biallelic"]
summ <- alleles[,.N,by=match]
summ[,prop:=N/sum(summ$N)]
#> summ
#match         N         prop
#1:          both biallelic 100605258 0.7427438640
#2: dbSNP 155 non-biallelic  32266661 0.2382168183
#3:      both non-biallelic   2565775 0.0189424855
#4: dbSNP 144 non-biallelic     13116 0.0000968322

And just to visually represent them:

library(ggplot2)
library(cowplot)
pastel_cols <- c("#9A8822","#F5CDB4","#F8AFA8",
                 "#FDDDA0","#74A089","#85D4E3",
                 #added extra to make 7
                 '#78A2CC')
ggplot(summ, aes(x = factor(match,levels=rev(c("both biallelic",
                                           "both non-biallelic",
                                          "dbSNP 155 non-biallelic",
                                          "dbSNP 144 non-biallelic"))), y = N))+
  geom_bar(stat = "identity", fill = pastel_cols[c(2,3,2,2)])+
  geom_text(label = with(summ, paste(prettyNum(N,big.mark=",",scientific=FALSE), 
                                     paste0('\n (', round(prop*100,2), '%)'))), 
            hjust = -.3)+
  ylim(0,max(summ$N)+20000000)+
  coord_flip() + 
  theme_cowplot()+
  theme(axis.title.y=element_blank())

image

hpages commented 1 year ago

Is it ok to close this @Al-Murphy? IIUC this doesn't seem to be an issue with the latest SNPlocs packages.

Al-Murphy commented 1 year ago

that's all good with me Herve, thanks!