rwdavies / QUILT


GRCh38 genetic map #7

Open SABiagini opened 2 years ago

SABiagini commented 2 years ago

Hi, I downloaded the genetic map for GRCh38 used in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7059836/ This genetic map is the one provided by Eagle and can be found here: https://alkesgroup.broadinstitute.org/Eagle/downloads/tables/

When I compared their chromosome 20 with the one provided in the example from QUILT, I found some differences that I can't explain.

These are the first 5 lines of the file you provided with QUILT (CEU-chr20-final.b38.txt.gz):

position COMBINED_rate.cM.Mb. Genetic_Map.cM.
82590 6.80436893751158 0
82603 6.8056503043227 0.0000884567961876505
83158 6.81470108539 0.00386559271508675
88108 6.83273214455531 0.0375983630877673
88453 6.83536710777191 0.0399556556776388

And these are the first 5 lines of the file I downloaded:

chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
20 81154 0.0000000000 0.000000000000000
20 82590 0.3500036909 0.000502605300132
20 82603 0.3494018702 0.000507147524445
20 83158 0.3501262382 0.000701467586646
20 83509 0.3558643956 0.000826375989502

Focusing on the positions they have in common (e.g., 82590, 82603, and 83158), it is evident that both COMBINED_rate and Genetic_Map differ between the two files. I wonder what causes these differences, and whether both maps are valid.

That said, if the map you provided with QUILT is the best option, where could I find the equivalent for the entire genome?

Furthermore, since I'm using the HRC panel as reference, would you suggest using a more inclusive genetic map (i.e., one including positions from multiple populations)? I wonder whether an inconsistency between the genetic map and the reference panel could represent an issue.

Thank you

rwdavies commented 2 years ago

Hi,

I won't claim the genetic map I provided with the QUILT example package is the best one. There are several genetic maps out there built from different data in different ways that might be more appropriate for any given analysis.

The raw data for the one I provided came from CEU_omni_recombination_20130507.tar, available from e.g. https://www.internationalgenome.org/data-portal/search?q=recombination, specifically ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/CEU_omni_recombination_20130507.tar

I then ran liftOver (not entirely sure which version, md5sum 0bec21ac6843b96284ff976953da0a54) using hg19ToHg38.over.chain.gz (md5sum 35887f73fe5e2231656504d1f6430900). To create the above formatted file, I had to fill in a few regions, which I did with the R code copied and pasted at the bottom of this message (I can't seem to attach a file to this message?). It's not a well-written script, but if you can't reproduce it by downloading things and re-configuring paths, I can clean it up if you want.

About the differences between the two files: yes, they look pronounced at the start of the file, but hopefully those differences settle down further along. More on that in a minute.

One thing: I can't vouch for the file Alkes provided. I'm not sure I've ever seen a formal definition for these types of files; I've always used the format defined by files that other people provided. Here, if we look at the top of the CEU file for chromosome 20 from the above tarball, we see

Position(bp) Rate(cM/Mb) Map(cM) Filtered
63231 6.7306764141 0.0 0
63244 6.77932456662 8.74992285188e-05 0

Let g be the genetic map with 3 columns (position, rate in cM/Mb, cumulative cM). In 1-based coordinates, we should have that

g[iRow, 3] = g[iRow - 1, 3] + (g[iRow, 1] - g[iRow - 1, 1]) / 1e6 * g[iRow - 1, 2]

and indeed for iRow = 2 we have 8.74992285188e-05 ≈ 0 + (63244 - 63231) / 1e6 * 6.7306764141

Now for the Alkes file we have

chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
20 81154 0.0000000000 0.000000000000000
20 82590 0.3500036909 0.000502605300132
20 82603 0.3494018702 0.000507147524445
20 83158 0.3501262382 0.000701467586646
20 83509 0.3558643956 0.000826375989502

Looking at the numeric columns at iRow = 2, we should see

0.000502605300132 = 0.000000000000000 + (82590 - 81154) / 1e6 * 0.0000000000

but this obviously doesn't hold, because the RHS is 0. We do, however, see

0.000502605300132 = 0.000000000000000 + (82590 - 81154) / 1e6 * 0.3500036909

so I think their columns are off by one: each rate appears to describe the interval ending, rather than starting, at that position.
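To make that concrete, here's a minimal sketch (my own, not a STITCH or QUILT function) that tests the recurrence on a 3-column map, using the rows quoted above:

## check that column 3 satisfies
## g[i, 3] = g[i - 1, 3] + (g[i, 1] - g[i - 1, 1]) / 1e6 * g[i - 1, 2]
check_map <- function(g) {
    n <- nrow(g)
    expected <- g[-n, 3] + diff(g[, 1]) / 1e6 * g[-n, 2]
    max(abs(expected - g[-1, 3]))
}

## first two rows of the CEU tarball file: recurrence holds (~4e-10)
ceu <- rbind(c(63231, 6.7306764141, 0),
             c(63244, 6.77932456662, 8.74992285188e-05))
check_map(ceu)

## first three rows of the Alkes file (numeric columns only): it fails,
## consistent with each rate being stored one row later than this format expects
alkes <- rbind(c(81154, 0.0000000000, 0.000000000000000),
               c(82590, 0.3500036909, 0.000502605300132),
               c(82603, 0.3494018702, 0.000507147524445))
check_map(alkes) ## ~5e-4, large relative to the cM values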

Anyway, ignoring that discrepancy, if we go only by the provided genetic map column:

L <- seq(1e6, 10e6, 10000)
robbie <- STITCH::load_validate_and_match_genetic_map(
    genetic_map_file = "/well/davies/shared/recomb/CEU/CEU-chr20-final.b38.txt.gz",
    L = L,
    expRate = 0.5
)
## subset the genome-wide Alkes map to chromosome 20 and drop the chr column
system(paste0("gunzip -c /well/davies/shared/recomb/CEU/alkes/genetic_map_hg38_withX.txt.gz | awk '{if(NR == 1 || $1 == \"20\") {print $0}}' | cut -f2-4 --delimiter=' ' | gzip -1 > /well/davies/shared/recomb/CEU/alkes/genetic_map_hg38_withX.chr20.txt.gz"))
alkes <- match_genetic_map_to_L(
    genetic_map = read.table("/well/davies/shared/recomb/CEU/alkes/genetic_map_hg38_withX.chr20.txt.gz", header = TRUE),
    L = L,
    expRate = 0.5
)
cor(diff(robbie), diff(alkes)) ** 2 ## 0.65

Using functions from https://github.com/rwdavies/STITCH/blob/master/STITCH/R/genetic-map.R, the above shows that the genetic distance between sites 10 kbp apart is correlated with r2 = 0.65 between the two files, so I would say they're more or less interchangeable.

About your last comment: in general, I think a more inclusive map would be a great idea. I was a bit lazy in the QUILT paper and just used a single map. An African American map might be a good idea, to get a combination of European and African PRDM9 hotspots. Otherwise, I'm not aware of a best "global" map.

Otherwise, I don't think an inconsistency between a map and a reference panel would be an issue, as long as everything is on the same genome build.

Hope that helps, Robbie

library("STITCH")
proj_home <- "~/proj/"
stitch_private_home <- file.path(proj_home, "STITCH/")
dir <- file.path(stitch_private_home, "STITCH", "R")
files <- dir(dir)
a <- grep("~", files)
if (length(a) > 0) {
    files <- files[-a]
}
for(file in files) {source(file.path(dir, file))}

DOWNLOAD_CACHE_DIR <- Sys.getenv("DOWNLOAD_CACHE_DIR")

panel <- "CEU"
## note: chr read here is overridden by the for loop over chromosomes below
chr <- commandArgs(trailingOnly = TRUE)[1]

if (1 == 0) {

    DOWNLOAD_CACHE_DIR <- "/data/smew1/rdavies/external/"
    chr <- "20"

}

recomb_dir <- file.path(DOWNLOAD_CACHE_DIR, "recomb")
dir.create(recomb_dir)
setwd(recomb_dir)

liftOver <- file.path(recomb_dir, "liftOver")
chain <- file.path(recomb_dir, "hg19ToHg38.over.chain.gz")

if (1 == 0) {
    system("wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver")
    system("wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz")
    ## this should be the CEU tarball, to match panel <- "CEU" above
    system("wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/CEU_omni_recombination_20130507.tar")
    system("tar -xvf CEU_omni_recombination_20130507.tar")
}

for(chr in 1:22) {

##
## step 1 - in R, load in rates, output for liftOver
##      
ratesOri <- read.table(
    paste0(recomb_dir, "/", panel, "/", panel, "-", chr, "-final.txt.gz"),
    header = TRUE
)
n <- dim(ratesOri)[1]

## so for example, if row 1 = (30, 0.05, 1), then row 2 = (40, X, 1 + (40 - 30) * 0.05 / 1e6)
## interval 1 is 30-39 inclusive, i.e. 30-40 in bed format
## the rate for each interval comes from its left endpoint
out <- cbind(
    paste("chr",chr,sep=""),
    ratesOri[-n,1],
    ratesOri[-1,1],
    ratesOri[-n,2],"X","+"
)
##
reformated_file <- paste0(recomb_dir, "/", panel, "/", panel,"-", chr, "_cleaned_cm.hapmapFormat.forLiftOver.txt")
write.table(
    out,
    file = reformated_file,
    row.names = FALSE,
    col.names = FALSE,
    quote = FALSE
)
out_stem <- paste0(recomb_dir, "/", panel, "/", panel, "-", chr, "_b38")

##
## step 2 - run liftOver
##
system(
    paste0(
        liftOver, " ",
        reformated_file, " ",
        chain, " ",
        out_stem, ".txt ",
        out_stem, ".unmapped.txt "
    )
)
system(paste0("gzip -f -1 ", out_stem, ".txt"))
system(paste0("gzip -f -1 ", out_stem, ".unmapped.txt"))

##
## step 3 - read in, fix
##
new <- read.table(paste0(out_stem, ".txt.gz"))
unmapped_file <- paste0(out_stem, ".unmapped.txt.gz")
a <- as.numeric(system(paste("gunzip -c ", unmapped_file, "| wc -l ",sep=""),intern=TRUE))
if (a > 0) {
    un <- read.table(unmapped_file)
} else { 
    un <- NULL
}

## out is what we sent into liftOver in hg19 coordinates - now subset to match "new" in hg38
if (is.null(un) == FALSE) {
    rates <- out[-match(un[, 2], out[, 2]), ]
} else {
    rates <- out ## nothing unmapped, so nothing to drop
}
## rates, the original, is now the same size as new
## remove an interval if it changed size during liftOver
oldIntSize <- as.integer(rates[, 3]) - as.integer(rates[, 2])
newIntSize <- as.integer(new[, 3]) - as.integer(new[, 2])
## now work with a new matrix - with the hg38 coordinates
ratesCur <- new[oldIntSize == newIntSize & new[, 1] == paste("chr", chr, sep = ""), ]
## now order by their location. remove if subsequent intervals overlap
ratesCur <- ratesCur[order(ratesCur[,2]),]
n <- dim(ratesCur)[1]    
x <- ratesCur[-n,3]-ratesCur[-1,2]
ratesCur <- ratesCur[c(TRUE,x<=0),]
# now - we have a matrix, ratesCur
# we have the rates in column 4. strand shouldn't matter
# there may be "gaps" - fill them in
n=dim(ratesCur)[1]    
x=ratesCur[-n,3]-ratesCur[-1,2]
whichX=(1:length(x))[x<0]    
# now - x<0 implies there is a jump
# for each of these, get average over 50 kb before, after

##
## where there is a gap in the new matrix, fill it in using the average
## rate over 25 kb ahead and 25 kb behind
##
addOn <- cbind(as.character(ratesCur[1,1]),ratesCur[c(x<0,FALSE),3],ratesCur[c(FALSE,x<0),2],0.5,"X","+")
##
## now fill in using 25000 bp before and 25000 bp after
for(i in seq_len(dim(addOn)[1])) { ## seq_len guards against the case of no gaps
    startOfInterval=ratesCur[whichX[i],2]
    endOfInterval=ratesCur[whichX[i],3]
    ## get rate before
    ## we will do this with a while loop - keep adding until done
    j=whichX[i]-1 # interval before the gap
    rateSum=0
    toAdd=25000 # want 25 kb on this side (50 kb in total across both sides)
    while(toAdd>0) {
        if(j==1) {
            ## ran out of intervals - average over what we accumulated so far
            averageBefore=rateSum/(25000-toAdd)
            toAdd <- -1
        } else if (j == 0) {
            ## gap is at the very start - fall back to a default rate
            averageBefore <- 1
            toAdd <- -1
        } else {
            len=ratesCur[j,3]-ratesCur[j,2]
            if((toAdd-len)>0) # add whole thing
            {
                rateSum=rateSum+len*ratesCur[j,4]
                toAdd=toAdd-len
            } else {
                ## add only the remaining fraction of this interval
                rateSum=rateSum+toAdd*ratesCur[j,4]
                toAdd=-1
                averageBefore=rateSum/25000
            }
        }
        j=j-1
    }
    ## get rate after
    j=whichX[i]+1 # interval after the gap
    rateSum=0
    toAdd=25000 # want 25 kb on this side
    while(toAdd>0) {
        len=ratesCur[j,3]-ratesCur[j,2]
        if((toAdd-len)>0) # add whole thing
        {
            rateSum=rateSum+len*ratesCur[j,4]
            toAdd=toAdd-len
        } else {
            ## add only the remaining fraction of this interval
            rateSum=rateSum+toAdd*ratesCur[j,4]
            toAdd=-1
            averageAfter=rateSum/25000
        }
        j=j+1
        if(j>dim(ratesCur)[1]) {
            ## ran past the last interval - average over what we accumulated
            averageAfter=rateSum/(25000-toAdd)
            toAdd=-1
        }
    }
    val <- 0.5*averageBefore + 0.5*averageAfter
    if (is.na(val)) {
        print("had to reset a value")
        val <- 0.1
    }
    addOn[i,4]=val
}

## add the filler intervals (spanning from the end of each gap's left interval to the start of its right interval)
ratesCur=rbind(ratesCur,addOn)
for(i in 2:4) {
    ratesCur[,i]=as.numeric(as.character(ratesCur[,i]))
}
## now check - there should be no discontinuities
ratesCur=ratesCur[order(ratesCur[,2]),]
n=dim(ratesCur)[1]    
x=ratesCur[-n,3]-ratesCur[-1,2]
print(c(chr,"proper build",table(x)))
## now build normal rate map
n=dim(ratesCur)[1]    
recomb=array(0,c(n+1,3))
colnames(recomb)=c("position","COMBINED_rate.cM.Mb.","Genetic_Map.cM.")

## FIXED bug in v1.1
## rates were off by 1 but cumulative sums were correct
recomb[,1]=c(ratesCur[,2],ratesCur[n,3])
recomb[,2]=c(ratesCur[,4],0)
## now make the cumulative sum (recomb has n+1 rows, so drop row n+1, not row n)
recomb[-1,3]=recomb[-(n+1),2] * (recomb[-1,1] - recomb[-(n+1),1])/1000000
recomb[,3]=cumsum(recomb[,3])
## optional: plot and inspect visually, e.g.
## f <- function(name, recomb) {
##     jpeg(paste0("rate.", name, ".chr", chr, ".jpg"))
##     plot(recomb[, 1], recomb[, 3], type = "l")
##     dev.off()
## }
## f("ori", ratesOri)
## f("new", recomb)

    if (sum(is.na(recomb)) > 0) {
        stop("There are NA in the recomb!")
    }

##
## new - smooth!
##
library("STITCH")
## smooth the per-interval rates with a 2 kb radius kernel,
## then rebuild the cumulative cM column from the smoothed rates
smoothed_rate <- c(rcpp_make_smoothed_rate(
    sigma_rate = recomb[, "COMBINED_rate.cM.Mb."],
    L_grid = recomb[, "position"],
    shuffle_bin_radius = 2000
), 0)
recomb[, "COMBINED_rate.cM.Mb."] <- smoothed_rate
recomb <- fill_in_genetic_map_cm_column(recomb)

## can now output!
## write to table
recomb[,1] <- as.integer(recomb[,1])
options("scipen"=100, "digits"=4)
out_file <- paste0(recomb_dir, "/", panel, "/", panel, "-chr", chr, "-final.b38.txt")
write.table(
    recomb,
    file = out_file,
    row.names=FALSE,col.names=TRUE,quote=FALSE
)

## gzip
system(paste0("gzip -1 -f ", out_file))
## print output as well
print(paste("chr",chr,"pos range - ori",range(ratesOri[,1])[1],range(ratesOri[,1])[2],"new",range(recomb[,1])[1],range(recomb[,1])[2]))
print(paste("chr",chr,"rate range - ori",range(ratesOri[,2])[1],range(ratesOri[,2])[2],"new",range(recomb[,2])[1],range(recomb[,2])[2]))
print(paste("chr",chr,"cumulative range - ori",range(ratesOri[,3])[1],range(ratesOri[,3])[2],"new",range(recomb[,3])[1],range(recomb[,3])[2]))

unlink(paste0(out_stem, ".txt.gz"))
unlink(paste0(out_stem, ".unmappped.txt.gz"))
unlink(reformated_file)

    }
rwdavies commented 2 years ago

I got an email about this including this error message

[2021-10-22 20:25:35] Getting and validating genetic map
Error in FUN(left, right) : non-numeric argument to binary operator
Calls: QUILT ... validate_genetic_map -> Ops.data.frame -> eval -> eval
Execution halted

but I don't see it here on GitHub, which is weird; I'm wondering if you deleted it?

In any case, if this is still a problem: I tried to check whether I could load all of the modified Alkes chromosomes, and this ran without issue (I loaded all the STITCH functions as described above before running the below). So perhaps there is a weird error in your map-fixing code? I think that's where the error occurred in what you ran, given the error messages that were printed to screen.

get_and_validate_genetic_map <- function(genetic_map_file) {
    genetic_map <- as.matrix(read.table(genetic_map_file, header = TRUE))
    validate_genetic_map(genetic_map)
    return(genetic_map)
}
for(chr in 1:22) {
    message(paste0(chr, ", ", date()))
    system(paste0("gunzip -c /well/davies/shared/recomb/CEU/alkes/genetic_map_hg38_withX.txt.gz | awk '{if(NR == 1 || $1 == \"", chr, "\") {print $0}}' | cut -f2-4 --delimiter=' ' | gzip -1 > /well/davies/shared/recomb/CEU/alkes/genetic_map_hg38_withX.chr.txt.gz"))
    message("fix")
    temp <- as.matrix(read.table("/well/davies/shared/recomb/CEU/alkes/genetic_map_hg38_withX.chr.txt.gz", header = TRUE))
    temp[, 3] <- fill_in_genetic_map_cm_column(temp)[, 3]
    write.table(temp, paste0("/well/davies/shared/recomb/CEU/alkes/genetic_map_hg38_withX.chr.fixed.txt"))
    system(paste0("gzip -1 -f /well/davies/shared/recomb/CEU/alkes/genetic_map_hg38_withX.chr.fixed.txt"))
    message("try to read in")
    genetic_map <- get_and_validate_genetic_map(paste0("/well/davies/shared/recomb/CEU/alkes/genetic_map_hg38_withX.chr.fixed.txt.gz"))
}
SABiagini commented 2 years ago

Dear Robert,

Sorry for the late response. Indeed, I deleted my message because I eventually found a solution and didn't want to bother you any further. There was a really silly problem in my script for recalculating the 3rd column of the genetic map files.

Eventually, I used the Alkes files after shifting the rate column by one row and recalculating the 3rd column using the formula g[iRow, 3] = g[iRow - 1, 3] + ((g[iRow, 1] - g[iRow - 1, 1]) / 1e6 * g[iRow - 1, 2]).
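In case it's useful to others, here is a minimal sketch of that fix (the column indices assume the 4-column Alkes layout; this isn't my exact script):

fix_alkes_map <- function(g) {
    ## g: the Alkes map for one chromosome, columns chr, position,
    ## rate (cM/Mb), cumulative cM
    n <- nrow(g)
    ## move each rate up one row, so it describes the interval starting
    ## (not ending) at that position; pad the last row with 0
    g[, 3] <- c(g[-1, 3], 0)
    ## rebuild the cumulative cM column with the recurrence above
    g[, 4] <- cumsum(c(0, diff(g[, 2]) / 1e6 * g[-n, 3]))
    g
}

fixed <- fix_alkes_map(read.table("genetic_map_hg38_withX.chr20.txt.gz", header = TRUE))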

Now it seems to be working and, comparing the results for chr20 using the Alkes map file and the one you provided with QUILT, they are basically the same.

I would like to ask just one more thing, related to the QC filters: in Liu et al. 2018, where STITCH was used, they applied a filter on the info score (greater than 0.4) and on the HWE p-value (smaller than 10^-6). I applied the same filters after QUILT and detected a slight increase in accuracy (accuracy then truly increases when a further filter on the GP scores is applied). Do you think this is a good strategy? Are there any other QC filters you would suggest?

Thank you very much for your support.

Best, Simone

rwdavies commented 2 years ago

Great that you solved the problem already.

re: the filters. In my own experience, the INFO score is usually far and away the most useful filter. With STITCH (I don't remember checking this with QUILT, but I expect it to hold), the method is very well calibrated, i.e. the posterior genotype probabilities are very accurate when assessing their calibration over many samples and SNPs. So the INFO score, which measures the spread of the posterior genotype probabilities, captures nearly all the information about how confident the imputation is, and hence how likely it is to be correct. Other things I've checked, like HWE, aren't nearly as informative (though practically they make a lot of sense: a SNP imputed with an HWE p-value of 1x10^(-200) probably isn't one you want to analyze).
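For reference, here is a minimal sketch of the IMPUTE-style INFO score computed from posterior genotype probabilities; STITCH and QUILT compute something along these lines, but check their source for the exact definition used:

## IMPUTE-style INFO score at one SNP, from an N x 3 matrix of posterior
## genotype probabilities P(g = 0), P(g = 1), P(g = 2)
info_score <- function(gp) {
    e <- gp[, 2] + 2 * gp[, 3]       ## expected dosage E[g]
    f <- gp[, 2] + 4 * gp[, 3]       ## second moment E[g^2]
    theta <- sum(e) / (2 * nrow(gp)) ## estimated allele frequency
    if (theta == 0 || theta == 1) return(1)
    1 - sum(f - e ^ 2) / (2 * nrow(gp) * theta * (1 - theta))
}

## fully confident genotypes give INFO = 1 ...
info_score(matrix(c(1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0), ncol = 3, byrow = TRUE))
## ... while completely flat posteriors at the same frequency give INFO = 0
info_score(matrix(rep(c(0.25, 0.5, 0.25), 4), ncol = 3, byrow = TRUE))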

Otherwise nothing comes to mind. If you don't fully trust your reference panel, you could combine annotations of the SNPs in the panel from other sources (e.g. mapping-related scores, depending on the information you have with your panel) with things like INFO to do additional filtering, but in general that's probably not necessary.

Hope that helps, Robbie

SABiagini commented 2 years ago

Hello Robert, your help was truly valuable and I thank you a lot for it. Can I ask how the info score filter value should be chosen? In Liu et al. 2018 they used 0.4 when working with STITCH, but the process for selecting that specific value is not clear. Thank you. Best wishes, Simone

Youpu-Chen commented 1 year ago

Hi, isn't the dataset from https://www.internationalgenome.org/data-portal/search?q=recombination already on hg38? If so, why use the hg19ToHg38 chain file to lift it over?

Otis.

rwdavies commented 1 year ago

It's been a long time since I did this, but I just had another look. I don't see a README in the tarballs, but I checked on UCSC: in b38, chromosome 22 runs from 1-50,818,468; in b19, chromosome 22 runs from 1-51,304,566. The tail of the chromosome 22 file from the tarball includes final data from 51,217,954 to 51,221,731. This strongly suggests b19. Let me know if you see a reason to suggest otherwise.
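To make that check concrete, a quick sketch (the file path follows the tarball layout used in the script above):

## compare the largest position in the tarball's chr22 file against the
## chromosome 22 lengths in each build (from UCSC)
map <- read.table("CEU/CEU-22-final.txt.gz", header = TRUE)
chr22_len <- c(hg19 = 51304566, hg38 = 50818468)
max(map[, 1]) <= chr22_len
## the max position (~51,221,731) fits within hg19 but exceeds the
## hg38 chromosome length, pointing to b19 coordinates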

For the question I forgot to answer above, about why 0.4: I don't have a great answer. Any cutoff is a tradeoff between sensitivity and specificity, and I've often found 0.4 to be a reasonable one.