hoffmangroup / umap

25 stars 2 forks source link

Inconsistencies between GRCh38 BedGraph download from HoffmanLab and bedgraph created from UCSC bigWig file #21

Open toddajohnson opened 2 months ago

toddajohnson commented 2 months ago

My end goal is to annotate a VCF file with mappability scores, but in spot checking annotations for some mutation sites with the UCSC Genome Browser, I noticed some inconsistencies. Specifically, I annotated variants using bcftools annotate with a slightly modified (header removed) BedGraph file https://bismap.hoffmanlab.org/raw/hg38/k100.umap.bedgraph.gz. The variant that I first noticed was at chr1:1641723, for which multi-read k100 UMAP value as 1.0, but UCSC's value was 0.18.

So, to check this further, I downloaded the following, and subsetted data for chr1.

wget https://bismap.hoffmanlab.org/raw/hg38/k100.umap.bedgraph.gz wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph wget https://hgdownload.soe.ucsc.edu/gbdb/hg38/hoffmanMappability/k100.Umap.MultiTrackMappability.bw

bigWigToBedGraph -chrom=chr1 k100.Umap.MultiTrackMappability.bw k100.Umap.MultiTrackMappability.chr1.bedgraph

I imported the BW->BG file k100.Umap.MultiTrackMappability.chr1.bedgraph and the k100.umap.bedgraph.gz into R and merged the tables on CHROM, FROM, and TO. Note, to line up the tables, the FROM and TO values in the raw BedGraph file were one base higher than the matching rows in the file output by bigWigToBedGraph, so I subtracted 1 from FROM and TO for merging tables. That seems a little odd to me, as it seems like both should have consistent start and end positions.

I thought originally that each row should match one in the other file, but only 57.9% matched exactly:

The plots below compare the values between the two files around the above mutation site. I added status columns to show the sources of data for each row and the length of the regions encoded by a particular row (end with ".l"). Notably, from row 18, the raw BedGraph has a value of 1.0 for the 86 basepair region chr1:1641640-1641726 that includes chr1:1641723, but the BW->BG file has various values for those positions that continue down past 1641726 to position 1641739.

Also, if we line up the BW->BG values down from row 19 and the raw BG file's values from row 104 (from position 1641726), the values are equal until position 1642025, suggesting that something occurred in creating the raw BedGraph to shift the original values down.

chr1_top_rows

chr1_bottom_rows

mehrankr commented 2 months ago

Thank you @toddajohnson for raising this.

For all intents and purposes, please use the files already deposited to UCSC Genome Browser as they have gone through multiple checks.

As your results suggest, there seems to be something wrong with the bedgraph file hosted on our website.

If you want to double check the accuracy of the files, the main reference would be the the .unique.gz files. This repository contains scripts for processing the content of these arrays and convert them to unique/multi-mappability measurements.

Unfortunately I do not have access to our server, so addressing this might take some time. I know @EricR86 is improving the code base and once done, I will ask him to regenerate all of the files and add checks to ensure the accuracy of the file contents.

I apologize for the inconsistency.

Best, Mehran

EricR86 commented 2 months ago

I agree using the UCSC as the source when possible and if you wish to regenerate the resulting bedGraph/wig files you should use the existing .unique files to convert it yourself when possible.

There is a significant update to address these issues in the works and is currently undergoing some significant testing.