hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

Question about mappability bed and downstream filtering #311

Closed: toddajohnson closed this issue 1 year ago

toddajohnson commented 2 years ago

Since I started using HMF's programs, I have been annotating variants with the mappability score, but, stupid me, I just realized that I never actually used it anywhere for downstream filtering. I have searched through the SAGE/PAVE/PURPLE README files, but I do not see any suggestion for downstream filtering using the MAPPABILITY score in the VCF files. Also, what is the source of the mappability scores, and are they restricted to genic regions?

I ask because, when checking the regions around some germline variants that I had just called and annotated but that lacked a MAPPABILITY annotation, I found that some variants are not in regions present in the bed file, even though they appear in highly mappable regions in the UCSC genome browser. For instance, chr2:150,220,696-150,220,696 has a UMAP K100 probability of 1 in UCSC, but the bed file has no entries from about 5 kb below to about 20 kb above that position. Spot checking some nearby entries, the data also seem to differ: chr2:150240156 has score = 0.0145 in the bed file, but a probability of 1 in the k100 track.

p-priestley commented 2 years ago

Hi Todd - sorry for the really delayed response. To quickly answer your questions:

The mappability is for information purposes only

To construct the mappability bed file, we assessed mappability based on 2 errors in a 150 base region centred on the location.
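
For anyone who wants to repeat the kind of spot check described in the question, here is a minimal Python sketch of looking up a single position in a mappability bed file. It is not taken from hmftools. It assumes a four-column layout (chromosome, 0-based start, end, score) with non-overlapping intervals, which may not match the actual file, and the function names `load_bed` and `lookup` and the example intervals are made up for illustration.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into per-chromosome lists sorted by start.

    Assumed (hypothetical) layout: chrom, start, end, score.
    """
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, score = line.split()[:4]
        intervals[chrom].append((int(start), int(end), float(score)))
    for chrom in intervals:
        intervals[chrom].sort()
    return intervals


def lookup(intervals, chrom, pos):
    """Return the score covering a 1-based position, or None if uncovered.

    Assumes the intervals on each chromosome do not overlap.
    """
    entries = intervals.get(chrom, [])
    pos0 = pos - 1  # BED intervals are 0-based and half-open
    # Index of the last interval whose start is <= pos0.
    i = bisect.bisect_right(entries, (pos0, float("inf"), float("inf"))) - 1
    if i >= 0:
        start, end, score = entries[i]
        if start <= pos0 < end:
            return score
    return None


# Toy intervals, invented for illustration only (not real mappability data).
toy_bed = [
    "chr2\t100\t200\t0.5",
    "chr2\t300\t400\t1.0",
]
toy_intervals = load_bed(toy_bed)
print(lookup(toy_intervals, "chr2", 150))  # covered by the first interval
print(lookup(toy_intervals, "chr2", 250))  # falls in the gap between intervals
```

A `None` result corresponds to the case raised in the question, where a variant falls in a region with no entry in the bed file and so receives no MAPPABILITY annotation.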