macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.

Improper handling of pseudoautosomal region #39

Closed by ManavalanG 6 years ago

ManavalanG commented 7 years ago

File clinvar_alleles.single.b38.tsv.gz has 43 duplicate rows. Here are the allele IDs for those rows: [231307, 150500, 24917, 260558, 150301, 38949, 45434, 45436, 167501, 24912, 190850, 24911, 24915, 257893, 231306, 260559, 25394, 178250, 24914, 178251, 178253, 260556, 189165, 189166, 260560, 45437, 178249, 137687, 231308, 137685, 150305, 99001, 137686, 137689, 150303, 361265, 99002, 150304, 98999, 38950, 23109, 195246, 137688]

On closer inspection, a subset of them is in the pseudoautosomal region (i.e., one row should have had chrX and the other chrY). Allele IDs 231307 and 24915 are part of that subset.
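
For anyone who wants to reproduce the duplicate check, here is a minimal sketch using only the Python standard library. It assumes the file has a header row and a column named `allele_id`; both the column name and the helper name `duplicated_allele_ids` are assumptions for illustration, not something taken from the repo's code.

```python
import csv
import gzip
from collections import Counter


def duplicated_allele_ids(path, id_column="allele_id"):
    """Return the allele IDs found on rows that occur more than once.

    Two rows count as duplicates only if every field is identical.
    `id_column` is an assumed column name; adjust it to the real header.
    """
    # Read gzip-compressed files transparently, plain text otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        id_index = header.index(id_column)

        # Count how many times each complete row appears.
        row_counts = Counter(tuple(row) for row in reader)

    # Keep the ID of every row seen at least twice.
    return sorted({row[id_index] for row, n in row_counts.items() if n > 1})
```

Calling `duplicated_allele_ids("clinvar_alleles.single.b38.tsv.gz")` should list the IDs of the fully identical rows, which can then be compared against the list above.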

ps - At this point, I'm just nitpicking when reporting issues :)

ManavalanG commented 6 years ago

Allele IDs 24914 and 24911 in the pseudoautosomal region should be mapped to both chrX and chrY, but here they are mapped only to chrX. It's possible that this happens to all variants in the pseudoautosomal region.
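
A rough way to check whether this affects every pseudoautosomal variant is sketched below. The PAR boundaries are the commonly cited GRCh38 coordinates (1-based, inclusive) and should be verified against the assembly documentation before relying on them. The function names and the `(allele_id, chrom, pos)` input shape are hypothetical, chosen only for this example.

```python
# Commonly cited GRCh38 PAR1/PAR2 boundaries (1-based, inclusive).
# Assumption: verify these against the assembly documentation.
PAR_GRCH38 = {
    "X": [(10_001, 2_781_479), (155_701_383, 156_030_895)],
    "Y": [(10_001, 2_781_479), (56_887_903, 57_217_415)],
}


def normalize_chrom(chrom):
    """Strip an optional 'chr' prefix so 'chrX' and 'X' compare equal."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def in_par(chrom, pos):
    """True if (chrom, pos) falls inside a pseudoautosomal region."""
    intervals = PAR_GRCH38.get(normalize_chrom(chrom), [])
    return any(start <= pos <= end for start, end in intervals)


def par_ids_on_one_chrom(rows):
    """Return allele IDs located in a PAR but reported on only one of X/Y.

    `rows` is an iterable of (allele_id, chrom, pos) tuples (hypothetical
    shape; extract these fields from the TSV however suits your setup).
    """
    chroms_by_id = {}
    for allele_id, chrom, pos in rows:
        if in_par(chrom, pos):
            chroms_by_id.setdefault(allele_id, set()).add(normalize_chrom(chrom))
    return sorted(a for a, chroms in chroms_by_id.items() if chroms != {"X", "Y"})
```

If the problem is systematic, every allele ID located in a PAR would show up in the result of `par_ids_on_one_chrom`.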