So I've been experimenting with using the gnomAD liftover tables to do a direct mapping between hg37 loci/alleles and hg38 loci/alleles.
TL;DR:
This script will take the full exome and genome liftover sites Hail tables from gnomAD and reverse their mapping from hg38->hg37 to hg37->hg38 by re-keying the tables on the hg37 locus and alleles fields, and will also remove the extra fields to save some disk space.
These tables are really big and difficult to work with, but with some preprocessing we might be able to store local copies that can be used to lift over coordinates as needed.
We start by importing the exome and genome liftover tables and dropping all unnecessary fields, keeping only the locus, alleles, original_locus and original_alleles fields.
The full list of global and row fields can be seen here - Google drive link.
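For illustration, the first step might look roughly like this (the table paths here are placeholders, not necessarily the exact public gnomAD locations):

import hail as hl

hl.init()

# Placeholder paths; substitute the actual gnomAD v2.1.1 liftover sites tables.
exome_liftover_path = "gs://<bucket>/gnomad.exomes.r2.1.1.sites.liftover_grch38.ht"
genome_liftover_path = "gs://<bucket>/gnomad.genomes.r2.1.1.sites.liftover_grch38.ht"

exome_ht = hl.read_table(exome_liftover_path)
genome_ht = hl.read_table(genome_liftover_path)

# select() keeps the key fields (locus, alleles) automatically,
# so this drops everything except the four fields we want.
exome_ht = exome_ht.select("original_locus", "original_alleles")
genome_ht = genome_ht.select("original_locus", "original_alleles")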
We then re-key the tables by the hg37 locus and alleles. The reason for this is that Hail tables can only join on their key fields. So if we want to do the liftover mapping for hg37 loci / alleles, we need the liftover table to be keyed by these fields.
Re-keying the table seems to be computationally expensive, and if we are going to write a re-keyed table to disk it would be handy to only have to do it once. So the decision of which fields to keep is important, since we probably don't want to run this script more often than we have to.
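A minimal sketch of the re-key and write step, again with placeholder output paths:

# key_by() on non-key fields triggers a shuffle, so write the result once
# and read the re-keyed tables back from disk whenever they are needed.
hg37_exome_liftover_table = exome_ht.key_by("original_locus", "original_alleles")
hg37_exome_liftover_table.write("gs://<bucket>/hg37_to_hg38_exome_liftover.ht", overwrite=True)

hg37_genome_liftover_table = genome_ht.key_by("original_locus", "original_alleles")
hg37_genome_liftover_table.write("gs://<bucket>/hg37_to_hg38_genome_liftover.ht", overwrite=True)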
If the script were run in its current state, the output ht might look something like this:
>>> hg37_exome_liftover_table.show(2)
+---------------+------------+----------------+------------------+
| locus         | alleles    | original_locus | original_alleles |
+---------------+------------+----------------+------------------+
| locus<GRCh38> | array<str> | locus<GRCh37>  | array<str>       |
+---------------+------------+----------------+------------------+
| chr1:12198    | ["G","C"]  | 1:12198        | ["G","C"]        |
| chr1:12237    | ["G","A"]  | 1:12237        | ["G","A"]        |
+---------------+------------+----------------+------------------+
Would these fields be sufficient for all use cases? Or is there more information in the other fields from the original gnomAD tables that may be useful to keep?