Open pontikos opened 9 years ago
See correspondence below with the Ensembl helpdesk:
I think the issue is that some genes, probably entire gene sequences, have been reverse complemented between hg19 and hg38. See for example these 2 matching locations:
chr22:16,449,780-16,449,788 hg19 TGACCTGCA
chr22:15,528,175-15,528,183 hg38 TGCAGGTCA
The hg38 sequence is the reverse complement of the hg19 sequence at the matching location. I can understand that this is part of the general upgrade of the genome, even though it is hard to see how such massive changes could happen. In any case, it makes upgrading population frequency data from one build to another very challenging because the alleles are not always matching anymore.
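The two sequences quoted above can be checked directly. This is a minimal sketch in plain Python; the helper name `reverse_complement` is my own and does not come from any tool mentioned in this thread.

```python
# Translation table for complementing DNA bases (upper and lower case, N kept as N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


# Sequences quoted above for the two matching locations.
hg19_seq = "TGACCTGCA"  # chr22:16,449,780-16,449,788 (hg19)
hg38_seq = "TGCAGGTCA"  # chr22:15,528,175-15,528,183 (hg38)

# True if the hg38 sequence is the reverse complement of the hg19 one.
print(reverse_complement(hg19_seq) == hg38_seq)
```

For these two sequences the comparison holds, which is consistent with the region having been flipped to the opposite strand between the builds.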
Reply:
We appreciate that it can be difficult switching to a new genome assembly.
Things will have changed, and this is because the previous assembly was known
to be wrong. The genomes were produced by the Genome Reference Consortium; you
can find them here:
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/
They have further information about what has changed between the assemblies and
why. They also have a blog:
http://genomeref.blogspot.co.uk/2013_12_01_archive.html
We now need to see how many annotations in ExAC have this problem and whether we can simply drop them.
I am using the liftOver tool with the chain file ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz. The script that runs liftOver is here: https://github.com/vplagnol/pipelines/blob/master/annotation/liftOver.sh
I find that the allele frequency annotations often map to the complementary bases. In build 37 we have
But then when we liftOver to build 38 we get G>T instead of C>A:
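One possible way to handle this, instead of dropping the affected annotations, is to reverse complement the alleles of any variant that lands in a region whose strand was flipped. The sketch below is only an illustration under assumptions: the function names and the `flipped` flag are hypothetical, and how the flag is obtained (for example from the strand recorded in the chain file, or by comparing against the build 38 reference base) is not something the script linked above is known to do.

```python
# Assumption: alleles are plain A/C/G/T strings (SNVs or short indels).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(allele: str) -> str:
    """Complement every base, then reverse (for a single base this is just the complement)."""
    return allele.translate(_COMPLEMENT)[::-1]


def alleles_after_liftover(ref: str, alt: str, flipped: bool) -> tuple[str, str]:
    """Return (ref, alt) on the target build.

    `flipped` is a hypothetical flag saying the region maps to the
    opposite strand in the target build.
    """
    if flipped:
        return reverse_complement(ref), reverse_complement(alt)
    return ref, alt


def looks_flipped(old_ref_base: str, new_ref_base: str) -> bool:
    """Heuristic for a single-base ref allele: the target reference base
    equals the complement of the source ref allele. This cannot tell a
    strand flip apart from a genuine change of the reference base."""
    return new_ref_base == reverse_complement(old_ref_base)


# The build 37 variant C>A on a flipped region becomes G>T.
print(alleles_after_liftover("C", "A", flipped=True))
print(looks_flipped("C", "G"))
```

With the example from this thread, C>A on a flipped region comes out as G>T, so the population frequency would still attach to the right allele after the alleles are rewritten.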