plagnollab / DNASeq_pipeline

Pipeline in place at the UGI for DNA level analysis
10 stars 8 forks source link

liftOver build 37 to build 38 #25

Open pontikos opened 9 years ago

pontikos commented 9 years ago

I am using the liftOver tool on the chain file ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz The script which does the liftOver is here: https://github.com/vplagnol/pipelines/blob/master/annotation/liftOver.sh

I find that often the allele freq annotation map to the complementary bases. In build 37 we have

22  16449783    16449784    C>A:0.09770115  C   A   .

But then when we lifover to build 38 we get G>T instead of C>A :

chr22   15528179        .       G       T       3775.47 .       BaseQRankSum=2.15;ClippingRankSum=0.356;DP=438;FS=3.963;InbreedingCoeff=-0.9200;MLEAC=12;MLEAF=0.500;MQ=39.03;MQ0=0;MQRankSum=1.13;QD=8.80;ReadPosRankSum=0.349;CSQ=T|ENSG00000130538|ENST00000252835|Transcript|missense_variant|22|21|7|Q/H|caG/caT|rs2259944&rs200562384&COSM1484041||1|OR11H1|HGNC|HGNC:15404|YES|deleterious(0.02)|possibly_damaging(0.548)|||||||||0&0&1|Neutral(0.979)|deleterious(0.569);EXAC_OTH=C>A:0.30769231;EXAC_NFE=C>A:0.37325905;EXAC_AMR=C>A:0.30487805;EXAC_Adj=C>A:0.32632541;EXAC_AFR=C>A:0.09770115;EXAC_SAS=C>A:0.30555556;EXAC_FIN=C>A:0.50000000;EXAC_EAS=C>A:0.40697674        GT:AD:DP:GQ:PL  0/1:25,16:41:99:395,0,684       0/1:29,9:38:99:153,0,782        0/1:24,14:38:99:340,0,682       0/1:25,13:38:99:305,0,660       0/1:20,4:24:65:65,0,531 0/1:17,9:26:99:210,0,450        0/0:8,0:8:0:0,0,184     0/1:5,8:13:99:188,0,126 0/1:45,15:60:99:336,0,1217      0/1:18,34:52:99:1000,0,387      0/1:22,19:41:99:467,0,595       0/1:41,17:58:99:367,0,1099
pontikos commented 9 years ago

See correspondence below with ensembl helpdesk:

I think the issue is that some genes, probably entire gene sequences, have been reverse complemented between hg19 and hg38. See for example these 2 matching locations:

chr22:16,449,780-16,449,788   hg19 TGACCTGCA
chr22:15,528,175-15,528,183  hg38 TGCAGGTCA

The hg38 sequence is the reverse complement of the hg19 sequence at the matching location. I can understand that this is part of the general upgrade of the genome, even though it is hard to see how such massive changes could happen. In any case, it makes upgrading population frequency data from one build to another very challenging because the alleles are not always matching anymore.

Reply:

We appreciate that it can be difficult switching to a new genome assembly.
Things will have changed, and this is because the previous assembly was known
to be wrong. The genomes were produced by the Genome Reference Consortium; you
can find them here:
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/

They have further information about what has changed between the assemblies and
why. They also have a blog:
http://genomeref.blogspot.co.uk/2013_12_01_archive.html

We now need to see how many annotations have this problem in ExAC and if we can simply drop them.