jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

Discrepancy between UCSC/NCBI and ensembldb in genome to transcript mapping #90

Closed jorainer closed 5 years ago

jorainer commented 5 years ago

Following problem:

library(ensembldb)
library(EnsDb.Hsapiens.v86)
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "16")
gnm <- GRanges("16:90000528-90000528")
gnm_tx <- genomeToTranscript(gnm, edbx)
print(gnm_tx)

the output is

IRangesList of length 1
[[1]]
IRanges object with 4 ranges and 6 metadata columns:
                     start       end     width |
                 <integer> <integer> <integer> |
 ENST00000355531      1304      1304         1 |
 ENST00000388970      1441      1441         1 |
 ENST00000454997      1694      1694         1 |
 ENST00000557444      2307      2307         1 |
                         exon_id exon_rank seq_start
                     <character> <integer> <integer>
 ENST00000355531 ENSE00001644225         7  90000528
 ENST00000388970 ENSE00001644225        12  90000528
 ENST00000454997 ENSE00001735108         3  90000528
 ENST00000557444 ENSE00002527808        17  90000528
                   seq_end    seq_name  seq_strand
                 <integer> <character> <character>
 ENST00000355531  90000528          16           *
 ENST00000388970  90000528          16           *
 ENST00000454997  90000528          16           *
 ENST00000557444  90000528          16           *

Accoring to ensembldb the mapping is to position 1441, but the sequence of the positions 16:90000518-90000538 in UCSC is CCCTTCTCGGAGAAGTCTACC. The position of this sequence in the transcript NR_003228.1 (corresponds to ENST00000388970) is however 1682.

What is now the reason of the difference in the reported positions?

jorainer commented 5 years ago

Checking the position of the 30nt sequence stretch within transcripts:

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
gpos <- GRanges("16:90000518-90000538")
txpos <- genomeToTranscript(gpos, edb)
IRangesList of length 1
[[1]]
IRanges object with 4 ranges and 6 metadata columns:
                      start       end     width |         exon_id exon_rank
                  <integer> <integer> <integer> |     <character> <integer>
  ENST00000355531      1294      1314        21 | ENSE00001644225         7
  ENST00000388970      1431      1451        21 | ENSE00001644225        12
  ENST00000454997      1684      1704        21 | ENSE00001735108         3
  ENST00000557444      2297      2317        21 | ENSE00002527808        17
                  seq_start   seq_end    seq_name  seq_strand
                  <integer> <integer> <character> <character>
  ENST00000355531  90000518  90000538          16           *
  ENST00000388970  90000518  90000538          16           *
  ENST00000454997  90000518  90000538          16           *
  ENST00000557444  90000518  90000538          16           *

So, we find the same positions reported above. Now let's get the DNA sequence of the respective genomic region. getGenomeFaFile will retrieve the DNA sequence (as a FA file) for the matching genome release from AnnotationHub.

dna <- getGenomeFaFile(edb)
snapshotDate(): 2018-10-24
Returning the Fasta file for Ensembl version 81 since no file for Ensembl version 86 is available.
downloading 0 resources
loading from cache 
    '/Users/jo//.AnnotationHub/55651'
    '/Users/jo//.AnnotationHub/55652'

getSeq(dna, gpos)
  A DNAStringSet instance of length 1
    width seq                                               names               
[1]    21 CCCTTCTCGGAGAAGTCTACC                             16

This sequence matches the sequence mentioned in the previous post. Next we extract the transcript model for ENST00000388970 (i.e. the genomic position of its exons) and use extractTranscriptSeqs to retrieve its sequence. From this we extract the subsequence matching the positions of the genomic region within the transcript:

exns <- exonsBy(edb, filter = ~ tx_id == "ENST00000388970")
txseq <- extractTranscriptSeqs(dna, exns)
subseq(txseq, 1431, 1451)
  A DNAStringSet instance of length 1
    width seq                                               names               
[1]    21 CCCTTCTCGGAGAAGTCTACC                             ENST00000388970

So, the sequence within the transcript matches the one from the genome. Also, if we look up the sequence in the ensembl genome browser we get the same result: http://www.ensembl.org/Homo_sapiens/Location/View?r=16%3A90000518-90000538

The question now is why is the position within NR_003228.1 different from the one within ENST00000388970. For this we download the fasta file from https://www.ncbi.nlm.nih.gov/nuccore/NR_003228.1?report=fasta&log$=seqview&format=text

Next we load the sequence from the fasta file

NCBI_seq <- rtracklayer::import(fafile)
NCBI_seq
  A DNAStringSet instance of length 1
    width seq                                               names               
[1]  1941 TCTCCCATTGCGCATGCGCGCGC...CAAATTAAAACGTGACAACTGTG NR_003228.1 Homo ...

It seems that the ENST00000388970 does not completely match NR_003228.1 as their sequence lengths differ:

width(NCBI_seq)
[1] 1941
width(txseq)
[1] 1466

Also, the sequences are apparently different (see also http://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000223959;r=16:89977694-90000553;t=ENST00000388970):

NCBI_seq
  A DNAStringSet instance of length 1
    width seq                                               names               
[1]  1941 TCTCCCATTGCGCATGCGCGCGC...CAAATTAAAACGTGACAACTGTG NR_003228.1 Homo ...

txseq
  A DNAStringSet instance of length 1
    width seq                                               names               
[1]  1466 TCATCTGGCCTCCTGTTCCAGTG...AGTCTACCTATGAGGAGTTTGTA ENST00000388970

The reason for the difference in the mapping is thus that the two transcripts (Ensembl and NCBI) are not identical.