Closed jorainer closed 5 years ago
Checking the position of the 30nt sequence stretch within transcripts:
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
gpos <- GRanges("16:90000518-90000538")
txpos <- genomeToTranscript(gpos, edb)
IRangesList of length 1
[[1]]
IRanges object with 4 ranges and 6 metadata columns:
start end width | exon_id exon_rank
<integer> <integer> <integer> | <character> <integer>
ENST00000355531 1294 1314 21 | ENSE00001644225 7
ENST00000388970 1431 1451 21 | ENSE00001644225 12
ENST00000454997 1684 1704 21 | ENSE00001735108 3
ENST00000557444 2297 2317 21 | ENSE00002527808 17
seq_start seq_end seq_name seq_strand
<integer> <integer> <character> <character>
ENST00000355531 90000518 90000538 16 *
ENST00000388970 90000518 90000538 16 *
ENST00000454997 90000518 90000538 16 *
ENST00000557444 90000518 90000538 16 *
So, we find the same positions reported above. Now let's get the DNA sequence of the respective genomic region. getGenomeFaFile
will retrieve the DNA sequence (as a FA file) for the matching genome release from AnnotationHub
.
dna <- getGenomeFaFile(edb)
snapshotDate(): 2018-10-24
Returning the Fasta file for Ensembl version 81 since no file for Ensembl version 86 is available.
downloading 0 resources
loading from cache
'/Users/jo//.AnnotationHub/55651'
'/Users/jo//.AnnotationHub/55652'
getSeq(dna, gpos)
A DNAStringSet instance of length 1
width seq names
[1] 21 CCCTTCTCGGAGAAGTCTACC 16
This sequence matches the sequence mentioned in the previous post. Next we extract the transcript model for ENST00000388970
(i.e. the genomic position of its exons) and use extractTranscriptSeqs
to retrieve its sequence. From this we extract the subsequence matching the positions of the genomic region within the transcript:
exns <- exonsBy(edb, filter = ~ tx_id == "ENST00000388970")
txseq <- extractTranscriptSeqs(dna, exns)
subseq(txseq, 1431, 1451)
A DNAStringSet instance of length 1
width seq names
[1] 21 CCCTTCTCGGAGAAGTCTACC ENST00000388970
So, the sequence within the transcript matches the one from the genome. Also, if we look up the sequence in the ensembl genome browser we get the same result: http://www.ensembl.org/Homo_sapiens/Location/View?r=16%3A90000518-90000538
The question now is why is the position within NR_003228.1 different from the one within ENST00000388970. For this we download the fasta file from https://www.ncbi.nlm.nih.gov/nuccore/NR_003228.1?report=fasta&log$=seqview&format=text
Next we load the sequence from the fasta file
NCBI_seq <- rtracklayer::import(fafile)
NCBI_seq
A DNAStringSet instance of length 1
width seq names
[1] 1941 TCTCCCATTGCGCATGCGCGCGC...CAAATTAAAACGTGACAACTGTG NR_003228.1 Homo ...
It seems that the ENST00000388970 does not completely match NR_003228.1 as their sequence lengths differ:
width(NCBI_seq)
[1] 1941
width(txseq)
[1] 1466
Also, the sequences are apparently different (see also http://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000223959;r=16:89977694-90000553;t=ENST00000388970):
NCBI_seq
A DNAStringSet instance of length 1
width seq names
[1] 1941 TCTCCCATTGCGCATGCGCGCGC...CAAATTAAAACGTGACAACTGTG NR_003228.1 Homo ...
txseq
A DNAStringSet instance of length 1
width seq names
[1] 1466 TCATCTGGCCTCCTGTTCCAGTG...AGTCTACCTATGAGGAGTTTGTA ENST00000388970
The reason for the difference in the mapping is thus that the two transcripts (Ensembl and NCBI) are not identical.
Following problem:
Accoring to ensembldb the mapping is to position 1441, but the sequence of the positions 16:90000518-90000538 in UCSC is
CCCTTCTCGGAGAAGTCTACC
. The position of this sequence in the transcript NR_003228.1 (corresponds to ENST00000388970) is however 1682.What is now the reason of the difference in the reported positions?