lawremi / rtracklayer

R interface to genome annotation files and the UCSC genome browser

seqnames, ranges and strand cannot be extracted from gtf #51

Closed Shicheng-Guo closed 3 years ago

Shicheng-Guo commented 3 years ago

Here is what I found: seqnames, ranges, and strand cannot be extracted from the imported GTF. Any suggestions?

gtf <- rtracklayer::import('Homo_sapiens.GRCh38.103.gtf')
input <- gtf[gtf$type == "gene",]
seqnames<-gtf$seqnames
ranges<-gtf$ranges
gene_name<-gtf$gene_name
gene_id<-gtf$gene_id
transcript_id<-gtf$transcript_id
strand<-gtf$strand
length(seqnames)
length(ranges)
length(gene_name)
length(gene_id)
length(transcript_id)
length(strand)
input<-data.frame(seqnames,ranges,gene_name,gene_id,transcript_id,strand)

> length(seqnames)
[1] 0
> length(ranges)
[1] 0
> length(gene_name)
[1] 3074360
> length(gene_id)
[1] 3074360
> length(transcript_id)
[1] 3074360
> length(strand)
[1] 0

Interestingly, 60666 entries have a gene ID but no transcript ID. Why?

> input<-data.frame(gene_name,gene_id,transcript_id)
> dim(input)
[1] 3074360       3
> input<-na.omit(input)
> dim(input)
[1] 3013694       3

lawremi commented 3 years ago

import() returns a GRanges object, so please check its documentation for retrieving those fields (using accessors).
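
To illustrate the accessor approach, here is a minimal sketch. It assumes the GenomicRanges package is installed and uses a small hand-built GRanges (the toy values and the names gr, genes and na_types are made up for illustration) in place of the object returned by import(). On a GRanges, `$` only reaches the metadata columns (gene_id, gene_name, transcript_id, ...), which is why gtf$seqnames, gtf$ranges and gtf$strand came back with length 0 above; the core fields are read with seqnames(), ranges() (or start() and end()) and strand().

```r
library(GenomicRanges)

# Toy stand-in for the GRanges returned by rtracklayer::import() on a GTF
# (values are invented for illustration only)
gr <- GRanges(
  seqnames = c("1", "1", "2"),
  ranges = IRanges(start = c(100, 100, 500), end = c(900, 400, 800)),
  strand = c("+", "+", "-"),
  type = c("gene", "transcript", "gene"),
  gene_id = c("G1", "G1", "G2"),
  gene_name = c("A", "A", "B"),
  transcript_id = c(NA, "T1", NA)
)

# Keep only the gene-level records, as in the original post
genes <- gr[gr$type == "gene"]

# Core fields need accessor functions; metadata columns work with `$`
input <- data.frame(
  seqnames = as.character(seqnames(genes)),
  start = start(genes),
  end = end(genes),
  strand = as.character(strand(genes)),
  gene_name = genes$gene_name,
  gene_id = genes$gene_id
)

# Which feature types lack a transcript_id?
na_types <- table(gr$type[is.na(gr$transcript_id)])
```

Calling as.data.frame(genes) is a shorter route when all columns are wanted. As for the rows without a transcript ID: gene-level lines in a GTF typically carry no transcript_id attribute, so a tabulation like na_types on the real file would likely show that the NA rows are the "gene" records.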