Closed Tim-Yu closed 2 years ago
Could you please elaborate on that? What tags are you referring to?
Thanks for your reply. I mean the transcripts flags as shown below.
I found you might not have included the transcript support level, MANE Select during the building of the EnsDbSQLite database. It will be great if you could include these tags~
Thanks for the clarification. I will have a look into this. Adding new content is always a little tricky. But note that the transcript support level is already available in column "tx_support_level"
for transcripts:
> library(AnnotationHub)
> ah <- AnnotationHub()
query(ah, "EnsDb.Hsapiens.v100")
AnnotationHub with 1 record
# snapshotDate(): 2021-01-14
# names(): AH79689
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2020-04-27
# $title: Ensembl 100 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("100", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
# "Protein", "Transcript")
# retrieve record with 'object[["AH79689"]]'
> edb <- ah[["AH79689"]]
downloading 1 resources
retrieving 1 resource
|======================================================================| 100%
loading from cache
> transcripts(edb)
GRanges object with 250776 ranges and 9 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <character>
ENST00000456328 1 11869-14409 + | ENST00000456328
ENST00000450305 1 12010-13670 + | ENST00000450305
... ... ... ... . ...
ENST00000435741 Y 26626520-26627159 - | ENST00000435741
ENST00000431853 Y 56855244-56855488 + | ENST00000431853
tx_biotype tx_cds_seq_start tx_cds_seq_end
<character> <integer> <integer>
ENST00000456328 processed_transcript <NA> <NA>
ENST00000450305 transcribed_unproces.. <NA> <NA>
... ... ... ...
ENST00000435741 processed_pseudogene <NA> <NA>
ENST00000431853 processed_pseudogene <NA> <NA>
gene_id tx_support_level tx_id_version gc_content
<character> <integer> <character> <numeric>
ENST00000456328 ENSG00000223972 1 ENST00000456328.2 55.3410
ENST00000450305 ENSG00000223972 <NA> ENST00000450305.2 58.0696
... ... ... ... ...
ENST00000435741 ENSG00000231514 <NA> ENST00000435741.1 55.1562
ENST00000431853 ENSG00000235857 <NA> ENST00000431853.1 53.0612
tx_name
<character>
ENST00000456328 ENST00000456328
ENST00000450305 ENST00000450305
... ...
ENST00000435741 ENST00000435741
ENST00000431853 ENST00000431853
-------
seqinfo: 454 sequences from GRCh38 genome
>
Thanks to your code, I finally figured out that my 'query style' only returned UCSC version which apparently does not include too much metadata....
> ah <- AnnotationHub() %>% query(c("Homo sapiens","GTF","100","Ensembl"))
snapshotDate(): 2020-10-27
> gtf <- ah[4]
AnnotationHub with 1 record
# snapshotDate(): 2020-10-27
# names(): AH5015
# $dataprovider: UCSC
# $species: Homo sapiens
# $rdataclass: GRanges
# $rdatadateadded: 2013-03-26
# $title: Recomb Rate
# $description: GRanges object from UCSC track 'Recomb Rate'
# $taxonomyid: 9606
# $genome: hg19
# $sourcetype: UCSC track
# $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/recombRate
# $sourcesize: NA
# $tags: c("recombRate", "UCSC", "track", "Gene", "Transcript", "Annotation")
# retrieve record with 'object[["AH5015"]]'
> DbFile <- ensDbFromAH(gtf)
> edb <- EnsDb(DbFile)
> keytypes(edb)
[1] "ENTREZID" "EXONID" "GENEBIOTYPE" "GENEID" "GENENAME" "SEQNAME"
[7] "SEQSTRAND" "SYMBOL" "TXBIOTYPE" "TXID" "TXNAME"
Ah, yes. I would suggest you use the pre-built EnsDb
database I distribute via AnnotationHub
. I build them for every new Ensembl release (currently running the code for Ensembl 103) and they contain more annotations than what you can get from a GTF or GFF file.
As a followup for this, how does one filter to include NA records?
I tried a few guesses, including:
CodingTranscripts_with_TSL_NA<- transcripts(ahEdb, filter=list(GeneBiotypeFilter("protein_coding"), TxSupportLevelFilter(is.na(value) == T) )
I'm afraid that is not possible. The filters are internally translated to SQL calls and we did not foresee to filter also for missing values.
As a workaround you could do the additional subsetting by tx support level in R:
tx <- transcripts(edb, filter = ~ gene_biotype == "protein_coding")
tx <- tx[is.na(tx$tx_support_level)]
@jorainer I think I actually got it! If one changes the "condition" to "!=" - or not equal. I didnt understand how to change the condition at first. Thank you.
Granges_protein_coding_transcripts_TSL_1_or_NA <- transcripts(ahEdb, filter=list(GeneBiotypeFilter("protein_coding"), TxSupportLevelFilter(value = c(2,3,4,5), condition = "!=" )))
Hi, it would be great if there is a filter on the tags. BW