jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

Need for the filter on Tags (TSL:1, MANE Select etc.) #115

Closed Tim-Yu closed 2 years ago

Tim-Yu commented 3 years ago

Hi, it would be great if there is a filter on the tags. BW

jorainer commented 3 years ago

Could you please elaborate on that? What tags are you referring to?

Tim-Yu commented 3 years ago

Thanks for your reply. I mean the transcripts flags as shown below. image

I found you might not have included the transcript support level, MANE Select during the building of the EnsDbSQLite database. It will be great if you could include these tags~

jorainer commented 3 years ago

Thanks for the clarification. I will have a look into this. Adding new content is always a little tricky. But note that the transcript support level is already available in column "tx_support_level" for transcripts:

> library(AnnotationHub)
> ah <- AnnotationHub()
query(ah, "EnsDb.Hsapiens.v100")
AnnotationHub with 1 record
# snapshotDate(): 2021-01-14
# names(): AH79689
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2020-04-27
# $title: Ensembl 100 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("100", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH79689"]]' 
> edb <- ah[["AH79689"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
> transcripts(edb)
GRanges object with 250776 ranges and 9 metadata columns:
                  seqnames            ranges strand |           tx_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENST00000456328        1       11869-14409      + | ENST00000456328
  ENST00000450305        1       12010-13670      + | ENST00000450305
              ...      ...               ...    ... .             ...
  ENST00000435741        Y 26626520-26627159      - | ENST00000435741
  ENST00000431853        Y 56855244-56855488      + | ENST00000431853
                              tx_biotype tx_cds_seq_start tx_cds_seq_end
                             <character>        <integer>      <integer>
  ENST00000456328   processed_transcript             <NA>           <NA>
  ENST00000450305 transcribed_unproces..             <NA>           <NA>
              ...                    ...              ...            ...
  ENST00000435741   processed_pseudogene             <NA>           <NA>
  ENST00000431853   processed_pseudogene             <NA>           <NA>
                          gene_id tx_support_level     tx_id_version gc_content
                      <character>        <integer>       <character>  <numeric>
  ENST00000456328 ENSG00000223972                1 ENST00000456328.2    55.3410
  ENST00000450305 ENSG00000223972             <NA> ENST00000450305.2    58.0696
              ...             ...              ...               ...        ...
  ENST00000435741 ENSG00000231514             <NA> ENST00000435741.1    55.1562
  ENST00000431853 ENSG00000235857             <NA> ENST00000431853.1    53.0612
                          tx_name
                      <character>
  ENST00000456328 ENST00000456328
  ENST00000450305 ENST00000450305
              ...             ...
  ENST00000435741 ENST00000435741
  ENST00000431853 ENST00000431853
  -------
  seqinfo: 454 sequences from GRCh38 genome
> 
Tim-Yu commented 3 years ago

Thanks to your code, I finally figured out that my 'query style' only returned UCSC version which apparently does not include too much metadata....

> ah <- AnnotationHub() %>% query(c("Homo sapiens","GTF","100","Ensembl"))
snapshotDate(): 2020-10-27
> gtf <- ah[4]
AnnotationHub with 1 record
# snapshotDate(): 2020-10-27
# names(): AH5015
# $dataprovider: UCSC
# $species: Homo sapiens
# $rdataclass: GRanges
# $rdatadateadded: 2013-03-26
# $title: Recomb Rate
# $description: GRanges object from UCSC track 'Recomb Rate'
# $taxonomyid: 9606
# $genome: hg19
# $sourcetype: UCSC track
# $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/recombRate
# $sourcesize: NA
# $tags: c("recombRate", "UCSC", "track", "Gene", "Transcript", "Annotation") 
# retrieve record with 'object[["AH5015"]]' 
> DbFile <- ensDbFromAH(gtf)
> edb <- EnsDb(DbFile)
> keytypes(edb)
 [1] "ENTREZID"    "EXONID"      "GENEBIOTYPE" "GENEID"      "GENENAME"    "SEQNAME"    
 [7] "SEQSTRAND"   "SYMBOL"      "TXBIOTYPE"   "TXID"        "TXNAME"
jorainer commented 3 years ago

Ah, yes. I would suggest you use the pre-built EnsDb database I distribute via AnnotationHub. I build them for every new Ensembl release (currently running the code for Ensembl 103) and they contain more annotations than what you can get from a GTF or GFF file.

jon4thin commented 2 years ago

As a followup for this, how does one filter to include NA records?

I tried a few guesses, including: CodingTranscripts_with_TSL_NA<- transcripts(ahEdb, filter=list(GeneBiotypeFilter("protein_coding"), TxSupportLevelFilter(is.na(value) == T) )

jorainer commented 2 years ago

I'm afraid that is not possible. The filters are internally translated to SQL calls and we did not foresee to filter also for missing values.

As a workaround you could do the additional subsetting by tx support level in R:

tx <- transcripts(edb, filter = ~ gene_biotype == "protein_coding")
tx <- tx[is.na(tx$tx_support_level)]
jon4thin commented 2 years ago

@jorainer I think I actually got it! If one changes the "condition" to "!=" - or not equal. I didnt understand how to change the condition at first. Thank you.

Granges_protein_coding_transcripts_TSL_1_or_NA <- transcripts(ahEdb, filter=list(GeneBiotypeFilter("protein_coding"), TxSupportLevelFilter(value = c(2,3,4,5), condition = "!=" )))