scverse / genomic-features

Genomic Features in Python from BioConductor's AnnotationHub
https://genomic-features.readthedocs.io
BSD 3-Clause "New" or "Revised" License

UCSC databases (TxDB) #11

Open ivirshup opened 1 year ago

ivirshup commented 1 year ago

Description of feature

Support getting UCSC annotation data from Bioconductor's TxDB sources.

ivirshup commented 3 months ago

It would be useful to write down what exactly the differences between EnsDB and TxDB are.

To play around with this:

access via ibis

import genomic_features as gf
import ibis

!wget https://bioconductorhubs.blob.core.windows.net/annotationhub/ucsc/standard/3.15/TxDb.Hsapiens.UCSC.hg38.knownGene.sqlite

ensdb = gf.ensembl.annotation(species="Hsapiens", version="108").db
ucscdb = ibis.connect("TxDb.Hsapiens.UCSC.hg38.knownGene.sqlite")

for tbl_name in ensdb.list_tables():
    print(tbl_name, ensdb.table(tbl_name).schema())
EnsDB schema

```python
chromosome ibis.Schema {
  seq_name string
  seq_length int32
  is_circular int32
}
entrezgene ibis.Schema {
  gene_id string
  entrezid int32
}
exon ibis.Schema {
  exon_id string
  exon_seq_start int32
  exon_seq_end int32
}
gene ibis.Schema {
  gene_id string
  gene_name string
  gene_biotype string
  gene_seq_start int32
  gene_seq_end int32
  seq_name string
  seq_strand int32
  seq_coord_system string
  description string
  gene_id_version string
  canonical_transcript string
}
metadata ibis.Schema {
  name string
  value string
}
protein ibis.Schema {
  tx_id string
  protein_id string
  protein_sequence string
}
protein_domain ibis.Schema {
  protein_id string
  protein_domain_id string
  protein_domain_source string
  interpro_accession string
  prot_dom_start int32
  prot_dom_end int32
}
tx ibis.Schema {
  tx_id string
  tx_biotype string
  tx_seq_start int32
  tx_seq_end int32
  tx_cds_seq_start int32
  tx_cds_seq_end int32
  gene_id string
  tx_support_level int32
  tx_id_version string
  gc_content float64
  tx_external_name string
  tx_is_canonical int32
}
tx2exon ibis.Schema {
  tx_id string
  exon_id string
  exon_idx int32
}
uniprot ibis.Schema {
  protein_id string
  uniprot_id string
  uniprot_db string
  uniprot_mapping_type string
}
```

for tbl_name in ucscdb.list_tables():
    print(tbl_name, ucscdb.table(tbl_name).schema())
TxDB schema

```
cds ibis.Schema {
  _cds_id int32
  cds_name string
  cds_chrom !string
  cds_strand !string
  cds_start !int32
  cds_end !int32
}
chrominfo ibis.Schema {
  _chrom_id int32
  chrom !string
  length int32
  is_circular int32
}
exon ibis.Schema {
  _exon_id int32
  exon_name string
  exon_chrom !string
  exon_strand !string
  exon_start !int32
  exon_end !int32
}
gene ibis.Schema {
  gene_id !string
  _tx_id !int32
}
metadata ibis.Schema {
  name string
  value string
}
splicing ibis.Schema {
  _tx_id !int32
  exon_rank !int32
  _exon_id !int32
  _cds_id int32
  cds_phase int32
}
transcript ibis.Schema {
  _tx_id int32
  tx_name string
  tx_type string
  tx_chrom !string
  tx_strand !string
  tx_start !int32
  tx_end !int32
}
```
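
As a starting point for writing the differences down, here is a minimal sketch that diffs the two listings at the table and column level. The dictionaries are transcribed by hand from the schema output above; `compare_schemas` is a hypothetical helper for illustration, not part of genomic_features or ibis.

```python
# Table -> column names, transcribed from the two schema listings above.
ENSDB = {
    "chromosome": {"seq_name", "seq_length", "is_circular"},
    "entrezgene": {"gene_id", "entrezid"},
    "exon": {"exon_id", "exon_seq_start", "exon_seq_end"},
    "gene": {
        "gene_id", "gene_name", "gene_biotype", "gene_seq_start",
        "gene_seq_end", "seq_name", "seq_strand", "seq_coord_system",
        "description", "gene_id_version", "canonical_transcript",
    },
    "metadata": {"name", "value"},
    "protein": {"tx_id", "protein_id", "protein_sequence"},
    "protein_domain": {
        "protein_id", "protein_domain_id", "protein_domain_source",
        "interpro_accession", "prot_dom_start", "prot_dom_end",
    },
    "tx": {
        "tx_id", "tx_biotype", "tx_seq_start", "tx_seq_end",
        "tx_cds_seq_start", "tx_cds_seq_end", "gene_id", "tx_support_level",
        "tx_id_version", "gc_content", "tx_external_name", "tx_is_canonical",
    },
    "tx2exon": {"tx_id", "exon_id", "exon_idx"},
    "uniprot": {"protein_id", "uniprot_id", "uniprot_db", "uniprot_mapping_type"},
}
TXDB = {
    "cds": {"_cds_id", "cds_name", "cds_chrom", "cds_strand", "cds_start", "cds_end"},
    "chrominfo": {"_chrom_id", "chrom", "length", "is_circular"},
    "exon": {"_exon_id", "exon_name", "exon_chrom", "exon_strand", "exon_start", "exon_end"},
    "gene": {"gene_id", "_tx_id"},
    "metadata": {"name", "value"},
    "splicing": {"_tx_id", "exon_rank", "_exon_id", "_cds_id", "cds_phase"},
    "transcript": {"_tx_id", "tx_name", "tx_type", "tx_chrom", "tx_strand", "tx_start", "tx_end"},
}


def compare_schemas(a, b):
    """Hypothetical helper: tables only in a, only in b, and shared columns per shared table."""
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    shared_cols = {t: sorted(a[t] & b[t]) for t in sorted(a.keys() & b.keys())}
    return only_a, only_b, shared_cols


only_ens, only_ucsc, shared = compare_schemas(ENSDB, TXDB)
print("EnsDB only:", only_ens)
print("TxDB only:", only_ucsc)
for table, cols in shared.items():
    print(f"{table}: shared columns {cols}")
```

Only three table names overlap, and even there the column names mostly differ, so a name-based comparison like this one understates how much content is equivalent under different names (for example `tx` vs `transcript`).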

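The UCSC SQLite file does appear to carry less information: it has no protein, protein_domain, uniprot, or entrezgene tables, no gene names or gene biotypes, and its gene table only maps gene_id to _tx_id.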
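
One consequence: the TxDB `gene` table has no coordinates, so anything resembling EnsDB's `gene_seq_start` / `gene_seq_end` would have to be derived from the transcripts. Below is a rough sketch of that join using the standard-library `sqlite3` module on a toy in-memory database. The table layouts follow the TxDB schema above, but the rows are made up, and taking the min/max over a gene's transcripts is an assumption about how gene ranges could be defined, not something either database specifies.

```python
import sqlite3

# Toy in-memory stand-in for the TxDB file; rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE transcript (
        _tx_id INTEGER PRIMARY KEY, tx_name TEXT, tx_type TEXT,
        tx_chrom TEXT NOT NULL, tx_strand TEXT NOT NULL,
        tx_start INTEGER NOT NULL, tx_end INTEGER NOT NULL
    );
    CREATE TABLE gene (gene_id TEXT NOT NULL, _tx_id INTEGER NOT NULL);
    INSERT INTO transcript VALUES
        (1, 'txA.1', NULL, 'chr1', '+', 100, 500),
        (2, 'txA.2', NULL, 'chr1', '+', 150, 900),
        (3, 'txB.1', NULL, 'chr2', '-', 40, 80);
    INSERT INTO gene VALUES ('geneA', 1), ('geneA', 2), ('geneB', 3);
    """
)

# Join gene -> transcript on _tx_id and collapse to one range per gene,
# renaming columns to their closest EnsDB counterparts.
GENE_RANGES = """
SELECT g.gene_id,
       t.tx_chrom      AS seq_name,
       t.tx_strand     AS seq_strand,
       MIN(t.tx_start) AS gene_seq_start,
       MAX(t.tx_end)   AS gene_seq_end
FROM gene AS g
JOIN transcript AS t ON t._tx_id = g._tx_id
GROUP BY g.gene_id, t.tx_chrom, t.tx_strand
ORDER BY g.gene_id
"""
gene_ranges = con.execute(GENE_RANGES).fetchall()
for row in gene_ranges:
    print(row)
```

Note that the strand stays a string here, whereas EnsDB stores `seq_strand` as an int32, so a real adapter would also need a value conversion. Pointing `sqlite3.connect` at the downloaded `TxDb.Hsapiens.UCSC.hg38.knownGene.sqlite` file instead of `:memory:` (and dropping the `executescript` setup) should let the same query be tried on real data.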

Docs/ links

It's probably worth looking into how the Bioconductor packages handle having two different schemas. For example, do they subclass a common base, and are the annotation filters aware of which schema they are querying?

cc: @nvictus

ivirshup commented 3 months ago

Re discussion about nonstandard chromosome names @nvictus: https://github.com/jorainer/ensembldb/issues/88