openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Support for satellite ensembl database sites? #126

Open xguse opened 9 years ago

xguse commented 9 years ago

Is there a way to use pyensembl with some of the other genome databases built with Ensembl's code? Examples are all the sites hosted under http://ensemblgenomes.org/.

Also, https://www.vectorbase.org uses the Ensembl code but is focused specifically on vectors of disease.

I would VERY much like to use pyensembl. It SEEMS like this would/SHOULD be as simple as pointing to a non-hardcoded URL, but I have not been able to find documentation about this.

Any thoughts?

Gus

iskandr commented 9 years ago

Hi @xguse,

We haven't tried this much, but you should be able to use external data sources through the pyensembl.Genome class. It's a bit verbose, though, so let me know if you have ideas for a cleaner API:

In [1]: from pyensembl import Genome

In [2]: print(Genome.__init__.__doc__)

        Parameters
        ----------
        reference_name : str
            Name of genome assembly which annotations in GTF are aligned against
            (and from which sequence data is drawn)

        annotation_name : str
            Name of annotation source (e.g. "Ensembl")

        annotation_version : int or str
            Version of annotation database (e.g. 75)

        gtf_path_or_url : str
            Path or URL of GTF file

        transcript_fasta_path_or_url : str
            Path or URL of FASTA file containing transcript sequences

        protein_fasta_path_or_url : str
            Path or URL of FASTA file containing protein sequences

        decompress_on_download : bool
            If remote file is compressed, decompress the local copy?

        copy_local_files_to_cache : bool
            If genome data file is local use it directly or copy to cache first?

        require_ensembl_ids : bool
            Check gene/transcript/exon IDs to make sure they start with "ENS"

In [3]: genome = Genome(
    reference_name="GCA_000763955.1", 
    annotation_name="ensembl", 
    annotation_version=28, 
    gtf_path_or_url="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-28/gtf/bacteria_86_collection/_clostridium_innocuum/_clostridium_innocuum.GCA_000763955.1.28.gtf.gz", 
    transcript_fasta_path_or_url="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-28/fasta/bacteria_86_collection/_clostridium_innocuum/cdna/_clostridium_innocuum.GCA_000763955.1.28.cdna.all.fa.gz", 
    protein_fasta_path_or_url="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-28/fasta/bacteria_86_collection/_clostridium_innocuum/pep/_clostridium_innocuum.GCA_000763955.1.28.pep.all.fa.gz")

In [4]: genome.download()

In [5]: genome.index()

In [6]: genome.transcripts()
Out[6]:
[Transcript(id=CIAN88_00035, name=CIAN88_00035, gene_id=CIAN88_00035, gene_name=CIAN88_00035, biotype=tRNA, location=CONTIG1:6377-6463),
 Transcript(id=CIAN88_00265, name=CIAN88_00265, gene_id=CIAN88_00265, gene_name=CIAN88_00265, biotype=tRNA, location=CONTIG1:52499-52572),
 Transcript(id=CIAN88_08055, name=CIAN88_08055, gene_id=CIAN88_08055, gene_name=CIAN88_08055, biotype=ncRNA, location=CONTIG35:100340-100697),
 Transcript(id=CIAN88_09310, name=CIAN88_09310, gene_id=CIAN88_09310, gene_name=CIAN88_09310, biotype=tRNA, location=CONTIG40:19324-19411),
 Transcript(id=CIAN88_09325, name=CIAN88_09325, gene_id=CIAN88_09325, gene_name=CIAN88_09325, biotype=tRNA, location=CONTIG40:21994-22069),
 Transcript(id=CIAN88_09705, name=CIAN88_09705, gene_id=CIAN88_09705, gene_name=CIAN88_09705, biotype=tRNA, location=CONTIG41:13257-13333),
...
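
The three URLs in the session above share one naming pattern, so they could be generated rather than typed out. The following is only a sketch: `ensembl_genomes_kwargs` is a hypothetical helper, not part of pyensembl, and the directory layout is inferred from that single bacterial example, so it may not hold for other divisions or releases. Only the keyword names come from the `Genome.__init__` docstring shown above.

```python
# Hypothetical helper (not part of pyensembl): builds keyword arguments for
# pyensembl.Genome from the URL pattern seen in the example above.
# Assumption: the layout of ftp.ensemblgenomes.org matches that one example.
BASE = "ftp://ftp.ensemblgenomes.org/pub"


def ensembl_genomes_kwargs(division, collection, species, assembly, release):
    """Return a dict of Genome keyword arguments for one Ensembl Genomes species."""
    stem = f"{species}.{assembly}.{release}"          # shared file name prefix
    root = f"{BASE}/{division}/release-{release}"     # per-release directory
    fasta = f"{root}/fasta/{collection}/{species}"    # parent of cdna/ and pep/
    return {
        "reference_name": assembly,
        "annotation_name": "ensembl",
        "annotation_version": release,
        "gtf_path_or_url": f"{root}/gtf/{collection}/{species}/{stem}.gtf.gz",
        "transcript_fasta_path_or_url": f"{fasta}/cdna/{stem}.cdna.all.fa.gz",
        "protein_fasta_path_or_url": f"{fasta}/pep/{stem}.pep.all.fa.gz",
    }


kwargs = ensembl_genomes_kwargs(
    division="bacteria",
    collection="bacteria_86_collection",
    species="_clostridium_innocuum",
    assembly="GCA_000763955.1",
    release=28,
)
print(kwargs["gtf_path_or_url"])
```

With pyensembl installed, the resulting dict could be passed as `Genome(**kwargs)`, followed by `download()` and `index()` as in the session above.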

iskandr commented 9 years ago

@tavinathanson I'm wondering if you have ideas about how we can expose these alternative data sources more cleanly.

xguse commented 9 years ago

First, thanks for your help. This is basically what I ended up trying but through the command line interface. It seemed that things "built" but I haven't gone much further.

This brings up another question I had. I don't see anywhere to supply the genome sequence itself, only the gene products (transcripts/peptides). Is pyensembl more focused on genes than genomes?

iskandr commented 9 years ago

Our use case so far has been supporting variant effect annotation for human (& mouse) genomes, so we've only needed transcripts and peptides. I've thought about adding full genome sequences for a while, but it implies a lot of possible changes, which I thought should be guided by a concrete use case (e.g. possibly adding Chromosome or Intron objects).

It would be instructive for us to see how you'd like to (ideally) use PyEnsembl and where full genome sequences fit into that.
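
To make the Intron idea concrete, here is a minimal sketch of where a genome sequence would come in. It uses plain tuples and strings rather than any pyensembl objects. `intron_intervals` and `slice_sequence` are hypothetical names, and the coordinates are assumed to be 1-based and inclusive, as in GTF. Intron coordinates can be derived from exon coordinates alone, but intron sequences require the contig sequence, which the transcript and peptide FASTA files do not contain.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed 1-based inclusive (start, end), as in GTF


def intron_intervals(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of a single transcript."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only record a gap if the two exons are not adjacent or overlapping.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


def slice_sequence(contig_sequence: str, interval: Interval) -> str:
    """Extract the bases of a 1-based inclusive interval from a contig string."""
    start, end = interval
    return contig_sequence[start - 1:end]


# Toy data, made up for illustration only.
exons = [(1, 4), (9, 12)]
contig = "ATGGGTAAGTTTCC"

introns = intron_intervals(exons)
intron_sequences = [slice_sequence(contig, i) for i in introns]
print(introns)           # [(5, 8)]
print(intron_sequences)  # ['GTAA']
```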