Open xguse opened 9 years ago
Hi @xguse,
We haven't tried this much, but you should be able to use external data sources through the pyensembl.Genome class. It's a bit verbose though, so let me know if you have ideas for a cleaner API:
In [1]: from pyensembl import Genome
In [2]: print(Genome.__init__.__doc__)
Parameters
----------
reference_name : str
Name of genome assembly which annotations in GTF are aligned against
(and from which sequence data is drawn)
annotation_name : str
Name of annotation source (e.g. "Ensembl")
annotation_version : int or str
Version of annotation database (e.g. 75)
gtf_path_or_url : str
Path or URL of GTF file
transcript_fasta_path_or_url : str
Path or URL of FASTA file containing transcript sequences
protein_fasta_path_or_url : str
Path or URL of FASTA file containing protein sequences
decompress_on_download : bool
If remote file is compressed, decompress the local copy?
copy_local_files_to_cache : bool
If genome data file is local use it directly or copy to cache first?
require_ensembl_ids : bool
Check gene/transcript/exon IDs to make sure they start with "ENS"
In [3]: genome = Genome(
    reference_name="GCA_000763955.1",
    annotation_name="ensembl",
    annotation_version=28,
    gtf_path_or_url="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-28/gtf/bacteria_86_collection/_clostridium_innocuum/_clostridium_innocuum.GCA_000763955.1.28.gtf.gz",
    transcript_fasta_path_or_url="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-28/fasta/bacteria_86_collection/_clostridium_innocuum/cdna/_clostridium_innocuum.GCA_000763955.1.28.cdna.all.fa.gz",
    protein_fasta_path_or_url="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-28/fasta/bacteria_86_collection/_clostridium_innocuum/pep/_clostridium_innocuum.GCA_000763955.1.28.pep.all.fa.gz")
In [4]: genome.download()
In [5]: genome.index()
In [6]: genome.transcripts()
Out[6]:
[Transcript(id=CIAN88_00035, name=CIAN88_00035, gene_id=CIAN88_00035, gene_name=CIAN88_00035, biotype=tRNA, location=CONTIG1:6377-6463),
Transcript(id=CIAN88_00265, name=CIAN88_00265, gene_id=CIAN88_00265, gene_name=CIAN88_00265, biotype=tRNA, location=CONTIG1:52499-52572),
Transcript(id=CIAN88_08055, name=CIAN88_08055, gene_id=CIAN88_08055, gene_name=CIAN88_08055, biotype=ncRNA, location=CONTIG35:100340-100697),
Transcript(id=CIAN88_09310, name=CIAN88_09310, gene_id=CIAN88_09310, gene_name=CIAN88_09310, biotype=tRNA, location=CONTIG40:19324-19411),
Transcript(id=CIAN88_09325, name=CIAN88_09325, gene_id=CIAN88_09325, gene_name=CIAN88_09325, biotype=tRNA, location=CONTIG40:21994-22069),
Transcript(id=CIAN88_09705, name=CIAN88_09705, gene_id=CIAN88_09705, gene_name=CIAN88_09705, biotype=tRNA, location=CONTIG41:13257-13333),
...
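The three URLs passed to Genome in this session share one directory layout, so they can be assembled from a handful of fields. Here is a minimal sketch of that, assuming the layout seen in this single example holds for the genome you want (it may not for every division or release). The helper name ensembl_genomes_urls and its parameters are hypothetical and not part of pyensembl.

```python
def ensembl_genomes_urls(division, release, collection, species, assembly):
    """Build GTF, cDNA and protein FASTA URLs for one genome.

    Hypothetical helper: the path layout is inferred from the single
    Clostridium innocuum example in this thread, not from any official
    description of the FTP site.
    """
    base = f"ftp://ftp.ensemblgenomes.org/pub/{division}/release-{release}"
    # File names look like <species>.<assembly>.<release>.<suffix>
    prefix = f"{species}.{assembly}.{release}"
    fasta_dir = f"{base}/fasta/{collection}/{species}"
    # Keys mirror the Genome keyword names shown in the docstring above.
    return {
        "gtf_path_or_url": f"{base}/gtf/{collection}/{species}/{prefix}.gtf.gz",
        "transcript_fasta_path_or_url": f"{fasta_dir}/cdna/{prefix}.cdna.all.fa.gz",
        "protein_fasta_path_or_url": f"{fasta_dir}/pep/{prefix}.pep.all.fa.gz",
    }


urls = ensembl_genomes_urls(
    division="bacteria",
    release=28,
    collection="bacteria_86_collection",
    species="_clostridium_innocuum",
    assembly="GCA_000763955.1",
)
for keyword, url in urls.items():
    print(keyword, url)
```

If your pyensembl version uses the keyword names from the docstring above, the dictionary could be unpacked straight into the constructor alongside reference_name, annotation_name and annotation_version, e.g. Genome(..., **urls). Other versions may name these parameters differently, so check the docstring first.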
@tavinathanson, do you have ideas about how we can expose these alternative data sources more cleanly?
First, thanks for your help. This is basically what I ended up trying but through the command line interface. It seemed that things "built" but I haven't gone much further.
This brings up another question I had. I don't see the genome sequence itself encoded anywhere, only the gene products (transcripts/peptides). Is pyensembl more focused on genes than genomes?
Our use case so far has been supporting variant effect annotation for human (and mouse) genomes, so we've only needed transcripts and peptides. I've thought about adding full genome sequences for a while, but it implies a lot of possible changes, which I thought should be guided by a concrete use case (e.g. possibly adding Chromosome or Intron objects).
It would be instructive for us to see how you'd like to (ideally) use PyEnsembl and where full genome sequences fit into that.
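To make the Intron idea a bit more concrete, here is a rough sketch of how intron coordinates could be derived from a transcript's exon coordinates. It is purely illustrative and not an existing PyEnsembl API: the Interval type, the introns_from_exons function, and the coordinates are all made up for this example, and it assumes 1-based inclusive coordinates on a single contig.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """Hypothetical coordinate record (not a PyEnsembl class)."""
    contig: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons, key=lambda exon: exon.start)
    introns = []
    for left, right in zip(ordered, ordered[1:]):
        # Only emit an intron when there is at least one base between exons.
        if right.start > left.end + 1:
            introns.append(Interval(left.contig, left.end + 1, right.start - 1))
    return introns


# Made-up exon coordinates, for illustration only.
exons = [
    Interval("CONTIG1", 301, 400),
    Interval("CONTIG1", 100, 200),
    Interval("CONTIG1", 500, 650),
]
introns = introns_from_exons(exons)
for intron in introns:
    print(intron)
```

Fetching the bases for such an interval would additionally need the genome FASTA, which is the part that is not supported yet.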
Is there a way to use pyensembl for some of the other genome databases built with Ensembl's code? Examples are all the sites hosted under http://ensemblgenomes.org/.
Also, https://www.vectorbase.org uses the Ensembl code but is focused specifically on vectors of disease.
I would VERY much like to use pyensembl. It SEEMS like this would/SHOULD be as simple as pointing to a non-hardcoded URL? However, I have not been able to find documentation about this.
Any thoughts?
Gus