Closed tavinathanson closed 9 years ago
Since this view doesn't show what's changed with Genome
vs. EnsemblRelease
, here's a diff of old EnsemblRelease
vs. new Genome
: https://gist.github.com/tavinathanson/eb786789a022153d9470
@iskandr Another thought: I'm thinking a better API would probably involve saving a collection of URLs/paths as a new "release" in the DB, rather than always requiring each path to be specified. This would also more easily allow topiary to refer to that collection of paths.
For example:
pyensembl add-genome --name "mouse_81" --gtf-path <URL> --transcript-fasta-path <URL>
and then
pyensembl install "mouse_81"
topiary --pyensembl-genome "mouse_81"
mouse_81
could be saved to the database, mapped to those paths.
I think I'll save that for a follow-up PR. Thoughts?
P.S. That PR would be a better place, I think, to address the naming situation that isn't currently all that smart. Namely, two local GTF files with the same file name wouldn't be able to co-exist as different DBs, since the DB is just based on the GTF filename. I'll create issues for all the above unless you think it needs to be addressed in this PR.
OK, here's a thought:
"genome" = "genome annotation" (GTF) + "genome sequence" (FASTA files)
So, we may eventually want to add genome sequences separately from annotations.
I like your idea below:
pyensembl add-genome --name "mouse_81" --gtf-path <URL> --transcript-fasta-path <URL>
but might want to even extend it to:
pyensembl add-genome-sequence "mouse" --transcript-fasta-path <URL>
pyensembl add-genome --name "mouse_81" --gtf-path <URL> --genomoe-sequence "mouse"
Agreed that we can figure this out in a later PR.
"genome" = "genome annotation" (GTF) + "genome sequence" (FASTA files) sounds reasonable for talking about these entities for the time being, even though as we've discussed "genome" is still confusing.
Made https://github.com/hammerlab/pyensembl/issues/100 for add-genome
TODO before merging, from offline discussion:
SequenceData
and GTF
fasta_path
from GenomeSource
path
to path_or_url
in most placeslocal_fasta_filename_func
to EnsemblReleaseSource
, and local/remote filename logic to GenomeSource
I believe all code review comments are now addressed.
@iskandr I ended up changing the role of GenomeSource
: it now represents a single URL or path. It continues to return install messages, but now does other useful things: it handles the original/cached filename transform (like we discussed), and it also handles all the is_url_format
and local copying logic. It's purely internal, now; I added the GTF/FASTA arguments to Genome
itself.
It's not perfect, but I think/hope it's moving in the right direction. I won't be able to address further comments until next week. If you think it's ready to merge, feel free!
I'm surprised that a Genome
now has multiple GenomeSource
objects for each GTF and FASTA file. I'm also surprised that EnsemblReleaseSource
survived despite that change in the role of source objects. Still, since you're gone for a week and kept the API backward compatible I'd rather merge this now and then discuss this zoo of objects when you get back.
I tried to run the unit tests and got the following error:
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/Users/iskander/code/pyensembl/test/test_ucsc_gtf.py", line 38, in test_ucsc_refseq
eq_(len(genome.genes()), 2)
File "/Users/iskander/code/pyensembl/pyensembl/common.py", line 56, in wrapped_fn
value = fn(*args, **kwargs)
File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 420, in genes
gene_ids = self.gene_ids(contig=contig, strand=strand)
File "/Users/iskander/code/pyensembl/pyensembl/common.py", line 56, in wrapped_fn
value = fn(*args, **kwargs)
File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 566, in gene_ids
strand=strand)
File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 218, in all_feature_values
return cached_object(pickle_path, compute_fn=run_query)
File "/Users/iskander/code/pyensembl/pyensembl/compute_cache.py", line 120, in cached_object
obj = compute_fn()
File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 212, in run_query
strand=strand)
File "/Users/iskander/code/pyensembl/pyensembl/common.py", line 54, in wrapped_fn
return cache[cache_key]
File "/Users/iskander/code/pyensembl/pyensembl/database.py", line 61, in __hash__
return hash(self.gtf)
File "/Users/iskander/code/pyensembl/pyensembl/gtf.py", line 57, in __hash__
return hash(self.gtf_source)
nose.proxy.TypeError: unhashable type: 'GenomeSource'
Running the unit tests with Python 2.7 I get different errors:
ERROR: test_all_gene_names : Make sure some known gene names such as
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nose-1.3.6-py2.7.egg/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/Users/iskander/code/pyensembl/test/test_common.py", line 35, in new_test_fn
test_fn(ensembl)
File "/Users/iskander/code/pyensembl/test/test_gene_names.py", line 23, in test_all_gene_names
gene_names = ensembl.gene_names()
File "/Users/iskander/code/pyensembl/pyensembl/common.py", line 56, in wrapped_fn
value = fn(*args, **kwargs)
File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 519, in gene_names
strand=strand)
File "/Users/iskander/code/pyensembl/pyensembl/genome.py", line 218, in all_feature_values
return cached_object(pickle_path, compute_fn=run_query)
File "/Users/iskander/code/pyensembl/pyensembl/compute_cache.py", line 118, in cached_object
obj = pickle.load(f)
ValueError: unsupported pickle protocol: 4
Problems fixed, though it seems like the pickling issues may require nuking all .pickle
files in the cache.
Summary:
Genome
class replacesEnsemblRelease
class.GenomeSource
class specifies URLs to GTFs and FASTAs, extracted into its own class to prevent argument overload inGenome
and to create an easy object to pass around for generating python/console install strings (e.g.pyensembl install --release 77
andpyensembl install --gtf_url blah
).EnsemblReleaseSource
extendsGenomeSource
to handle Ensembl URL creation and install strings.GenomeSource
accepts both local file paths and remote URLs. If the latter, datacache is bypassed.gene.py
,transcript.py
,gtf_parsing.py
andgenome.py
to allow for missing gene names, transcript names and biotypes in GTF files. (I noticed that this can happen with UCSC GTF files from genome.ucsc.edu/cgi-bin/hgTables, such as the one included in this PR as a test.) See this commit: https://github.com/hammerlab/pyensembl/commit/a2321156ccfb7e682750e8fc555e5cb61e7e9bf0shell.py
to allow for specific URLs in addition to release numbers.This needs more testing, but I want to get the PR started in case you think the overall strategy is problematic.
Also TODO, either in this PR or a follow-up: