Generalized ReferenceTranscripts into SequenceData, added protein sequences

iskandr commented 9 years ago

Replaced ReferenceTranscripts with a more general SequenceData class that gives us access to both transcript (cDNA) & protein (amino acid) sequences
Reduced the amount of state we have to pass into Gene & Transcript objects by giving them EnsemblRelease objects directly. This creates an implicit circular dependency (since Gene and Transcript are created by EnsemblRelease) but greatly simplifies access to both the annotation DB and sequence DB from within genes & transcripts.
Protein & cDNA FASTA paths generated by the same helper: url_templates.fasta_url.
Transcripts now have properties protein_id and protein_sequence.
Fixed: some unit tests weren't running since they weren't calling test_ensembl_releases()

tavinathanson commented 9 years ago

Comments from the review on Reviewable.io

Reviewed files:

pyensembl/init.py @ r1
pyensembl/common.py @ r1
pyensembl/database.py @ r2
pyensembl/ensembl_release.py @ r2
pyensembl/exon.py @ r1
pyensembl/gene.py @ r1
pyensembl/gtf.py @ r1
pyensembl/locus.py @ r1
pyensembl/reference_transcripts.py @ r1
pyensembl/release_info.py @ r1
pyensembl/search.py @ r2
pyensembl/sequence_data.py @ r1
pyensembl/transcript.py @ r2
pyensembl/url_templates.py @ r1
setup.py @ r3
test/data.py @ r2
test/test_common.py @ r2
test/test_dataframe_release75.py @ r1
test/test_download.py @ r1
test/test_ensembl_object_properties.py @ r1
test/test_gene_objects.py @ r2
test/test_id_length.py @ r2
test/test_reference_transcripts.py @ r1
test/test_search.py @ r2
test/test_transcript_objects.py @ r2
test/test_transcript_sequences.py @ r1

^{pyensembl/common.py, line 19 [r1] (raw file):} Weird that there isn't an obvious library for this!

^{pyensembl/database.py, line 401 [r3] (raw file):} Comment saying what required means?

^{pyensembl/ensembl_release.py, line 74 [r3] (raw file):} Minor, but to reduce complexity and increase readability, what about SequenceData doing the fasta_url import and call itself?

^{pyensembl/ensembl_release.py, line 208 [r3] (raw file):} How about making an is_valid_transcript_id/is_valid_protein_id in some common area? Seems generally useful.

^{pyensembl/gene.py, line 80 [r3] (raw file):} ensembl_release doesn't have __eq__?

^{pyensembl/sequence_data.py, line 1 [r3] (raw file):} I still think it'd be easier for reviews if renames are done separately. So, for example, one commit would use reference_transcripts (but use it as if it's sequence_data), and then another commit would literally just do the rename.

^{pyensembl/sequence_data.py, line 51 [r3] (raw file):} That's just awful.

I think this would be more readable if extracted out into self.local_filename = add_release_to_filename_if_not_there or something

^{pyensembl/sequence_data.py, line 94 [r3] (raw file):} So this is better than just self.local_filename == other.local_filename? Curious what your thoughts are there. I guess it depends on whether we view equality as a pre-lazy loading or post-lazy loading property?

^{pyensembl/sequence_data.py, line 153 [r3] (raw file):} True and False are no longer necessary (and the associated comment).

^{pyensembl/sequence_data.py, line 224 [r3] (raw file):} I think this can be extracted out into the same place as is_valid_protein_id, etc.

^{pyensembl/sequence_data.py, line 233 [r3] (raw file):} Maybe a bit confusing that clear_cache doesn't do anything with this in memory cache, even though they're entirely different?

^{pyensembl/transcript.py, line 78 [r3] (raw file):} Curious about what inspired you to do all this if/else/error re-ordering in multiple files.

^{pyensembl/transcript.py, line 177 [r3] (raw file):} Why is this a TODO, now that you have the decorator?

iskandr commented 9 years ago

Comments from the review on Reviewable.io

^{pyensembl/common.py, line 19 [r1] (raw file):} Yeah, I tried to find one and they all do too much (e.g. use __repr__ for cache keys, have timeouts and cache limits, serialize to disk, &c). I'd use lru_cache if it ever got back-ported from Python 3.x!

^{pyensembl/ensembl_release.py, line 74 [r3] (raw file):} So, pass species, server and sequence_type to SequenceData? I was considering that but the class seems simpler/less magical if it's just given a FASTA URL and expected to download/index that particular FASTA (all the Ensembl specific stuff stays in EnsemblRelease)

^{pyensembl/ensembl_release.py, line 208 [r3] (raw file):} Made require_human_transcript_id and require_human_protein_id in common

^{pyensembl/gene.py, line 80 [r3] (raw file):} Weird, I though it did! Just added one.

^{pyensembl/sequence_data.py, line 1 [r3] (raw file):} Well, it's a rename and a change of functionality. It wouldn't really make sense for ReferenceTranscripts to also be used for protein sequences. Not sure if there's a functional intermediate state we'd want to be in after a single smaller PR

^{pyensembl/sequence_data.py, line 94 [r3] (raw file):} I guess you might want to compare whether two SequenceData objects wrap the same sequences even if you haven't downloaded them yet. Changing this.

^{pyensembl/sequence_data.py, line 233 [r3] (raw file):} That's a bug!

^{pyensembl/transcript.py, line 78 [r3] (raw file):} I'd rather see the default (not-error) case first, makes the logic easier for me to follow

^{pyensembl/transcript.py, line 177 [r3] (raw file):} I added the decorator and forgot to come back, fixed

tavinathanson commented 9 years ago

Comments from the review on Reviewable.io

Reviewed files:

pyensembl/common.py @ r4
pyensembl/database.py @ r4
pyensembl/ensembl_release.py @ r4
pyensembl/exon.py @ r4
pyensembl/sequence_data.py @ r4
pyensembl/transcript.py @ r4
test/test_transcript_sequences.py @ r5

Merge away :+1:

^{pyensembl/ensembl_release.py, line 75 [r3] (raw file):} Well, my understanding is that it already has that information via getting passed ensembl_release, though that feels like abuse. In any event point about less magic in SequenceData seems valid! :+1:

^{pyensembl/sequence_data.py, line 1 [r3] (raw file):} Yup, that's why I said 1 commit (within a larger PR) vs. 1 PR. (That would let me diff that commit before it gets renamed.)

tavinathanson commented 9 years ago

Comments from the review on Reviewable.io

(Wait, I retract that. 1 second.)

tavinathanson commented 9 years ago

Comments from the review on Reviewable.io

Okay, merge away for real :)

iskandr commented 9 years ago

Comments from the review on Reviewable.io

^{pyensembl/sequence_data.py, line 1 [r3] (raw file):} Oh! Sorry, I misunderstood. Gotcha.

coveralls commented 7 years ago

Changes Unknown when pulling 4073c3b6d586551c247b253fe8009e7c79abc7e5 on protein-sequences into \ on master**.

openvax / pyensembl

Generalized ReferenceTranscripts into SequenceData, added protein sequences #70