openvax / gtfparse

Parsing tools for GTF (gene transfer format) files
Apache License 2.0
108 stars 25 forks source link

First PR #1

Closed iskandr closed 8 years ago

iskandr commented 8 years ago

Copied from PyEnsembl's gtf_parsing module, split into testable components, slightly generalized and added tests.

Review on Reviewable

iskandr commented 8 years ago

Fixes https://github.com/hammerlab/pyensembl/issues/120

tavinathanson commented 8 years ago

Looks good overall. A few minor comments/questions. It doesn't really feel like a standalone library yet, since many comments refer to PyEnsembl. If we're going to stick to talking about Ensembl a lot, that should probably be confined to one area/class of the code.

Also, I didn't do a thorough job reviewing the actual parsing code (though I tried to scan it to some degree), partially because a lot of it looks mostly copy pasted. Let me know if there are any particular areas you think I should review. I also find GTF parsing hard to review in general since it's so (necessarily) messy/weird.


Reviewed 20 of 20 files at r1, 1 of 1 files at r2. Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


gtftools/line_parsing.py, line 109 [r2] (raw file): This seems oddly PyEnsembl-specific.


gtftools/read_gtf.py, line 132 [r2] (raw file): Deleting columns while iterating over columns?


gtftools/required_columns.py, line 24 [r1] (raw file): Maybe remove Ensembl-specific comments for parts that don't need to be Ensembl-specific? (Edit: this would be a pretty large change for another PR, but curious to get your thoughts)


gtftools/util.py, line 24 [r1] (raw file): Lame http://lists.apple.com/archives/darwin-kernel/2009/Mar/msg00005.html


README.md, line 3 [r1] (raw file): gtftools feels like a weird name to me. mhctools makes sense, since it's a library that wraps "tools". If this is a parsing library that isn't wrapping any "tools", what about gtfparse or something like that?


test/test_parse_gtf_lines.py, line 8 [r2] (raw file): Load from a data file?


test/test_read_stringtie_gtf.py, line 5 [r1] (raw file): Can we throw in RefSeq GTF testing, since it should nominally work (and I have a test in PyEnsembl)?


Comments from the review on Reviewable.io

iskandr commented 8 years ago

Sadly, can't get to the Reviewable page (says browser can't reach github address).

iskandr commented 8 years ago

Review status: 5 of 24 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


gtfparse/line_parsing.py, line 109 [r2] (raw file): I think this will come up as an issue for


gtfparse/read_gtf.py, line 135 [r2] (raw file): Fixed


gtfparse/required_columns.py, line 26 [r1] (raw file): I think that would be a much larger change (since this library is still, in truth, very much crafted around the peculiarities of parsing Ensembl files)


gtfparse/util.py, line 25 [r1] (raw file): Agreed!


README.md, line 3 [r1] (raw file): Done (with all the Reviewable antics that followed)


test/test_parse_gtf_lines.py, line 8 [r2] (raw file): I like seeing the data in the test


test/test_read_stringtie_gtf.py, line 5 [r1] (raw file): Done


Comments from the review on Reviewable.io

tavinathanson commented 8 years ago

Reviewed 9 of 23 files at r3, 10 of 10 files at r5. Review status: 23 of 24 files reviewed at latest revision, 8 unresolved discussions, some commit checks pending.


gtfparse/create_missing_features.py, line 64 [r5] (raw file): Not an error?


gtfparse/create_missing_features.py, line 73 [r5] (raw file): Why an OrderedDict?


gtfparse/create_missing_features.py, line 77 [r5] (raw file): Extra newline


gtfparse/create_missing_features.py, line 79 [r5] (raw file): typo: required


gtfparse/create_missing_features.py, line 109 [r5] (raw file): Typo: feature


gtfparse/create_missing_features.py, line 116 [r5] (raw file): Do you actually want to be printing here?


Comments from the review on Reviewable.io

tavinathanson commented 8 years ago

Review status: 23 of 24 files reviewed at latest revision, 8 unresolved discussions, some commit checks pending.


Comments from the review on Reviewable.io

tavinathanson commented 8 years ago

Review status: 23 of 24 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


Comments from the review on Reviewable.io

iskandr commented 8 years ago

Review status: 23 of 24 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


gtfparse/create_missing_features.py, line 64 [r5] (raw file): This is maybe illogical, originally this was a helper function for read_gtf which checked that these features are all actually missing. I'm going to change this to just an info log + continue since PyEnsembl calls this with all parsed DataFrames


gtfparse/create_missing_features.py, line 73 [r5] (raw file): So that the DataFrame has columns in the same order as the GTF


gtfparse/create_missing_features.py, line 116 [r5] (raw file): Nope, thanks for catching that


Comments from the review on Reviewable.io

tavinathanson commented 8 years ago

Review status: 22 of 24 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


gtfparse/create_missing_features.py, line 64 [r5] (raw file): Ah, didn't realize it was always called


Comments from the review on Reviewable.io