standage closed this pull request 5 years ago
Thanks. This looks reasonable.
Since previous versions of the data aren't easily accessible, I had to settle for some basic sanity checks on the existing recipes. Everything looks in order though, and based on experience my best guess is that only minor changes have been made in the last couple of years.
:exclamation: No coverage uploaded for pull request base (master@52798d3). Click here to learn what that means. The diff coverage is 100%.
@@ Coverage Diff @@
## master #94 +/- ##
=======================================
Coverage ? 100%
=======================================
Files ? 17
Lines ? 1915
Branches ? 197
=======================================
Hits ? 1915
Misses ? 0
Partials ? 0
Impacted Files | Coverage Δ | |
---|---|---|
genhub/hymbase.py | 100% <100%> (ø) | |
genhub/cdhit.py | 100% <100%> (ø) | |
genhub/exons.py | 100% <100%> (ø) | |
genhub/am10.py | 100% <100%> (ø) | |
genhub/crg.py | 100% <100%> (ø) | |
genhub/refseq.py | 100% <100%> (ø) | |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 52798d3...6ff546e. Read the comment docs.
It appears that when a RefSeq assembly is replaced with a newer version, the older assembly and its latest corresponding annotation are migrated to the `all_assembly_versions/suppressed/` directory on the RefSeq FTP site. This pull request updates GenHub's RefSeq module to support downloading these "obsolete" reference genomes.
In the future, when new genome versions are posted, the default action will be to update the existing recipe to point to the `suppressed` directory, and add a new recipe for the new version.
This PR was motivated by #92, but with no straightforward way to modify that PR, I have started a new PR with many of the same changes.
In addition to marking existing recipes as "suppressed", I anticipate many will need their annotation checksums updated. See note below for more details.
Remaining tasks
Pbar (added support for `guide_RNA` features in fca221e)
Note: In the few years I was monitoring insect genomes on RefSeq closely, I noted that the GFF files would be routinely and silently updated, without any official announcement or record of previous versions. Most of these changes were superficial, such as a slight change to the formatting of certain attributes in the 9th column. Some updates actually changed gene models, though. The naïve checksum approach I originally implemented to monitor the status of RefSeq builds does not distinguish between inconsequential changes and larger changes that require close examination to make sure GenHub handles everything correctly. Doing an automated schema check as suggested in #80 is probably a more sustainable path for future maintenance, but has not yet been implemented.
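To make the limitation of the naïve checksum concrete, here is a minimal sketch contrasting a whole-file digest with a digest computed only over the first eight GFF columns. It is neither the schema check proposed in #80 nor anything GenHub implements; the function names and the toy records are hypothetical, and ignoring column 9 entirely is a simplifying assumption that would also hide meaningful attribute changes such as altered `Parent` relationships.

```python
import hashlib


def naive_checksum(gff_text):
    """Digest of the entire file: any byte-level change alters it."""
    return hashlib.sha1(gff_text.encode('utf-8')).hexdigest()


def structural_checksum(gff_text):
    """Digest of columns 1-8 only, ignoring comments and attributes.

    Records are sorted first, so reordering features does not change
    the result either.
    """
    records = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines, pragmas, and comments
        fields = line.split('\t')
        if len(fields) != 9:
            continue  # skip anything that is not a feature record
        records.append('\t'.join(fields[:8]))
    digest = hashlib.sha1()
    for record in sorted(records):
        digest.update(record.encode('utf-8') + b'\n')
    return digest.hexdigest()


# Toy records: `reformatted` differs only in column 9, `remodeled` moves
# the gene's end coordinate.
original = 'chr1\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=abc\n'
reformatted = 'chr1\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=ABC\n'
remodeled = 'chr1\tRefSeq\tgene\t100\t950\t.\t+\t.\tID=gene1;Name=abc\n'

print(naive_checksum(original) == naive_checksum(reformatted))
print(structural_checksum(original) == structural_checksum(reformatted))
print(structural_checksum(original) == structural_checksum(remodeled))
```

A mismatch in the whole-file digest alongside a match in the structural digest would suggest a superficial update, while a structural mismatch would flag the file for closer examination.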