standage / genhub

Explore eukaryotic genome composition and organization with iLoci
BSD 3-Clause "New" or "Revised" License

Add support for obsolete RefSeq reference genomes #94

Closed standage closed 5 years ago

standage commented 5 years ago

It appears that when a RefSeq assembly is replaced with a newer version, the older assembly and its latest corresponding annotation are moved to the all_assembly_versions/suppressed/ directory on the RefSeq FTP site. This pull request updates GenHub's RefSeq module to support downloading these "obsolete" reference genomes.

In the future, when a new genome version is posted, the default action will be to update the existing recipe to point to the suppressed directory and to add a new recipe for the new version.
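As a rough illustration of the recipe change described above, here is a minimal Python sketch of how a "suppressed" flag might alter the download location. The function name, the recipe keys (`branch`, `species`, `accession`, `build`, `suppressed`), and the placeholder values are all hypothetical and are not GenHub's actual recipe schema or API. The base URL and directory layout are my assumption about the RefSeq FTP site; only the all_assembly_versions/suppressed/ component comes from the description above.

```python
# Assumed base location of RefSeq genomes on the NCBI FTP site.
REFSEQ_BASE = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq'


def assembly_dir(recipe):
    """Build the directory URL for an assembly from a (hypothetical) recipe.

    Recipes flagged as suppressed point into the suppressed/ subdirectory
    of all_assembly_versions/; all others point directly into
    all_assembly_versions/.
    """
    parts = [
        REFSEQ_BASE,
        recipe['branch'],
        recipe['species'].replace(' ', '_'),
        'all_assembly_versions',
    ]
    if recipe.get('suppressed', False):
        parts.append('suppressed')
    # Assumed naming convention: <accession>_<build>
    parts.append('{}_{}'.format(recipe['accession'], recipe['build']))
    return '/'.join(parts)


# Placeholder values for illustration only; not a real assembly.
old_recipe = {
    'branch': 'invertebrate',
    'species': 'Genus species',
    'accession': 'GCF_000000000.1',
    'build': 'Example_1.0',
    'suppressed': True,
}
print(assembly_dir(old_recipe))
```

Under this scheme, retiring a genome version would amount to flipping one flag in the old recipe, while the new version gets its own recipe without the flag.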

This PR was motivated by #92, but since there is no straightforward way to modify that PR, I have started a new one with many of the same changes.

Besides being marked as "suppressed", many existing recipes will likely need their annotation checksums updated. See the note below for more details.

Remaining tasks


Note: In the few years I was closely monitoring insect genomes on RefSeq, I noticed that the GFF files were routinely and silently updated, without any official announcement or record of previous versions. Most of these changes were superficial, such as a slight change to the formatting of certain attributes in the 9th column. Some updates actually changed gene models, though. The naïve checksum approach I originally implemented to monitor the status of RefSeq builds does not distinguish between inconsequential changes and larger changes that require close examination to make sure GenHub handles everything correctly. An automated schema check, as suggested in #80, is probably a more sustainable path for future maintenance, but it has not yet been implemented.
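To make the distinction between superficial and substantive changes concrete, here is a minimal Python sketch of a coordinate-only digest. This is neither the schema check proposed in #80 nor anything GenHub implements; it is a different, simpler technique that hashes only the seqid, type, start, end, and strand of selected features, so changes confined to the 9th column do not alter the result. The function name and the default feature types are my own choices for illustration.

```python
import hashlib


def structural_digest(gff3_lines, types=('gene', 'mRNA', 'exon', 'CDS')):
    """Hash feature coordinates while ignoring source, score, phase,
    and the attributes in the 9th column."""
    records = []
    for line in gff3_lines:
        # Skip comments, directives, and blank lines
        if line.startswith('#') or not line.strip():
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 9:
            continue  # not a well-formed feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, _attrs = fields
        if ftype in types:
            records.append((seqid, ftype, int(start), int(end), strand))
    digest = hashlib.sha256()
    # Sort so that a mere reordering of lines does not change the digest
    for record in sorted(records):
        digest.update(repr(record).encode('utf-8'))
    return digest.hexdigest()
```

Two annotation files that differ only in attribute formatting would yield the same digest, whereas a shifted exon boundary would not. Because attributes are ignored entirely, this sketch would also miss changes to Parent relationships or gene names, so it complements a whole-file checksum rather than replacing it.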

vpbrendel commented 5 years ago

Thanks. This looks reasonable.

standage commented 5 years ago

Since previous versions of the data aren't easily accessible, I had to settle for some basic sanity checks on the existing recipes. Everything looks in order, though, and based on experience, my best guess is that only minor changes have been made in the last couple of years.

codecov-io commented 5 years ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@52798d3). The diff coverage is 100%.

@@           Coverage Diff           @@
##             master    #94   +/-   ##
=======================================
  Coverage          ?   100%           
=======================================
  Files             ?     17           
  Lines             ?   1915           
  Branches          ?    197           
=======================================
  Hits              ?   1915           
  Misses            ?      0           
  Partials          ?      0

| Impacted Files | Coverage Δ |
| --- | --- |
| genhub/hymbase.py | 100% <100%> (ø) |
| genhub/cdhit.py | 100% <100%> (ø) |
| genhub/exons.py | 100% <100%> (ø) |
| genhub/am10.py | 100% <100%> (ø) |
| genhub/crg.py | 100% <100%> (ø) |
| genhub/refseq.py | 100% <100%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 52798d3...6ff546e.