pepkit / pephub

A web API and database for biological sample metadata
https://pephub.databio.org
BSD 2-Clause "Simplified" License

Automate and streamline GEO archive #316

Open nsheff opened 3 weeks ago

nsheff commented 3 weeks ago

We produced a tarball of all the GEO PEPs in January 2024. This is useful, but it goes out of date quickly as new data is added to GEO.

We're automatically pulling that new data into PEPhub via https://github.com/pepkit/geopephub

We should automate the process of producing the "All GEO PEPs" downloadable tarball, and keep an archive of past releases. TODO:

  1. [ ] Set up a schedule (maybe in geopephub?) that produces this tar archive. Maybe it should be quarterly?
  2. [ ] Push the tar archive to S3 somewhere.
  3. [ ] Create a dedicated page on PEPhub that explains the dataset and links to download instructions.
  4. [ ] The PEPhub page should update automatically when a new quarterly release happens.
  5. [x] Get rid of the 'pephub' and 'pephub_geo.tar' clutter in /project/shefflab/processed -- these should be managed in a consistent way so they just get pushed to the right place, maybe stored in the deployment folder or something.
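
As a starting point for items 1 and 4, here is a minimal sketch of what the scheduled job could do, using only the Python standard library. It is not based on existing geopephub code: the `geo_peps_<quarter>.tar.gz` naming scheme, the `releases.json` manifest, and all function names are assumptions made for illustration. The S3 upload from item 2 is left out on purpose.

```python
import hashlib
import json
import tarfile
from datetime import date
from pathlib import Path


def quarter_label(day: date) -> str:
    """Return a label such as '2024-Q1' for the quarter containing `day`."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def build_archive(source_dir: Path, out_dir: Path, day: date) -> Path:
    """Pack `source_dir` into a gzipped tarball named after the quarter.

    The file name pattern is hypothetical, not an existing convention.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"geo_peps_{quarter_label(day)}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the source directory's own name.
        tar.add(source_dir, arcname=source_dir.name)
    return archive


def sha256sum(path: Path) -> str:
    """Hash the file in 1 MiB chunks so large archives are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_manifest(manifest: Path, archive: Path, day: date) -> dict:
    """Add or overwrite this quarter's entry in a JSON manifest (assumed format)."""
    releases = json.loads(manifest.read_text()) if manifest.exists() else {}
    releases[quarter_label(day)] = {
        "file": archive.name,
        "bytes": archive.stat().st_size,
        "sha256": sha256sum(archive),
    }
    manifest.write_text(json.dumps(releases, indent=2, sort_keys=True))
    return releases
```

If the manifest were pushed next to the tarballs, the PEPhub page could read it to list the available releases, so a new quarterly release would not need a manual page edit.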