pepkit / pephub

A web API and database for biological sample metadata
https://pephub.databio.org
BSD 2-Clause "Simplified" License

Automate and streamline GEO archive #316

Open nsheff opened 5 months ago

nsheff commented 5 months ago

We produced a tarball of all the GEO PEPs in January 2024. This is useful, but it quickly becomes outdated as new data is added to GEO.

We're automatically pulling that new data into PEPhub via https://github.com/pepkit/geopephub

We should automate the process of producing the "All GEO PEPs" downloadable tarball, and have an archive of these. TODO:

  1. [ ] Set up a schedule (maybe in geopephub?) that produces this tar archive. Maybe it should be quarterly? (See the cron sketch after this list.)
  2. [ ] Push the tar archive to S3 somewhere.
  3. [ ] Create a dedicated page on PEPhub that explains the dataset and links to the downloads.
  4. [ ] The PEPhub page should automatically update when a new quarterly release happens.
  5. [x] Get rid of the 'pephub' and 'pephub_geo.tar' clutter in /project/shefflab/processed -- these should be managed consistently so they just get pushed to the right place, maybe stored in the deployment folder or something.
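
For item 1, a quarterly schedule could be as simple as a cron entry. The `geopephub archive-geo` subcommand below is hypothetical, since the archiving command doesn't exist yet:

```cron
# Run at 03:00 on the 1st of Jan/Apr/Jul/Oct (i.e., quarterly)
0 3 1 1,4,7,10 * geopephub archive-geo --upload
```
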
sanghoonio commented 2 months ago

Goal: add a history popup or page that lets users download the periodic GEO archives.

sanghoonio commented 2 months ago

@khoroshevskyi any chance you could work on the backend for this issue within this week? Could you make an endpoint that returns a list or dictionary of all the tar archive timestamps and S3 download links?
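
For illustration, a minimal sketch of what such an endpoint might look like (PEPhub is a FastAPI app, but the route, model, and example values here are assumptions, not the actual PEPhub schema):

```python
from datetime import datetime

from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class GeoArchive(BaseModel):
    # One entry per archiving run: when it was made and where to get it.
    timestamp: datetime
    download_url: str
    size_bytes: int


@router.get("/api/v1/geo/archives", response_model=list[GeoArchive])
def list_geo_archives() -> list[GeoArchive]:
    # In practice this would read from a table that each archiving run
    # inserts into; hardcoded here just to show the response shape.
    return [
        GeoArchive(
            timestamp=datetime(2024, 1, 15),
            download_url="https://s3.example.com/pephub_geo_2024Q1.tar",
            size_bytes=12_345_678,
        )
    ]
```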

khoroshevskyi commented 2 months ago

There is already a function in https://github.com/pepkit/geopephub that will download all GEO projects. But there are two drawbacks:

  1. Every project is saved to the same folder. We should eliminate that and add subfolders (the same way it's done in bbcache).
  2. We can't do it on the backend; it's too much computation. We should run it on e.g. AWS Lambda and then upload to an S3 bucket.
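
As a sketch of the subfolder idea in point 1 (I don't know bbcache's exact layout, so the two-character hash fan-out below is just one plausible scheme):

```python
import hashlib
from pathlib import Path


def project_path(base: Path, registry_path: str) -> Path:
    # Spread files across 256 subfolders keyed on the first two hex
    # digits of the md5 of the registry path, so no single directory
    # ends up holding hundreds of thousands of files.
    prefix = hashlib.md5(registry_path.encode()).hexdigest()[:2]
    return base / prefix / (registry_path.replace("/", "__") + ".yaml")


# project_path(Path("geo_pep_repo"), "geo/GSE123456:default")
# -> geo_pep_repo/<xx>/geo__GSE123456:default.yaml
```
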

I am a bit busy, I could probably work on this issue in a few weeks

nsheff commented 2 months ago

> There is already a function in https://github.com/pepkit/geopephub that will download all GEO projects.

We don't want to actually download all GEO projects. We just want to extract the ones that are in PEPhub. Shouldn't this be a task for pepdbagent rather than geopephub?

> We can't do it on the backend; it's too much computation. We should run it on e.g. AWS Lambda and then upload to an S3 bucket.

Can you be more specific about the computation cost? How much time does it take? Does it require a lot of memory? Could it be done within the 15-minute limit for a GitHub Action?

khoroshevskyi commented 2 months ago

It requires around 1.5 hours (on Rivanna) and around 2 GB of memory. I am using geopephub to run it from the CLI, and underneath I use pepdbagent.

What we can do is download new uploads and updates to some repository, store all GEO PEPs there, and then create one zip file. In other words, instead of downloading all GEO projects from PEPhub every time, most of them will already be stored in a local repository.
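
A sketch of that incremental flow; `fetch_updated_since` is a hypothetical stand-in for whatever pepdbagent query returns the GEO projects modified after a cutoff:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

REPO = Path("geo_pep_repo")
META = REPO / "last_run.json"


def sync_updated_peps(fetch_updated_since) -> int:
    """Refresh only the PEPs changed since the last run."""
    REPO.mkdir(exist_ok=True)
    cutoff = None
    if META.exists():
        # Cutoff recorded by the previous run.
        cutoff = datetime.fromisoformat(json.loads(META.read_text())["last_run"])
    n = 0
    for registry_path, yaml_text in fetch_updated_since(cutoff):
        # e.g. "geo/GSE123456:default" -> "geo__GSE123456:default.yaml"
        out = REPO / (registry_path.replace("/", "__") + ".yaml")
        out.write_text(yaml_text)
        n += 1
    META.write_text(
        json.dumps({"last_run": datetime.now(timezone.utc).isoformat()})
    )
    return n
```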

nsheff commented 2 months ago

> I am using geopephub to run it from the CLI, and underneath I use pepdbagent.

So just to be clear: You are not downloading them from the PEPhub API. You are not downloading them from GEO. You are retrieving them from the PEP database. Right?

> Instead of downloading all GEO projects from PEPhub every time, most of them will already be stored in a local repository.

Yes, this is what we should do. That means probably not AWS Lambda or a GitHub Action; probably Rivanna.

Can you:

  1. Create the repository of populated PEPs as a brick in the brickyard.
  2. Write a simple shell or CLI command (a rough sketch follows at the end of this comment) that will:
    - check for updates
    - download and update any new ones
    - create a tar archive
    - upload to B2
    - insert information about it into the database

It should take as a parameter:

It should store information about the run in a metadata file (a .json file) stored in the repository, so each time it runs, it knows when the last run was.

If you can create this script, I can handle setting up a process on Rivanna that will run it every quarter or so.
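
To make this concrete, a rough sketch of the archive-and-upload step (the bucket name, endpoint, and database-recording step are all assumptions; B2 exposes an S3-compatible API, so boto3 works against it):

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path

import boto3  # Backblaze B2 exposes an S3-compatible endpoint

REPO = Path("geo_pep_repo")


def archive_and_upload(bucket: str, endpoint_url: str) -> str:
    # Tar up the whole local PEP repository under a dated name.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    tar_name = f"pephub_geo_{stamp}.tar"
    with tarfile.open(tar_name, "w") as tar:
        tar.add(REPO, arcname="geo_peps")
    # Push the archive to the bucket via the S3-compatible API;
    # credentials come from the usual AWS env vars / config files.
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    s3.upload_file(tar_name, bucket, tar_name)
    # Recording (timestamp, key, size) in the database, so the PEPhub
    # archives page can list the release, would happen here.
    return tar_name
```
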

khoroshevskyi commented 2 months ago

> So just to be clear: You are not downloading them from the PEPhub API. You are not downloading them from GEO. You are retrieving them from the PEP database. Right?

Yes, you are correct: I am NOT downloading them from GEO; I use PEPhub.

Sounds good; it will be quite easy to do.