Open · nsheff opened 5 months ago
Goal: have a history popup or page that allows users to download the periodic GEO archives.
@khoroshevskyi any chance you could work on the backend for this issue within this week? Can you make some kind of endpoint that spits out a list or dictionary of all the tar archive timestamps and S3 download links?
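A minimal sketch of what such an endpoint could return, assuming a FastAPI app and a boto3 S3 client (the route path, bucket name, and key prefix are all hypothetical):

```python
import boto3
from fastapi import FastAPI

app = FastAPI()
s3 = boto3.client("s3")

BUCKET = "pephub-geo-archives"  # hypothetical bucket name
PREFIX = "geo/"                 # hypothetical key prefix for the tarballs

@app.get("/geo/archives")
def list_geo_archives() -> list[dict]:
    """Return one entry per archived tarball: timestamp + S3 download link."""
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    return [
        {
            "timestamp": obj["LastModified"].isoformat(),
            "download_url": f"https://{BUCKET}.s3.amazonaws.com/{obj['Key']}",
        }
        for obj in response.get("Contents", [])
        if obj["Key"].endswith((".tar", ".tar.gz"))
    ]
```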
There is already a function in https://github.com/pepkit/geopephub that will download all GEO projects, but it has two drawbacks:

1) Every project is saved to the same folder. We should fix that and add subfolders (the same way it is done in bbcache; one possible layout is sketched below).
2) We can't do it on the backend; it's too much computation. We should run it on, e.g., AWS Lambda and then upload to an S3 bucket.
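On the subfolder point, one illustrative scheme (I'm guessing at bbcache's layout; GEO's own FTP site buckets accessions like GSE12345 under GSE12nnn/) would be:

```python
def subfolder_for(accession: str) -> str:
    """Map a GEO accession to a two-level path, e.g. GSE12345 -> GSE12nnn/GSE12345.

    Hypothetical scheme; swap in whatever bbcache actually does.
    """
    stem = accession[:-3] + "nnn"  # mask the last three digits
    return f"{stem}/{accession}"
```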
I am a bit busy; I could probably work on this issue in a few weeks.
> There is already a function in https://github.com/pepkit/geopephub that will download all GEO projects.
We don't want to actually download all GEO projects. We just want to extract the ones that are in PEPhub. Shouldn't this be a task for pepdbagent rather than geopephub?
> We can't do it on the backend; it's too much computation. We should run it on, e.g., AWS Lambda and then upload to an S3 bucket.
Can you be more specific about the computation cost? How much time does it take? Does it require a lot of memory? Could it be done within the 15-minute limit for a GitHub Action?
It requires around 1.5 hours (on Rivanna) and around 2 GB of memory. I am using geopephub to run it from the CLI, and underneath it uses pepdbagent.
What we can do is download new uploads and updates into some repository, store all GEO PEPs there, and then create one zip file. In other words, instead of downloading all GEO projects from PEPhub every time, most of them will already be stored in a local repository.
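A rough sketch of that incremental flow; `fetch_geo_projects_updated_since()` is a hypothetical stand-in for whatever pepdbagent query ends up being used:

```python
import yaml
from datetime import datetime
from pathlib import Path

REPO = Path("geo_pep_repo")  # hypothetical local repository of GEO PEPs

def fetch_geo_projects_updated_since(last_run: datetime):
    """Hypothetical stand-in for a pepdbagent query yielding (name, dict) pairs."""
    raise NotImplementedError  # wire up to pepdbagent here

def sync_repo(last_run: datetime) -> int:
    """Pull only projects added or updated since the last run into REPO."""
    n_updated = 0
    for name, project_dict in fetch_geo_projects_updated_since(last_run):
        dest = REPO / name
        dest.mkdir(parents=True, exist_ok=True)
        (dest / "project.yaml").write_text(yaml.safe_dump(project_dict))
        n_updated += 1
    return n_updated
```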
> I am using geopephub to run it from the CLI, and underneath it uses pepdbagent.
So just to be clear: You are not downloading them from the PEPhub API. You are not downloading them from GEO. You are retrieving them from the PEP database. Right?
> Instead of downloading all GEO projects from PEPhub every time, most of them will already be stored in a local repository.
Yes, this is what we should do. This means probably not AWS Lambda or a GitHub Action; probably Rivanna.
Can you create a script that does this?
It should take as a parameter:
It should store information about the run in a metadata file (a .json file) stored in the repository, so each time it runs, it knows when the last run was.
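A minimal sketch of that bookkeeping, using only the standard library (the file name and fields are assumptions):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

META = Path("geo_pep_repo/.run_metadata.json")  # hypothetical file name

def read_last_run() -> datetime | None:
    """Return the timestamp of the previous run, or None on the first run."""
    if not META.exists():
        return None
    return datetime.fromisoformat(json.loads(META.read_text())["last_run"])

def record_run(n_updated: int) -> None:
    """Record when this run happened and how many projects it touched."""
    META.write_text(json.dumps(
        {
            "last_run": datetime.now(timezone.utc).isoformat(),
            "projects_updated": n_updated,
        },
        indent=2,
    ))
```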
If you can create this script, I can handle creating a process on Rivanna that will run it every quarter or something.
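For reference, a quarterly schedule in cron could be expressed as `0 0 1 */3 *` (midnight on the first day of every third month); the exact timing here is just a placeholder.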
> So just to be clear: You are not downloading them from the PEPhub API. You are not downloading them from GEO. You are retrieving them from the PEP database. Right?
Yes, you are correct. I am NOT downloading them from GEO; I use PEPhub.
Sounds good, it will be quite easy to do.
We produced a tarball of all the GEO PEPs in January 2024. This is useful, but it quickly becomes outdated as new data is added to GEO.
We're automatically pulling that new data into PEPhub via https://github.com/pepkit/geopephub.
We should automate the process of producing the "All GEO PEPs" downloadable tarball, and keep an archive of these. TODO:
- /project/shefflab/processed
  -- these should be managed in a consistent way so they just get pushed to the right place, maybe stored in the deployment folder or something.
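A sketch of the final archiving step, producing a timestamped tarball like the January 2024 one (the repository path and file naming are placeholders):

```python
import tarfile
from datetime import date
from pathlib import Path

REPO = Path("geo_pep_repo")                # hypothetical local repository
OUT = Path("/project/shefflab/processed")  # destination mentioned in this issue

def make_archive() -> Path:
    """Bundle the whole repository into one timestamped tarball."""
    out_path = OUT / f"all_geo_peps_{date.today():%Y%m%d}.tar.gz"
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(REPO, arcname=REPO.name)
    return out_path
```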