pepkit / geofetch

Builds a PEP from SRA or GEO accessions
https://pep.databio.org/geofetch/
BSD 2-Clause "Simplified" License

Downloading metadata for all GSE #129

Closed srp33 closed 7 months ago

srp33 commented 8 months ago

Very nice tool! Thanks for creating it. I want to build a table with the description and experiment metadata for every series in GEO. Is there a way to do that other than looping through each series one at a time?

Also, I am not sure how to get the experiment_metadata; accessing it gives me an error. Below is the code I am using:

from geofetch import Geofetcher, Finder
import sys

out_file_path = sys.argv[1]

# List every GSE accession available in GEO
gse_obj = Finder()
gse_list = sorted(gse_obj.get_gse_all())

geof = Geofetcher()

for gse in gse_list:
    # Fetch only the metadata, not the underlying data files
    project = geof.get_projects(gse, just_metadata=True)
    key = f"{gse}_raw"

    if key in project:
        print(project[key])
        print(project[key].description)
        print(project[key].experiment_metadata)  # this is the line that errors
    break  # just testing one series for now
nleroy917 commented 8 months ago

Hi @srp33! Thanks for opening an issue. @khoroshevskyi is the core maintainer here, and he'll be able to give you a much more detailed explanation of why this is the case. In the meantime, to solve the issue you can use the following to get the experiment metadata:

project[key].config['experiment_metadata']
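
For example, dropped into the loop from your snippet (same setup as above), that looks like:

if key in project:
    print(project[key].description)
    # experiment_metadata currently lives under the project's config mapping
    print(project[key].config["experiment_metadata"])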
srp33 commented 8 months ago

Thanks, @nleroy917 and @khoroshevskyi !

That fix worked for me. Now I'm just hoping to speed it up. I've downloaded metadata for 9,000 series in 16 hours, so it will take a while to get through all 216,000. Will I crash your server if I parallelize it?

nsheff commented 8 months ago

Will I crash your server if I parallelize it?

Geofetch isn't interfacing with our server, it's downloading the data directly from GEO. So, you'd have to ask NCBI that question :).

Two thoughts:

  1. If you're just downloading metadata, and not the data as well, then it surprises me it takes that long. Are you sure you're not downloading the underlying data files as well?

  2. If you're trying to grab data for a large number of files, you might instead be interested in PEPhub. We actually already processed all of GEO with geofetch and made all the metadata available via an API. PEPhub is running on our servers, but it should be a lot faster since you'd just be grabbing the already-formatted metadata from an API. You might see if you can use that API to serve your needs. There's a Python package, PEPhubClient, that gives you a CLI for retrieving metadata from PEPhub (rough sketch below). In a way this is similar to geofetch; the difference is that with geofetch you download the SOFT files from GEO and then format them into a PEP, whereas with PEPhub we already did that for all of GEO. And it's not restricted to GEO -- you can edit your own projects on there as well.

If neither of these works, I probably wouldn't parallelize too much for fear of getting blocked by NCBI.
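
For reference, a minimal PEPhubClient sketch might look something like this (the load_project method and the geo namespace/tag here are my assumptions -- double-check the pephubclient docs for the current API):

from pephubclient import PEPHubClient

phc = PEPHubClient()

# Registry paths are "namespace/project:tag"; GEO series parsed by
# geofetch are assumed to live under the "geo" namespace on PEPhub.
gse = "GSE12345"  # placeholder accession -- substitute your own
project = phc.load_project(f"geo/{gse}:default")
print(project.config.get("experiment_metadata"))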

khoroshevskyi commented 8 months ago

As mentioned by @nsheff earlier, the data is retrieved directly from NCBI GEO. Additionally, you can find all of this data on pephub.databio.org. To process it we also used geofetch, and in my experience and testing the most time-consuming step was downloading the SOFT files. The complete download of all GEO data took us over a week, since some SOFT files are substantial, exceeding 500MB in size.

To expedite the process, there's an option to bypass saving SOFT files on your local machine if they are not required. This can be achieved with the following code:

prj = geof.get_projects(gse, just_metadata=True, discard_soft=True)
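
Put together with the Finder loop from your first post, a metadata-only pass over all series would look roughly like this:

from geofetch import Geofetcher, Finder

geof = Geofetcher()
gse_list = sorted(Finder().get_gse_all())

for gse in gse_list:
    # just_metadata skips the data files; discard_soft avoids writing
    # the (sometimes >500MB) SOFT files to your local machine
    prj = geof.get_projects(gse, just_metadata=True, discard_soft=True)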

I would suggest using PEPhub and PEPhubClient to find and download GSE projects, as they are already processed.

Feel free to reach out if you need any assistance. Additionally, we welcome any suggestions you might have for improving the Geofetch API.

srp33 commented 8 months ago

Thanks for your help, @nsheff and @khoroshevskyi !! Sorry if I am missing it, but is there a way to download metadata for all GEO series in one go? Or would I need to do it one project at a time? If the latter, can you give me a tip on how to do that?

srp33 commented 8 months ago

I mean using pephubclient. @nsheff @khoroshevskyi

nsheff commented 8 months ago

Hey @srp33 -- I had somehow missed that you wanted to download all of GEO. There's no way to download it all at once; it would be one project at a time. However, this got us thinking.

Basically, what you want is just the parsed data that we already have in the database, so we need to add a new feature to do a routine database dump. I think this is doable, but it's not currently set up. The good thing is that since we're automatically indexing and parsing GEO daily as new records are added, you can always come back later and grab a fresh dump to rerun on a more up-to-date version.

Give us a bit and @khoroshevskyi will try to implement a dumping mechanism so that you can download it.

khoroshevskyi commented 7 months ago

@srp33 Hi Stephen, I wanted to give you an update. We've recently deployed a new PEPhub instance, which includes a link to a tar file containing all GEO metadata saved as a PEP. You can find the download link here: https://pephub.databio.org/geo
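
If it helps, fetching and unpacking the dump from a script could look roughly like this (the tar URL below is a placeholder -- take the actual link from the page above):

import tarfile
import urllib.request

# Placeholder URL -- copy the real tar link from https://pephub.databio.org/geo
url = "https://pephub.databio.org/path/to/geo_dump.tar"
urllib.request.urlretrieve(url, "geo_dump.tar")

with tarfile.open("geo_dump.tar") as tar:
    tar.extractall("geo_peps")  # extract all the GEO PEPs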


Additionally, if you're interested in specific PEPs, you can log in to PEPhub to add them to your favorites or create groups of PEPs (referred to as POPs). You can also use our semantic search feature to find projects of interest.

If you have any other questions, please let us know.

srp33 commented 7 months ago

@khoroshevskyi Thank you kindly!