opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org

Can we move raw summary statistics files from GWAS Catalog to GCP? #3125

Closed DSuveges closed 10 months ago

DSuveges commented 11 months ago

The current implementation of the summary stats ingestion is quite inefficient. Although we might be able to save time on computation, ingesting only a few hundred summary stats per day won't scale (thinking about the upcoming flood of summary statistics). We need to explore options to move the data out of NFS faster.

Option 1 - Using Google data transfer jobs

It's all on us; however, we would need to rely on FTP/HTTPS transfer speeds, which might not be very efficient, and we would need to assemble and keep updating the file lists ourselves.

Option 2 - Copying files as a datamover job

This still partially depends on EBI infra, and only a limited number of jobs can run at a time, which caps the number of concurrent transfers. On the other hand, I assume much faster transfer times (jobs do fail due to the time limit, but never due to the transfer itself).

Option 3 - Downloading from FTP on a VM

Script the downloads from the FTP to Google VMs, then move the files to Google buckets. It might even be doable as a single step if the bucket is mounted directly on the VM, as in the sketch below.
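A minimal sketch of the single-step variant, assuming gcsfuse is installed on the VM and using a placeholder bucket name:

mkdir -p /mnt/sumstats
gcsfuse gwas-sumstats-raw /mnt/sumstats            # mount the placeholder bucket as a local directory
wget -r -nH --cut-dirs=4 -P /mnt/sumstats \
    ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/    # files land directly in the bucket
fusermount -u /mnt/sumstats                        # unmount when done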

@mbdebian, what are your opinions on this? I know you have invested quite a bit of time writing the very sophisticated tracking of study processing, and for option 2 we could reuse at least some of it. Honestly, I think that could be the fastest.

tskir commented 11 months ago

Hi @DSuveges @mbdebian! If I could chime in: the investigations I've done previously on GWAS Catalog data suggest that, if we ingest using the FTP protocol from inside Google Cloud, speeds should not be a limiting factor at all. Previously this was just a quick test, but now I've performed a proper benchmark.

Benchmark setup

First, I fetched 1000 random harmonised study URLs from GWAS Catalog. I know it's a lot, but I wanted to have a decent sample:

wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/harmonised_list.txt \
| shuf -n 1000 \
| parallel echo "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/{}" \
> list_to_download.txt

Then, I downloaded them on a Google Cloud VM using this command: time aria2c -i list_to_download.txt. (The client needs to be installed with sudo apt install aria2 first.)

I then checked the total size of all downloaded files with du -ch *.gz | grep total.

Benchmark results

It took 14 minutes 8 seconds of wall time in total to download 1000 studies, or 0.85 seconds of wall time per study on average.

The total size of all downloaded files is 284 GB, so the average effective download speed is approximately 343 MB/s (that's megabytes, not megabits).

tskir commented 11 months ago

Notes and considerations

  1. Data ingress is free in Google Cloud, so we don't pay anything additional to transfer the data into it.
  2. For the benchmark I used the aria2c download client, because it can handle downloading thousands of files, download them in parallel, retry failing downloads, etc. However, I don't imagine the results would be vastly different if we used something simpler like curl or wget, or even just ingested the FTP links directly inside a PySpark session.
  3. I was working on a Google Cloud VM, but I think that if we run this as a Dataproc job, speeds should be very similar.

Recommended approach

I don't think there is any need to have anything running on the EBI infrastructure for the GWAS Catalog ingestion. I think we can implement this in the usual manner as part of the genetics_etl_python repository. The step can be made incremental, and the Airflow workflow can then trigger it once a week, or however frequently we want, to synchronise the lists and download any additional studies; a rough sketch of the logic follows.
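A rough shell sketch of that incremental logic (gs://gwas-sumstats-raw is a placeholder bucket name; the real step would live in genetics_etl_python):

wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/harmonised_list.txt \
    > remote_list.txt                                              # relative paths of every harmonised study
gsutil ls 'gs://gwas-sumstats-raw/harmonised/**' \
    | xargs -n1 basename | sort > ingested.txt                     # file names we already hold
# keep only the studies whose file is not yet in the bucket, and turn them into FTP URLs
grep -v -F -f ingested.txt remote_list.txt \
    | sed 's|^|ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/|' \
    > new_studies.txt
aria2c -i new_studies.txt --dir=downloads                          # download only the new studies
gsutil -m cp downloads/* gs://gwas-sumstats-raw/harmonised/        # and push them to the bucket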

As soon as I'm done with eQTL Catalog, I can implement this ingestion myself.

DSuveges commented 11 months ago

It took 14 minutes 8 seconds of wall time in total to download 1000 studies

OK. With the current approach it takes around 3+ days to get there (although there could be two or even three orders of magnitude of difference in file sizes).

I did some exploration to see how quick the process could be. I too saw some performance improvement consistent with your observation, but the speed was nowhere near yours: 25 MB/s for FTP and 8 MB/s for HTTPS, all from non-EBI infra.

tskir commented 11 months ago

@DSuveges Thank you for your comments! Do you think you could try to reproduce my benchmark using the commands above? Maybe I'm indeed missing something. It would be good to know for sure that it works, can be reproduced, and is indeed that fast.

I attach the list of files that I used: list_to_download.txt — or alternatively you can re-run the command above and get a different random list of files, but then the comparison wouldn't be as exact.

I agree the point about differences in file size is very important. Hopefully, by sampling 1,000 random studies, I managed to capture all sorts of extremes in the distribution, but of course we can never know for sure until we download everything. In my sample, the smallest study is 0.93 MB, the average is 291 MB, the largest is 1941 MB, and the distribution looks like this: [histogram attached]

tskir commented 11 months ago

Oh, and regarding speeds: I know that when I tested the download of just a single file with curl before, the speeds were only around 20–25 MB/s, which is very similar to what you describe.

I suspect the reason might be that, when downloading the files one by one, the client (curl) has to establish an FTP handshake with the server for every file, which can be very slow and probably takes more time than the actual download for most studies. In contrast, when we fetch a list of files from the same FTP server, the handshake only has to be done once.

In this case, my note above about the choice of aria2c not being important should probably be amended: for an efficient FTP download, it does matter which tool we use and how we request the files.
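For instance, something along these lines should reuse connections and download in parallel (a sketch; flag names as I recall them from the aria2 docs):

# -j sets the number of parallel downloads, --ftp-reuse-connection keeps the control
# connection alive between files, and --max-tries retries transient failures
aria2c -i list_to_download.txt -j 8 --ftp-reuse-connection=true --max-tries=5 --dir=downloads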

d0choa commented 10 months ago

A few updates:

The GCP file transfer didn't fly: it complained about the robots.txt on the EBI HTTPS server, so we should discard Option 1.

There were a couple of conversations with TSC today about how we could go about this. They suggested a solution similar to Option 2, which would imply a datamover job in the EBI infrastructure running on a cron schedule, in which a gsutil rsync uploads all the files we need to the bucket. See the example next:

# cat /nfs/rw/fg_general/TSC/logs/www-prod/upload-to-google.sh
#!/bin/bash
export https_proxy=http://www-proxy:8080/
export http_proxy=http://www-proxy:8080/

cd /nfs/rw/fg_general/TSC/logs/www-prod/

gcloud auth activate-service-account <service-account> --key-file=<service-key>.json

for i in $(find  -maxdepth 2 -mindepth 2 -type d)
do
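        # ${i#./} strips the leading "./" that find prepends to each directory name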
        echo ${i#./}
        gsutil -q -m rsync -r ${i#./} gs://transfer-logs-in/${i#./}
done
gcloud auth revoke <service-account>

echo "DONE "

We initially thought this could be an interesting option if we want to include it on the GWAS Catalog side: when the files are uploaded to the FTP, they would also be moved to our bucket. However, @jdhayhurst confirmed that the FTP copy of the file is the only one left after the process is finished, so we will need a different process anyway. In this scenario, not requiring any setup on the EBI infrastructure is probably a plus, but I have to acknowledge this setup seems simple enough to consider.

This led us to an "Option 3 + @tskir's suggestions" approach:

I asked TSC whether they anticipate any differences in performance between FTP -> GCP vs EBI infrastructure -> GCP. At first glance they had no immediate concerns, so it does look like this approach could work.

The full requirements are simple:

d0choa commented 10 months ago

@DSuveges and I prototyped a script that uploads all summary stats through a gsutil rsync command in the SLURM queue of the EBI cluster. Parallelised over 16 cores within the same node, it runs at approximately 0.6 GB/s, which is more than enough.

The rsync job excludes unnecessary files and uses the default comparison (timestamp instead of checksum). Moving this job to a SLURM cron would solve most of our issues. We'll wait for @mbdebian's feedback. In the meantime, we have a copy of the harmonised sumstats that we can use to unblock other tasks.
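For reference, a rough sketch of what the job looks like (the resources, source path, bucket name and exclusion pattern below are placeholders, and the service account is assumed to be authenticated already, as in the TSC example above):

#!/bin/bash
#SBATCH --job-name=sumstats-rsync
#SBATCH --cpus-per-task=16
#SBATCH --time=24:00:00

# -m parallelises the transfer; -x skips files matching the (placeholder) regex;
# rsync compares timestamps by default, so unchanged files are skipped on re-runs
gsutil -m rsync -r -x '.*\.log$|.*README.*' \
    /nfs/path/to/gwas/summary_statistics/harmonised/ \
    gs://gwas-sumstats-raw/harmonised/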

P.S. We had a lot of headaches with the exclusion regexp: the same pattern behaves differently due to something in the environment we could not identify. It worked as a simple script with local Google SDK binaries, but not in Manuel's Singularity image.