DSuveges closed this issue 10 months ago.
Hi @DSuveges @mbdebian! If I could chime in: the investigations I've done previously on GWAS Catalog data suggest that, if we ingest over FTP from inside Google Cloud, speed should not be a limiting factor at all. Previously this was just a quick test, but now I've performed a proper benchmark.
First, I fetched 1000 random harmonised study URLs from the GWAS Catalog. I know it's a lot, but I wanted to have a decent sample:

```shell
wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/harmonised_list.txt \
  | shuf -n 1000 \
  | parallel echo "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/{}" \
  > list_to_download.txt
```
Then, I downloaded them on a Google Cloud VM using this command: `time aria2c -i list_to_download.txt`. (The client needs to be installed with `sudo apt install aria2` first.)
I then checked the total size of all downloaded files with `du -ch *.gz | grep total`.
It took 14 minutes 8 seconds of wall time in total to download 1000 studies, or 0.85 seconds of wall time per study on average.
The total size of all downloaded files is 284 GB, so the average effective download speed is approximately 343 MB/s (that's megabytes, not megabits).
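As a quick sanity check on those figures (assuming the 284 GB total is GiB-based, which is what `du` reports):

```shell
# Sanity check of the quoted benchmark: 284 GiB downloaded in 14 min 8 s.
total_mib=$((284 * 1024))      # total size in MiB (GiB-based, as reported by du)
seconds=$((14 * 60 + 8))       # wall time in seconds
awk -v m="$total_mib" -v s="$seconds" \
  'BEGIN { printf "%.2f s/study, %.0f MiB/s\n", s / 1000, m / s }'
# prints: 0.85 s/study, 343 MiB/s
```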
I don't think there is a need to have anything at all running on the EBI infrastructure in relation to the GWAS Catalog ingestion. I think we can implement this in the usual manner as part of the genetics_etl_python repository. The step can be made incremental, and then the Airflow workflow can trigger it once a week, or however frequently we want, to synchronise the lists and download the additional studies.
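The incremental step could be sketched as follows: diff the remote manifest against what has already been ingested and fetch only the difference. The manifests and study file names below are made up for illustration; in practice `remote_list.txt` would come from `harmonised_list.txt` and `ingested_list.txt` from a `gsutil ls` of the bucket.

```shell
# Toy manifests (hypothetical file names, for illustration only).
printf 'GCST001/a.h.tsv.gz\nGCST002/b.h.tsv.gz\nGCST003/c.h.tsv.gz\n' > remote_list.txt
printf 'GCST001/a.h.tsv.gz\n' > ingested_list.txt

BASE="ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics"

# comm -23 keeps lines unique to the first (sorted) input: studies that are
# in the remote manifest but have not been ingested yet.
sort remote_list.txt > remote_sorted.txt
sort ingested_list.txt > ingested_sorted.txt
comm -23 remote_sorted.txt ingested_sorted.txt \
  | sed "s|^|$BASE/|" > new_to_download.txt

cat new_to_download.txt
# aria2c -i new_to_download.txt   # then fetch only the new studies
```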
As soon as I'm done with eQTL Catalog, I can implement this ingestion myself.
> It took 14 minutes 8 seconds of wall time in total to download 1000 studies
OK. With the current approach it takes 3+ days to get there. (However, there can be two or even three orders of magnitude of difference in the sizes of individual files.)
I did some exploration to see how quick the process could be. I too saw some performance improvement consistent with your observation, but the speed was nowhere near yours: 25 MB/s for FTP and 8 MB/s for HTTPS, all from non-EBI infrastructure.
@DSuveges Thank you for your comments! Do you think you can try and reproduce my benchmark using the commands above? Maybe indeed I'm missing something. It would be good to know for sure that it works, can be reproduced, and is indeed that fast.
I attach the list of files that I used: list_to_download.txt — or alternatively you can re-run the command above and get a different random list of files, but then the comparison wouldn't be as exact.
I agree the point about differences in file size is very important. Hopefully, by sampling 1,000 random studies, I managed to capture all sorts of extremes in the distribution, but of course we can never know for sure until we download everything. In my sample, the smallest study is 0.93 MB, the average is 291 MB, the largest is 1941 MB, and the histogram looks like this:
Oh, and regarding speeds: I know that when I tested the download of just a single file with curl before, the speeds were only around 20–25 MB/s, which is very similar to what you describe.
I suspect the reason might be that, when downloading the files one by one, the client (curl) has to establish an FTP handshake with the server for each file, which can be very slow and probably takes more time than the actual download for most studies. In contrast, when we fetch a list of files from the same FTP server, the handshake only has to be done once.
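To make the contrast concrete, here is a dry-run sketch of the two access patterns (the study file names are made up, and the commands are echoed rather than executed):

```shell
BASE="ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics"
# Two hypothetical study files, for illustration only.
files="GCST001.h.tsv.gz GCST002.h.tsv.gz"

# One-by-one: each iteration spawns a fresh curl process, i.e. a fresh
# FTP control connection and handshake per file (dry run: echo only).
for f in $files; do
  echo curl -sO "$BASE/$f"
done

# Batched: a single aria2c process reads the whole list and can keep the
# connection to the server alive between files.
for f in $files; do echo "$BASE/$f"; done > batch.txt
echo aria2c -i batch.txt
```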
In this case, my comment above about how the choice of aria2c wasn't important should probably be amended. It looks like, for efficient FTP downloads, it does matter which tool we use and how we request the files.
A few updates:
The GCP data transfer job didn't fly: it complained about the robots.txt on the EBI HTTPS server. So we should discard Option 1.
There were a couple of conversations with TSC today about how we could go about this. They suggested a solution similar to Option 2: a data mover job on EBI infrastructure running as a cron job, in which a gsutil rsync uploads all the files we need to the bucket. See the example:
```shell
# cat /nfs/rw/fg_general/TSC/logs/www-prod/upload-to-google.sh
#!/bin/bash
export https_proxy=http://www-proxy:8080/
export http_proxy=http://www-proxy:8080/
cd /nfs/rw/fg_general/TSC/logs/www-prod/
gcloud auth activate-service-account <service-account> --key-file=<service-key>.json
for i in $(find -maxdepth 2 -mindepth 2 -type d)
do
    echo ${i#./}
    gsutil -q -m rsync -r ${i#./} gs://transfer-logs-in/${i#./}
done
gcloud auth revoke <service-account>
echo "DONE"
```
At first we thought this could be an interesting option if we included it on the GWAS Catalog side: when the files are uploaded to the FTP, they would also be moved to our bucket. However, @jdhayhurst confirmed that the FTP copy of the file is the only one left after the process is finished, so we will need a different process anyway. In this scenario, not requiring anything to be set up on EBI infrastructure is probably a plus, but I have to acknowledge that this setup seems simple enough to consider.
This leads us to a scenario combining Option 3 with @tskir's suggestions:
I asked TSC if they anticipate any differences in performance between FTP->GCP vs EBI infrastructure -> GCP. At first glance, they had no immediate concerns so it does look like this approach could work.
The full requirements are simple:
gs://open-targets-gwas-summary-stats/raw_harmonized/
@DSuveges and I prototyped a script that uploads all summary stats through a `gsutil rsync` command in the SLURM queue of the EBI cluster. Parallelised across 16 cores within the same node, it runs at approximately 0.6 GB/s, which is more than enough.
The rsync job excludes unnecessary files and uses the default comparison mode (timestamps instead of checksums). Moving this job to a SLURM cron would solve most of our issues. We'll wait for @mbdebian's feedback. In the meantime, we have a copy of the harmonised sum stats that we can use to unblock other tasks.
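For reference, a minimal sketch of what such a SLURM job could look like. The NFS source path, resource requests and exclusion pattern below are assumptions rather than the exact values of our prototype; only the destination bucket is the one listed above.

```shell
#!/bin/bash
#SBATCH --job-name=gwas-sumstats-sync
#SBATCH --cpus-per-task=16
#SBATCH --time=24:00:00

# Assumed NFS location of the harmonised summary statistics (hypothetical path).
SRC=/nfs/ftp/public/databases/gwas/summary_statistics
DST=gs://open-targets-gwas-summary-stats/raw_harmonized/

# -m parallelises the transfer; the default rsync mode compares timestamps
# rather than checksums. The -x regexp (an example pattern only) excludes
# files we do not need.
gsutil -m rsync -r -x '.*\.log$' "$SRC" "$DST"
```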
P.S. We had a lot of headaches with the exclusion regexp: the same pattern behaves differently due to something in the environment we could not identify. It worked as a simple script with local Google SDK binaries, but not in Manuel's Singularity image.
The current implementation of the summary stats ingestion is quite inefficient. Although we might be able to save time on computation, ingesting a few hundred summary stats per day won't scale (thinking about the upcoming flood of summary statistics). We need to explore options to move the data out of NFS faster.
Option 1 - Using data transfer google jobs
It's all on us; however, we need to rely on FTP/HTTPS transfer speeds, which might not be efficient enough, and we need to assemble and update the file lists ourselves.
Option 2 - Copy files as datamover job
Still partially depends on EBI infra, and not too many jobs can run at a time. I assume much faster transfer times (jobs do fail due to the time limit, but never due to the transfer itself). Still, the number of concurrent transfers is limited by the job limits.
Option 3 - from ftp on VM
Scripting the downloads from FTP to Google VMs, then moving the data to Google buckets. It might be solvable as a single step if the bucket is mounted directly on the VM.
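A hedged sketch of that single-step variant, assuming gcsfuse is installed on the VM; the mount point is arbitrary and the commands are only echoed here, not executed:

```shell
BUCKET=open-targets-gwas-summary-stats   # destination bucket from above
MNT=/tmp/sumstats_mount                  # arbitrary mount point
mkdir -p "$MNT"

# With the bucket FUSE-mounted, the downloader can write straight into it,
# collapsing "download to VM" and "move to bucket" into one step (dry run).
echo gcsfuse "$BUCKET" "$MNT"
echo aria2c -d "$MNT/raw_harmonized" -i list_to_download.txt
```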
@mbdebian, what are your opinions on this? I know you have invested quite a bit of time writing the very sophisticated tracking of study processing, but for Option 2 we can reuse at least some of it. Honestly, I think that could be the fastest.