mintproject / MINT-Data-Sync

Scripts to download new datasets as they become available and register them in the MINT Data Catalog

Download GPM data on TACC #1

Open mosoriob opened 3 years ago

mosoriob commented 3 years ago

We need to download the GPM Data

dnfeldman commented 3 years ago

@mosoriob I have a very hacky script to download GPM data; it basically generates a curl command for each file that needs to be downloaded and writes them into a bash script. Since this was (supposedly) a one-time effort, it should probably be rewritten in a more maintainable manner. In the meantime, let me know if this will suffice for now?

import datetime
import os

# Credentials come from the environment rather than being hard-coded;
# the variable names here are arbitrary -- adjust to your setup.
earthdata_username = os.environ["EARTHDATA_USERNAME"]
earthdata_password = os.environ["EARTHDATA_PASSWORD"]

def generate_download_links_for_date(input_date, download_dir):
    """Build curl commands for all 48 half-hourly IMERG files for one day."""
    day_commands = []

    day_of_year = input_date.strftime("%j")
    year = input_date.strftime("%Y")
    date_str = input_date.strftime("%Y%m%d")

    start = datetime.datetime(input_date.year, input_date.month, input_date.day, 0, 0, 0)
    num_thirty_min_intervals = 24 * 2

    for i in range(num_thirty_min_intervals):
        interval_start = start + datetime.timedelta(minutes=30*i)
        interval_end = start + datetime.timedelta(minutes=30*(i+1)) - datetime.timedelta(seconds=1)

        interval_start_str = interval_start.strftime("%H%M%S")
        interval_end_str = interval_end.strftime("%H%M%S")

        minutes_str = str(30*i).zfill(4)

        url_prefix = f"https://gpm1.gesdisc.eosdis.nasa.gov/opendap/hyrax/GPM_L3/GPM_3IMERGHHE.06/{year}/{day_of_year}"
        filename = f"3B-HHR-E.MS.MRG.3IMERG.{date_str}-S{interval_start_str}-E{interval_end_str}.{minutes_str}.V06B.HDF5.nc4"
        download_url = f"{url_prefix}/{filename}"
        download_target = f"{download_dir}/{day_of_year}/{filename}"

        # -n reads credentials from .netrc; the cookie jar persists the Earthdata login
        curl_command = f"curl -n -c ~/.urs_cookies -b ~/.urs_cookies -L --url {download_url} --create-dirs -o {download_target}"

        day_commands.append(curl_command)

    return day_commands

commands = []

date_start = datetime.datetime.strptime("2014-08-01", "%Y-%m-%d")
date_end = datetime.datetime.strptime("2014-09-01", "%Y-%m-%d")

arya_download_dir = f"/data/mint/gpm_{date_start.strftime('%Y%m%d')}_{date_end.strftime('%Y%m%d')}"

delta_days = (date_end - date_start).days

for i in range(delta_days+1):
    cur_date = date_start + datetime.timedelta(days=i)

    commands += generate_download_links_for_date(cur_date, arya_download_dir)

netrc_string = f"machine urs.earthdata.nasa.gov login {earthdata_username}  password {earthdata_password}"

with open("download_gpm.sh", "w") as f:
    f.write("#!/bin/bash\n")
    f.write(f'''rm -f .netrc && touch .netrc && echo "{netrc_string}" >> .netrc && chmod 0600 .netrc''' + "\n")
    f.write('''rm -f .urs_cookies && touch .urs_cookies''' + "\n")
    f.write("\n".join(commands))

khider commented 3 years ago

This is the 30-minute one? So the one that takes quite a bit of time, correct?

We also need CHIRPS.

Endpoint: https://data.chc.ucsb.edu/products/CHIRPS-2.0/africa_6-hourly/

Do we have a script for that one as well?

dnfeldman commented 3 years ago

^ Yeah, it's the 30 min one and yeah, it usually takes a bit of time to download (~10-30s per file)

And we don't have download scripts for CHIRPS; UCSB folks were pushing data to the data catalog directly.
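A CHIRPS downloader in the same spirit could scrape the directory index at that endpoint and emit curl commands. Below is a minimal sketch of the index-parsing and command-generation pieces only; the sample HTML, helper names, and assumption of an Apache-style listing are illustrative, not the actual CHIRPS layout:

```python
import re

CHIRPS_BASE = "https://data.chc.ucsb.edu/products/CHIRPS-2.0/africa_6-hourly/"

def extract_links(index_html, suffix=""):
    """Pull href targets out of an Apache-style directory index, skipping
    parent-directory and column-sort links; optionally filter by suffix."""
    hrefs = re.findall(r'href="([^"?]+)"', index_html)
    return [h for h in hrefs if not h.startswith(("/", "..")) and h.endswith(suffix)]

def curl_commands(base_url, names, download_dir):
    # Mirrors the GPM approach: one curl per file, creating dirs as needed.
    return [f"curl -L --url {base_url}{n} --create-dirs -o {download_dir}/{n}"
            for n in names]

# Stub of what such an index page might look like (assumed layout):
sample = ('<a href="?C=N;O=D">Name</a> <a href="/products/">Parent</a> '
          '<a href="2014/">2014/</a> <a href="readme.txt">readme.txt</a>')
print(extract_links(sample))        # ['2014/', 'readme.txt']
print(extract_links(sample, "/"))   # subdirectories only: ['2014/']
```

Fetching each year's index with the same authenticated-or-plain curl pattern and feeding the filenames through `curl_commands` would give a bash script analogous to `download_gpm.sh`.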

khider commented 3 years ago

Since they left the program, I'm assuming they are no longer pushing anything?

dnfeldman commented 3 years ago

yeah, it doesn't look like there has been any new activity for over a year