hytest-org / hytest

https://hytest-org.github.io/hytest/

rechunk SSEBop MODIS daily data #506

Open amsnyder opened 4 months ago

amsnyder commented 4 months ago

@thodson-usgs wrote a pangeo-forge recipe to rechunk this data, but it hasn't been run so we want to reproduce this in a notebook to complete the work. The recipe you will want to duplicate is here, and the data source is here.

You can download the data inputs on Hovenweep here: /caldera/hovenweep/projects/usgs/water/impd/hytest/rechunking/ssebop_modis_daily

amsnyder commented 4 months ago

I think the best way to grab the files you will be working on is to navigate into the directory where you will store the data and download them with wget. For example, the data for the year 2000 can be downloaded with wget https://edcintl.cr.usgs.gov/downloads/sciweb1/shared/uswem/web/conus/eta/modis_eta/daily/downloads/2000.zip (you will see this data already in the directory I created for you). You can get the URLs for download by right-clicking them and then selecting 'Copy link'. It's probably a good idea to document all of this so your notebooks are fully reproducible; you could start your Python notebook with a cell containing the command line commands you ran, putting an exclamation mark in front of each command to let the notebook know it is a command line call.
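For example, a first notebook cell along these lines would record the download step (the year and target directory are just illustrative; the `!` prefix hands the line to the shell and the `{}` syntax interpolates Python variables):

    data_dir = "/caldera/hovenweep/projects/usgs/water/impd/hytest/rechunking/ssebop_modis_daily"

    # Download and unpack one year of daily SSEBop TIFFs into the shared directory.
    !wget -P {data_dir} https://edcintl.cr.usgs.gov/downloads/sciweb1/shared/uswem/web/conus/eta/modis_eta/daily/downloads/2000.zip
    !unzip -o {data_dir}/2000.zip -d {data_dir}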

There are plenty of other ways to do this as well so feel free to choose another approach if you would prefer that.

amsnyder commented 4 months ago

And I'm copying the information about setting up your jupyter environment on Hovenweep here just to keep everything together:

  1. Log into Hovenweep using OnDemand
  2. Open a terminal and follow these instructions to get miniconda installed for yourself: Python Environment Setup with Conda - ARC HPC Portal (usgs.gov). Stop after the line where you source conda (don't create the example environment they show there).
  3. Copy the environment Parker has been successfully working with into your home directory on Hovenweep (shared with you in chat, but we can find a better home for this file on GitHub sometime soon). Navigate to your home directory in the terminal and run conda env create --file ./pangeo_310.yml to create the environment.
  4. Run the following commands to get that environment to show up as a kernel for your jupyter notebooks:
    conda activate pangeo_310
    ipython kernel install --user --name=pangeo_310
    conda deactivate
amsnyder commented 3 months ago

These TIFFs do not have a timestamp stored as a dimension when you open each file, so we need to add this information ourselves in order to open all of the datasets with xr.open_mfdataset(). There are 3 approaches we discussed:

  1. Pass a preprocessing function to xr.open_mfdataset() that extracts the date from each filename and adds it as a dimension (a usage sketch follows this list). The following function seems to be working for this:

    from datetime import date, timedelta
    import pandas as pd

    def parse_dates_from_filename(ds):
        # Use the first data variable to recover the path of the source file.
        var = next(var for var in ds)
        path = ds[var].encoding['source']
        # Filenames embed the date as YYYYDDD between 'det' and '.modisSSEBopETactual'.
        date_str = path.split('det')[1].split('.modisSSEBopETactual')[0]
        year = date_str[0:4]
        doy = date_str[4:]
        strt_date = date(int(year), 1, 1)
        final_date = strt_date + timedelta(days=int(doy) - 1)
        return ds.expand_dims(time=[pd.to_datetime(final_date)])
  2. Add a placeholder timestamp (e.g. stamp each file with datetime.utcnow()) for the time dimension when we open the files, and then overwrite it with the correct timestamps later. @pnorton-usgs tried this and the sequence of the time steps was not maintained: even when he passed in an explicit list of filenames in order (instead of using a wildcard), the time steps appeared out of order in the resulting dataset. The open_mfdataset() function may be opening items in that file list out of order, or it may be opening several at the exact same time, giving them the same timestamp.

  3. Explicitly open each file in a file list and concatenate them using something like da = xr.concat([xr.open_rasterio(ii) for ii in filelist], dim=time_var). Here time_var is a manually constructed xarray variable with the timestamps in the same order as the file list:

    import cftime as cf  # assuming 'cf' here refers to the cftime package
    import xarray as xr

    # Monthly range shown for illustration; for this dataset the timestamps would be daily.
    st = xr.cftime_range(start='2004', periods=12, freq='MS')
    time_units = 'days since 1990-01-01'
    time_calendar = 'standard'
    time_var = xr.Variable('time', cf.date2num(st, units=time_units, calendar=time_calendar),
                           attrs=dict(units=time_units, calendar=time_calendar))
amsnyder commented 3 months ago

One issue we have noticed with the dataset is that certain years that are not leap years have data available for 366 days. Specifically, the dates 2019366 and 2021366 are present even though 2019 and 2021 are not leap years. @pnorton-usgs opened those files and compared 2019366 to 2020001, and the data did not match. We need to look into this, either by searching the source documentation for an explanation or by reaching out to the data provider to ask what is going on. 2021366 could not be compared to 2022001 because the dataset ends at the end of 2021, so there is no data available for 2022.
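For reference, the kind of check @pnorton-usgs did could be reproduced with something like this (the filenames below are hypothetical and just follow the det<YYYYDDD> pattern; rioxarray is assumed to be available):

    import numpy as np
    import rioxarray

    # Hypothetical filenames for the suspect day and the following New Year's Day.
    da_2019366 = rioxarray.open_rasterio('det2019366.modisSSEBopETactual.tif')
    da_2020001 = rioxarray.open_rasterio('det2020001.modisSSEBopETactual.tif')

    # If 2019366 were simply a duplicate of 2020-01-01, this would print True.
    print(np.allclose(da_2019366.values, da_2020001.values, equal_nan=True))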

amsnyder commented 3 months ago

@pnorton-usgs also noticed some issues related to the scale factor and data types that he can comment on here.

amsnyder commented 3 months ago

@pnorton-usgs was able to create a zarr with one year of data (using rechunker, a different method than we were using in the notebook) and it took a little over an hour to run. This is quite slow given that the whole dataset is under 20GB. We think this could be because the data is stored as a whole map in each file, with each file representing only one time step. Every chunk that is being written has to open up several files to get the necessary data values, write out the chunk, then close them. It is then reopening these files many times as they are needed to write other data chunks. We are not totally sure that this is what is going on, but it is our best guess. One solution to look into would be to persist the data in memory with dask the whole time while the rechunking is going on so the same file is not being opened/closed multiple times (if this is what is causing the calculation to run so slowly).
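A rough sketch of that idea, assuming the roughly 20GB dataset fits in the Dask cluster's memory and reusing the ds from the open_mfdataset sketch above (dimension names, chunk sizes, and the output path are illustrative):

    from dask.distributed import Client

    client = Client()  # size the workers so the full dataset fits in distributed memory

    # Pull the source data into memory once, so writing each output chunk
    # does not repeatedly reopen and close the per-day TIFFs.
    ds_mem = ds.persist()

    # Illustrative target chunking: many time steps per chunk, modest spatial tiles.
    ds_mem = ds_mem.chunk({'time': 365, 'y': 512, 'x': 512})

    # Drop chunk encoding inherited from the source files so the dask chunks
    # above define the Zarr chunks.
    for name in ds_mem.variables:
        ds_mem[name].encoding.pop('chunks', None)

    ds_mem.to_zarr('ssebop_daily_rechunked.zarr', mode='w')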

thodson-usgs commented 3 months ago

@amsnyder, my understanding is that rechunker is smarter than that, but the intermediate store is indeed the bottleneck. Another workaround is to increase your worker memory and/or reduce your time chunk to negate the need for an intermediate store. This is the old-fashioned way (which rechunker sought to improve upon), but it's still powerful: much faster, and it can eliminate the need for a reduce step (i.e. nice for serverless).

pnorton-usgs commented 3 months ago

> @pnorton-usgs also noticed some issues related to the scale factor and data types that he can comment on here.

What I found is that the documentation (https://earlywarning.usgs.gov/docs/USA_DAILYSSEBopETa_Oct2019.pdf) states that to get the actual value in millimeters the stored value should be divided by 1000; however, the TIFF only reports a scale factor of 1. Because of this, the values will not be properly scaled when the file is opened. I believe we can load the files with mask_and_scale=False and then update the scale_factor attribute before writing the zarr.
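A minimal sketch of that idea, reusing the file list and preprocess function from earlier (the 0.001 factor comes from the documentation; exactly where the corrected factor should live, attrs vs. encoding, is worth double-checking against how downstream tools will decode the zarr):

    import xarray as xr

    # Keep the raw stored integers: don't let xarray apply the (incorrect) scale of 1.
    ds = xr.open_mfdataset(files, engine='rasterio', mask_and_scale=False,
                           combine='by_coords', preprocess=parse_dates_from_filename)

    # Per the SSEBop docs, actual ETa (mm) = stored value / 1000, so advertise
    # scale_factor = 0.001 for CF-aware readers instead of the incorrect 1.
    for var in ds.data_vars:
        ds[var].attrs['scale_factor'] = 0.001

    ds.to_zarr('ssebop_daily_raw.zarr', mode='w')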

dbrannonUSGS commented 1 month ago

The data provider has updated the dataset that is being rechunked. The new data has some inconsistencies that complicate the rechunking process: data for years 2000-2017 and 2018-2024 can each be successfully rechunked within their respective groupings, but combining the two produces "non-monotonic" dimensions. Work is continuing to fix this issue.

kjdoore commented 1 month ago

@dbrannonUSGS, I have found the same issue in the monthly data. Here is the code snippet for how I rectified this:

# Snap ds1 onto ds2's nearly-identical grid, then force both datasets to share
# exactly the same lat/lon index values (time is left alone).
ds1 = ds1.sel(lon=ds2["lon"], lat=ds2["lat"], method="nearest", tolerance=1e-10)
ds1, ds2 = xr.align(ds1, ds2, join="override", exclude="time")

ds1 and ds2 are the datasets for the two time periods. The snippet also assumes the xarray.coords are "lat", "lon", and "time". You will simply need to adjust if they are different or if the tolerance is too low. The easiest way to check tolerance is to run:

print(f"Maximum absolute difference in latitude: {np.abs(ds1.lat - ds2.lat).max()}")
print(f"Maximum absolute difference in longitude: {np.abs(ds1.lon - ds2.lon).max()}")

FYI, a difference of even $10^{-8}$ degrees on Earth's surface is on the order of a millimeter.
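Once the coordinates agree, the two periods should combine cleanly, e.g. (a sketch assuming ds1 and ds2 have already been aligned as above):

import xarray as xr

# With identical lat/lon on both pieces, concatenating along time should no
# longer trigger the non-monotonic dimension error.
ds_combined = xr.concat([ds1, ds2], dim="time")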

amsnyder commented 1 month ago

@pnorton-usgs could you review @dbrannonUSGS's work which has been copied into https://github.com/hytest-org/ssebop-modis-daily-to-zarr?

I messed up the creation of the repo and didn't make an empty main branch, so the work is on the initial_commit branch. If you have any suggested changes, you can just create a new branch against initial_commit and open a PR with your suggestions. I will set up the main branch once you have reviewed the code.

@dbrannonUSGS can you comment here with the location of the zarr you created so @pnorton-usgs can take a look?