sungdukyu / LEAP_REU_Dataset_Notebook


Include preprocessing steps into the dataset creation? #12

Open jbusecke opened 1 year ago

jbusecke commented 1 year ago

Seeing the preprocessing in this cell

### Reorganize the temporal dimension/coordinate

#### Add the *time* dimension  
Originally the time information is coded in the variables **ymd** and **tod**. The **sample** index represents the time step count. 

**ymd** includes date information: the leading digit(s) give the year index, the next two digits give the month, and the last two digits give the day of the month (this matches the `ymd//10000`, `ymd%10000//100`, `ymd%10000%100` decoding used in the notebook).

**tod** represents time in the day counted in seconds.
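
To make the encoding concrete, here is a minimal sketch that decodes a single pair of values. The function name `decode_ymd_tod` and the sample values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def decode_ymd_tod(ymd: int, tod: int) -> dict:
    """Split an integer date and a seconds-of-day value into date parts."""
    year, month_day = divmod(ymd, 10000)   # leading digit(s): year index
    month, day = divmod(month_day, 100)    # next two digits: month, last two: day
    hour, remainder = divmod(tod, 3600)    # whole hours in the day
    minute = remainder // 60               # whole minutes within the hour
    return dict(year=year, month=month, day=day, hour=hour, minute=minute)


# Hypothetical sample: year index 1, February 1st, 1200 seconds into the day
print(decode_ymd_tod(10201, 1200))
# → {'year': 1, 'month': 2, 'day': 1, 'hour': 0, 'minute': 20}
```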

I am wondering whether we should produce a new version of the dataset that already includes this step. It seems like a pretty canonical example of the "Analysis-Ready" part of ARCO (Analysis-Ready, Cloud-Optimized) datasets: an expert on the dataset knows how to do it, but it is not trivial for other users, and it can easily be added to the dataset without inhibiting any other workflows.

I also think that, now that the data is properly 'archived' on Hugging Face, we can do this in a more reproducible way using pangeo-forge-recipes. A great first step would be to raise an issue over at our data-management repo.

jbusecke commented 1 year ago

I also came up with a more efficient way of creating the time dimension (no need for the for-loop):

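import cftime
import xarray as xr

def ymd_tod_to_date(ymd: int, tod: int) -> dict:
    """Decode an integer date (ymd) and seconds-of-day (tod) into date parts."""
    year = ymd // 10000
    month = ymd % 10000 // 100
    day = ymd % 10000 % 100
    hour = tod // 3600
    minute = tod % 3600 // 60
    return dict(year=year, month=month, day=day, hour=hour, minute=minute)

# Decode only the first sample; int() turns the 0-d arrays into plain Python ints
start_date_dict = ymd_tod_to_date(int(ds['ymd'][0]), int(ds['tod'][0]))
start_date = cftime.DatetimeNoLeap(start_date_dict['year'], start_date_dict['month'], start_date_dict['day'], start_date_dict['hour'], start_date_dict['minute'])
# Build the whole time axis from the start date; this assumes a regular 1200 s timestep
time = xr.cftime_range(start=start_date, freq='1200S', periods=len(ds.ymd))
# Attach the time axis to 'sample', rename the dimension, and drop the raw variables
ds = ds.assign_coords(sample=time).rename({'sample': 'time'}).drop_vars(['tod', 'ymd'])
# Check the current **time** dimension, read the timestep
ds.time.values[0:5]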

instead of

# decode the date components for all sample points
year=ds['ymd']//10000
month=ds['ymd']%10000//100
day=ds['ymd']%10000%100
hour=ds['tod']//3600
minute=ds['tod']%3600//60

# loop over all sample points
t = []
for k in range(len(ds['ymd'])):
    t.append(cftime.DatetimeNoLeap(year[k],month[k],day[k],hour[k],minute[k]))

# add the time array to the 'sample' dimension; then, rename
ds['sample'] = t
ds = ds.rename({'sample':'time'})

# now 'time' dimension replaced 'sample' dimension.
ds = ds.drop(['tod','ymd'])

# Check the current **time** dimension, read the timestep
ds.time.values[0:5]
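
One caveat: the range-based version assumes that every sample is exactly 1200 s after the previous one, whereas the loop decodes each sample independently. Below is a minimal sketch of a check for that assumption, written in plain Python so it does not depend on any particular xarray or cftime version. The helper names `noleap_seconds` and `has_uniform_step` are hypothetical, and the sample values are made up for illustration.

```python
# Days elapsed before the start of each month in a 365-day (no-leap) calendar
DAYS_BEFORE_MONTH = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]


def noleap_seconds(ymd: int, tod: int) -> int:
    """Seconds since the start of year 0 in a no-leap calendar."""
    year, month_day = divmod(ymd, 10000)
    month, day = divmod(month_day, 100)
    days = year * 365 + DAYS_BEFORE_MONTH[month - 1] + (day - 1)
    return days * 86400 + tod


def has_uniform_step(ymds, tods, step: int = 1200) -> bool:
    """Return True if consecutive samples are exactly `step` seconds apart."""
    seconds = [noleap_seconds(y, t) for y, t in zip(ymds, tods)]
    return all(b - a == step for a, b in zip(seconds, seconds[1:]))


# Made-up samples crossing the Feb 28 -> Mar 1 boundary (no Feb 29 in this calendar)
print(has_uniform_step([10228, 10228, 10301], [84000, 85200, 0]))  # → True
# Dropping the middle sample leaves a 2400 s gap
print(has_uniform_step([10228, 10301], [84000, 0]))                # → False
```

On the real dataset this could be called with `ds['ymd'].values.tolist()` and `ds['tod'].values.tolist()` before `tod` and `ymd` are dropped. If it returns `False`, the loop-based decoding is the safer choice.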

Please let me know if I should make a PR for this.