Investigate why wrfxtrm zarr data seems to be offset by a day

amsnyder commented 10 months ago

The conus404-daily-diagnostic data seems to contain all zeros on the first day (1979-10-01). Then the data values pick up on the second day (1979-10-02). Is this intentional or do we need to shift the data by one day?

pnorton-usgs commented 10 months ago

In the c404 hourly zarr dataset the precipitation variable, PREC_ACC_NC, represents an accumulation of precipitation for the prior 60 minutes at each timestep. For example, timestep 2023-01-01_00:00:00 is the accumulated precip for the prior 60 minutes. When creating the daily-from-hourly zarr dataset, this accumulation was taken into account when computing the daily accumulated precipitation. For example, to compute the daily accumulated precipitation from the hourly for 2023-01-01, we summed the precipitation from timesteps 2023-01-01_01:00:00 to 2023-01-02_00:00:00.
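The windowing described above can be sketched in plain Python. This is only an illustration of the idea, not the code used to build the daily zarr: the function name `daily_totals` and the made-up hourly values are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_totals(hourly):
    """Sum end-labelled hourly accumulations into calendar-day totals.

    `hourly` maps a timestamp to the accumulation over the prior 60 minutes.
    """
    totals = defaultdict(float)
    for stamp, value in hourly.items():
        # Stepping back one hour puts the 00:00 value on the day it closes,
        # so day D collects D 01:00 through D+1 00:00.
        day = (stamp - timedelta(hours=1)).date()
        totals[day] += value
    return dict(totals)


# Hypothetical data: 1.0 per hour from 2023-01-01 00:00 to 2023-01-03 00:00.
start = datetime(2023, 1, 1, 0)
hourly = {start + timedelta(hours=h): 1.0 for h in range(49)}
totals = daily_totals(hourly)
```

With this input, the 2023-01-01_00:00:00 value lands on 2022-12-31, and 2023-01-01 receives the 24 values from 01:00:00 through the following 00:00:00.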

With the daily diagnostic xtrm zarr dataset the files were converted as-is to the zarr format. It does appear the given dates (e.g. 1980-01-01_00:00:00) do represent the prior day. We could fix this by adjusting the time values in the dataset.

rviger-usgs commented 10 months ago

does fixing it mean dropping the reported daily accumulation for 1979-10-01 from the dataset?

pnorton-usgs commented 10 months ago

No, it just means that 1979-10-01 becomes 1979-09-30. We'll have the same number of days in the dataset, we're just shifting the dates to reflect reality.
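A minimal sketch of that relabelling, assuming each daily timestamp marks the end of the 24 hours it covers. The helper name `shift_to_covered_day` is hypothetical, and the real fix would be applied to the time coordinate of the zarr store rather than to a Python list.

```python
from datetime import datetime, timedelta


def shift_to_covered_day(end_stamps):
    """Relabel end-of-period daily timestamps with the day they cover."""
    return [stamp - timedelta(days=1) for stamp in end_stamps]


# First three labels as they currently appear in the daily diagnostic data.
labels = [datetime(1979, 10, 1), datetime(1979, 10, 2), datetime(1979, 10, 3)]
shifted = shift_to_covered_day(labels)
```

The number of time steps is unchanged; only the labels move back by one day, so 1979-10-01 becomes 1979-09-30.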

amsnyder commented 10 months ago

For the hourly dataset, the first time step is '1979-10-01T00:00:00.000000000'. This would be the rainfall between '1979-09-30T00:00:00.000000000' and '1979-10-01T00:00:00.000000000' - is that right?

amsnyder commented 10 months ago

For wrfxtrm, I guess our options are either to shift the time labels so that the value you get on a given day represents the flux for that day, or to add an attribute integration_length of "flux over prior 24 hours" - but perhaps that is confusing given that it is a flux, rather than an accumulated value.

It sounds like both of our zarr stores currently match the data format of the raw data output, in terms of how the dates/values align, right @pnorton-usgs ? So if we shifted the dates, we would be making the data more intuitive for a data user, but it would now be a slight mismatch with the raw output format?

pnorton-usgs commented 10 months ago

> For the hourly dataset, the first time step is '1979-10-01T00:00:00.000000000'. This would be the rainfall between '1979-09-30T00:00:00.000000000' and '1979-10-01T00:00:00.000000000' - is that right?

For 1979-10-01_00:00:00 it would represent rainfall for 1979-09-30_01:00:00 to 1979-10-01_00:00:00

The raw hourly output time values have not been modified, and in the case of the PREC_ACC_NC variable the integration_length is set to "accumulated over prior 60 minutes". It was only with the daily (and monthly) datasets that I adjusted the time values to reflect an intuitive understanding of what they represent. The integration_length for the PREC_ACC_NC variable was set to "24-hour accumulation" in the daily dataset and to "month accumulation" in the monthly dataset.

amsnyder commented 10 months ago

> For the hourly dataset, the first time step is '1979-10-01T00:00:00.000000000'. This would be the rainfall between '1979-09-30T00:00:00.000000000' and '1979-10-01T00:00:00.000000000' - is that right?

> For 1979-10-01_00:00:00 it would represent rainfall for 1979-09-30_01:00:00 to 1979-10-01_00:00:00

I think we might both have typos lol - we mean 1979-09-30_23:00:00, right? For the hourly data. And for the daily data you aggregated from hourly, that first time step of the dataset is thrown out because 1979-10-01T00:00:00.000000000 would use 1979-10-01T01:00:00.000000000 to 1979-10-02T00:00:00.000000000?

amsnyder commented 10 months ago

> The raw hourly output time values have not been modified, and in the case of the PREC_ACC_NC variable the integration_length is set to "accumulated over prior 60 minutes". It was only with the daily (and monthly) datasets that I adjusted the time values to reflect an intuitive understanding of what they represent. The integration_length for the PREC_ACC_NC variable was set to "24-hour accumulation" in the daily dataset and to "month accumulation" in the monthly dataset.

This makes sense. I guess I am asking if we should add a label like this to the conus404-daily-diagnostic data from wrfxtrm, or if we should adjust the dates to make them more intuitive (but now out of line with the raw output date formatting).

pnorton-usgs commented 10 months ago

> For the hourly dataset, the first time step is '1979-10-01T00:00:00.000000000'. This would be the rainfall between '1979-09-30T00:00:00.000000000' and '1979-10-01T00:00:00.000000000' - is that right?

> For 1979-10-01_00:00:00 it would represent rainfall for 1979-09-30_01:00:00 to 1979-10-01_00:00:00

> I think we might both have typos lol - we mean 1979-09-30_23:00:00, right? For the hourly data. And for the daily data you aggregated from hourly, that first time step of the dataset is thrown out because 1979-10-01T00:00:00.000000000 would use 1979-10-01T01:00:00.000000000 to 1979-10-02T00:00:00.000000000?

I totally missed you were talking about hourly. :)

For the daily timestep: 1979-10-01 would represent rainfall from the hourly for 1979-10-01_01:00:00 to 1979-10-02_00:00:00

pnorton-usgs commented 10 months ago

> The raw hourly output time values have not been modified, and in the case of the PREC_ACC_NC variable the integration_length is set to "accumulated over prior 60 minutes". It was only with the daily (and monthly) datasets that I adjusted the time values to reflect an intuitive understanding of what they represent. The integration_length for the PREC_ACC_NC variable was set to "24-hour accumulation" in the daily dataset and to "month accumulation" in the monthly dataset.

> This makes sense. I guess I am asking if we should add a label like this to the conus404-daily-diagnostic data from wrfxtrm, or if we should adjust the dates to make them more intuitive (but now out of line with the raw output date formatting).

I think we should adjust the dates; IMO it will be a headache for others if we don't.

amsnyder commented 10 months ago

Ok - I am ok with that plan. Maybe you can adjust when you do the update to get the rest of the data through 2022 into the zarr?

amsnyder commented 9 months ago

From Changhai Liu: The data in wrfxtrm files represent the results in the past 24 hours, and the timestamp corresponds to the end time of the 24 hours. The standard timestamp of these files is yyyy-mm-dd_00:00:00. For example, the values in the file with a timestamp 1979-10-02_00:00:00 correspond to the simulation results between 1979-10-01_00:00:00 and 1979-10-02_00:00:00. (note that the time at 1979-10-01_00:00:00 is NOT included.) Since CONUS404 started at 1979-10-01_00:00:00, the first wrfxtrm file (1979-10-01_00:00:00) is all zeros.

As for whether we want to shift the dates in conus404-daily-diagnostic, we will wait for input from NCAR via email.

amsnyder commented 9 months ago

Consider adding a time_bnds variable to denote the period of time each time step represents, and adding an attribute to the time variable that points to the time_bnds variable.
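A rough sketch of what such bounds could look like, assuming each wrfxtrm timestamp labels the end of the 24 hours it covers, as described in the note from Changhai Liu. The helper name `daily_time_bounds` is hypothetical; the `bounds` attribute name follows the CF conventions for linking a coordinate variable to its bounds variable.

```python
from datetime import datetime, timedelta


def daily_time_bounds(end_stamps, length=timedelta(hours=24)):
    """Return (start, end) pairs for timestamps that label the end of a period."""
    return [(stamp - length, stamp) for stamp in end_stamps]


stamps = [datetime(1979, 10, 2), datetime(1979, 10, 3)]
time_bnds = daily_time_bounds(stamps)

# CF-style link from the time variable to its bounds variable.
time_attrs = {"bounds": "time_bnds"}
```

For the 1979-10-02_00:00:00 time step this gives bounds of 1979-10-01_00:00:00 and 1979-10-02_00:00:00. A plain pair of bounds does not say that the start instant is excluded, so that detail would still need to be documented separately.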

hytest-org / hytest

Investigate why wrfxtrm zarr data seems to be offset by a day #409