pacificclimate / pdp

The PCIC Data Portal - Server software to run the entire web application
GNU General Public License v3.0

Max of date range needs to be inclusive of that date, not limiting -- in v 2.8.0 #146

Open faronium opened 4 years ago

faronium commented 4 years ago

This is low priority, but something that may cause some confusion among our users. When specifying a date range, the maximum date (which is pre-populated with the current date) limits the time selection to the midnight separating that date from the previous one. As such, a user looking for the most up-to-date data will need to change the maximum date to the following day in order to get the most recent data. This seems minor, but I think it's inconsistent with the expectations users may have. The effect is compounded because the data are given in UTC, which puts another 7 to 8 hours of offset in the expected date depending on PDT/PST. One could even imagine a user looking for late-afternoon data on the present day needing to offset the maximum date by two days, which would be highly counter-intuitive.

faronium commented 4 years ago

Suggest altering pdp/pdp/static/calendar.js line ~674

from:

return new CfDatetime.fromDatetime(
  this,
  today.getFullYear(), today.getMonth() + 1, today.getDate()
);

to:

return new CfDatetime.fromDatetime(
  this,
  today.getFullYear(), today.getMonth() + 1, today.getDate(), 23, 59
);

I think that will force the bounding date to be end of day...
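
The intent of that change, an upper bound at the end of the selected day rather than its start, can be sketched with plain JavaScript `Date` objects. This is only an illustration: it does not use `CfDatetime`, and `endOfLocalDay` is a hypothetical helper that is not part of `calendar.js`.

```javascript
// Hypothetical helper (not part of calendar.js): return the last millisecond
// of the given calendar day in local time, so that an upper bound built from
// it includes the whole day.
function endOfLocalDay(date) {
  return new Date(
    date.getFullYear(), date.getMonth(), date.getDate(),
    23, 59, 59, 999
  );
}

// Example: a user opens the portal at 09:30 local time on 2019-11-04.
const today = new Date(2019, 10, 4, 9, 30); // months are 0-based in Date
const upperBound = endOfLocalDay(today);

console.log(upperBound.getDate(), upperBound.getHours(), upperBound.getMinutes());
```

Note that this only moves the bound to the end of the local day; it does not address any later reinterpretation of that value as UTC.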

jameshiebert commented 4 years ago

I'm willing to be convinced... but I'm not yet entirely convinced that this is a problem. For a couple of reasons:

  1. This is just a default date. Users are still free and able to change the date to suit their purposes.
  2. The date selection is actually just used to select stations that have data during the time range. Probably 99% of the time, stations that have data for some time t are going to have data for time t - half_a_day so the results will be essentially the same.
  3. This is a near-real-time system for a climatological data archive. I think we're pretty up front about the fact that it doesn't have any operational service-level agreements in place, and no one should expect that data that was collected a few hours ago is necessarily going to be in the archive. Some of our networks update on a monthly basis, so getting into use cases of users downloading data from a few hours ago is definitely not what this is designed to do.

If I misunderstand, can you tell us more about where and how this has caused problems?

faronium commented 4 years ago

I qualify all of this by acknowledging that this is a fairly minor issue. But, because I was asked to clarify, I'm pushing back.

re 1: First, the filter that gets applied to the data shown and downloaded (if clipped to the date range) doesn't actually correspond to the displayed default date. It's off by a mix of the UTC-to-local conversion and an essentially floored-to-date value, and this offset is misleading no matter how minor. Second, because the default looks like the present day, it seems safe to assume that users would expect to get the most recent data. Most or all users wouldn't think to hack the date range to ask for data from the future, even though that is what's needed to return the most recent data.

re 2: No, that is incorrect. If a user chooses to download data clipped to their chosen date range, then the time range will apply. See above for the implications of an incorrectly or misleadingly applied filter. You are right that the stations shown using the default "today" date cutoff will almost always be the same as the stations that would be shown if the cutoff were "now", but that's not really the issue. The problem is that users currently cannot get the most recent observations through intuitive means.

re 3: I agree with this, but the fact that it's a climatological archive doesn't mean that users won't use it for something else. If it's a trivial change to accurately meet the needs of potential users, then it should be applied. If my code review is correct, this is a code change of one line of JavaScript.

Simply put, this has caused very minor problems for me outside of work which is what makes me think it might be a more universal issue that other users have stumbled upon. In the past I have tried to use the data portal to download recent observations to confirm what I experienced or observed during time off recreating. I use it because I am lazy and didn't want to write the SQL to do the query. Anyhow, I was always confused as to why the most recent observations from EC_raw weren't available even though they are ingested into the database often within the hour of their observation. Today that happened again and I set out to figure out why the cut-off in the downloaded data was about 24 hours earlier than the time when I requested the data. I had previously thought that it had something to do with the delayed building of materialized views that help the portal operate but today I realized that that's not a factor in downloading data.

To clarify: the observed behaviour is that the default max date is now constructed as follows:

1) Get today's date in local time, floored to Y/M/D 00:00:00 local.
2) Because of timezone issues in the DB, the above then acts as a filter on the data as if it were UTC, so the effective cutoff becomes Y/M/D 00:00:00 UTC.
3) For observations gathered from BC, this then becomes Y/M/D 00:00:00 UTC -7 or -8, depending on PDT/PST.

Downloading data right now with the default max filter of 2019/11/04 yields data with a max time stamp of 2019/11/03 23:00 (actually 23:12, but that's a crmprtd issue). This corresponds to data that was collected at 2019/11/03 15:00 PST. Setting the max date to 2019/11/05 yields data with a time stamp of 2019/11/04 22:00, corresponding to 2019/11/04 14:00 PST, which is 23 hours ahead of the results given by the default max date. Later today the max date will need to be set two days in advance to capture the most recent data, because the effective flooring of the max date will limit the download to 2019/11/04 23:00 UTC or 2019/11/04 15:00 PST, even though by 23:00 local time today there will be 8 more hours of data with a UTC date of 2019/11/05. In the worst case, the effective cutoff is max t - 32 hours.
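
The arithmetic above can be checked with a short sketch. It assumes a fixed UTC-8 offset for PST (no daylight-saving handling) and that the selected date is applied as midnight UTC, as described in the steps above. The names `PST_OFFSET_HOURS` and `toPstWallClock` are made up for illustration.

```javascript
const HOUR_MS = 60 * 60 * 1000;
const PST_OFFSET_HOURS = -8; // assumption: fixed UTC-8, DST ignored

// The default max date 2019/11/04 ends up applied as 2019-11-04 00:00 UTC.
const cutoffUtcMs = Date.UTC(2019, 10, 4, 0, 0, 0);

// Render a UTC instant as a PST wall-clock string, "YYYY-MM-DD HH:MM".
function toPstWallClock(utcMs) {
  return new Date(utcMs + PST_OFFSET_HOURS * HOUR_MS)
    .toISOString()
    .slice(0, 16)
    .replace("T", " ");
}

// The cutoff as seen by a user in BC during PST.
console.log(toPstWallClock(cutoffUtcMs)); // 2019-11-03 16:00

// Worst case: the user asks at the very end of 2019/11/04 local time,
// i.e. 2019-11-05 00:00 PST, which is 2019-11-05 08:00 UTC.
const endOfLocalDayUtcMs =
  Date.UTC(2019, 10, 5, 0, 0, 0) - PST_OFFSET_HOURS * HOUR_MS;
const worstCaseLagHours = (endOfLocalDayUtcMs - cutoffUtcMs) / HOUR_MS;
console.log(worstCaseLagHours); // 32
```

With hourly observations, a cutoff at 16:00 PST is consistent with the last record returned being the one stamped 15:00 PST.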

I suggest either applying my earlier suggestion, modified to simply t + 24 h - tz offset (which I don't claim to be correct or comprehensive, because I don't know the full code base), or populating the calendar with a stripped version of a fully accurate time stamp and then using that more accurate time stamp, along with the local/UTC offset, to generate the date filter applied to data downloads.

faronium commented 4 years ago

A set of files to illustrate that this error can be greater than one day, so the fix needs t + 1 day - (tz_offset_hrs / 24), where tz_offset_hrs is negative.

1018611_1104.xlsx 1018611_1105.xlsx 1018611_1106.xlsx
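
As a rough sketch of that correction, the function below takes the selected calendar date, treats its midnight as UTC (matching the behaviour described earlier), and then adds one day and subtracts the negative timezone offset. `inclusiveUpperBoundUtcMs` is a hypothetical name, and the fixed offset is an assumption. A real fix would need to look up the offset for the date in question.

```javascript
const MS_PER_HOUR = 60 * 60 * 1000;

// Hypothetical helper: t + 1 day - tz_offset, with t being midnight of the
// selected date interpreted as UTC and tzOffsetHours negative west of UTC.
// month is 1-based here (November = 11).
function inclusiveUpperBoundUtcMs(year, month, day, tzOffsetHours) {
  const midnightAsUtcMs = Date.UTC(year, month - 1, day);
  return midnightAsUtcMs + 24 * MS_PER_HOUR - tzOffsetHours * MS_PER_HOUR;
}

// Selected max date 2019/11/04 with PST (UTC-8):
const bound = new Date(inclusiveUpperBoundUtcMs(2019, 11, 4, -8));
console.log(bound.toISOString()); // 2019-11-05T08:00:00.000Z
```

The result, 2019-11-05 08:00 UTC, is midnight PST at the end of 2019/11/04, so every observation taken on the selected local day falls at or before the bound.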

rod-glover commented 4 years ago

I can see the problem with the "current date" being used to clip downloaded data. The combination of UTC-tz and "today = UTC hour 00:00:00" does add up to a potential error of 24 + 8 = 32 hours in the worst case in our time zone, worse further west (Hawaii?).

James, I think the point is not so much that we only offer near-real-time data, but that the actual meaning of a date of "today" is not at all obvious to the user, and can lead to some very puzzling time constraints being applied to data downloads.

As to assigning a priority to this, as you both point out, this problem probably has a relatively low impact -- meaning few affected. What makes this of more concern is the mysterious nature of the behaviour. It seems to me to be a little analogous to one of those low-probability, high-impact risks: it's unlikely to happen, but when it happens, you're pretty unhappy.

There may be a simpler solution, which is to allow no (null, empty) value in the time selector, and have that mean -- and be clearly indicated to mean -- "no upper bound"/"latest observation". The downside of this solution is that it does leave the mysterious meaning of "today" in place.
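
A minimal sketch of how an empty upper bound could be interpreted, assuming time stamps are compared as milliseconds since the epoch. `withinRange` and the sample observations are hypothetical and do not reflect how the PDP actually applies its filter.

```javascript
// Hypothetical filter: a null bound means "unbounded" on that side.
function withinRange(obsTimeMs, minMs, maxMs) {
  const afterMin = minMs === null || obsTimeMs >= minMs;
  const beforeMax = maxMs === null || obsTimeMs <= maxMs;
  return afterMin && beforeMax;
}

// Made-up observation times for illustration only.
const observations = [
  Date.UTC(2019, 10, 3, 23, 0),
  Date.UTC(2019, 10, 4, 22, 0),
  Date.UTC(2019, 10, 5, 6, 0),
];

const minMs = Date.UTC(2019, 10, 1);

// With a max date applied as midnight UTC, later records are dropped.
const clipped = observations.filter(
  (t) => withinRange(t, minMs, Date.UTC(2019, 10, 4))
);

// With no upper bound, everything up to the latest observation is kept.
const unbounded = observations.filter((t) => withinRange(t, minMs, null));

console.log(clipped.length, unbounded.length); // 1 3
```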

What I can't say right now (i.e., without some time spent investigating), is how big an effort either proposed change would be. The PDP code can be gnarly in places, and even with Faron's helpful research, there may be complexities here that will bite us.

faronium commented 4 years ago

Thanks for addressing this Rod. I think this should be addressed at some point. But I'm a perfectionist which gets me nowhere.

Thanks