pacificclimate / station-data-portal

Data portal for station data (e.g., BC Station Data, a.k.a. PCDS; Yukon-NWT Station Data); React app
0 stars 0 forks source link

Review climatology filter #160

Closed jameshiebert closed 1 year ago

jameshiebert commented 1 year ago

The climatology filter in the PDP and the station data portal is reporting more available climatologies that are actually available. If we look in the database and check the number of climatologies available, we see that it's around 41k.

crmp=> select count(*) from obs_raw natural join meta_vars where net_var_name in ('Tx_Climatology', 'Tn_Climatology', 'Precip_Climatology');
 count 
-------
 41285
(1 row)

However, the BC PCDS portal is reporting around 25M

image

Likewise, we don't have any climatologies for the Yukon/Nortwest Territories, however the portal is reporting that we do.

I suspect that this stems from the fact that when we first added climatologies we didn't fully utilize the cell_method attribute, and we used cell_method ~ '(within|over)' as a proxy for climatological variables. Since then we have taken to using within and over as part of cell_method often and I believe that all of these variables are (incorrectly) being reported as climatologies:

crmp=> select net_var_name, cell_method from meta_vars where cell_method ~ '(within|over)' and net_var_name !~ 'Climatology';
       net_var_name        |             cell_method              
---------------------------+--------------------------------------
 HOURLY_PRECIPITATION      | t: sum within days interval: daily
 PRECIPITATION_GAUGE_TOTAL | t: sum over days interval: irregular
 Precip                    | t: sum over days interval: irregular
 RGuage_mm                 | t: sum over days interval: irregular
(4 rows)
rod-glover commented 1 year ago

The discrimination of climatology variables is implemented in the frontend, so this issue is in the right repo.

As noted, the current discriminator is, in JS regex, /(within|over)/.test(cell_method). I think that was copied from the legacy station data portal.

We need a new discriminator, ideally one that is robust and technically correct.

Alternative A: The query above suggests that we can use /Climatology/.test(net_var_name). There are 75 such variables, and their records indicate clearly that they are climatology variables. I am not sure how robust that is, in the sense that it is highly subject to human error when naming variables, and it constrains variable names. It's not technical so much as heuristic, which may be okay and/or the best we can actually do. But let's consider other options.

Aside: We could make the discriminator a configuration value; that is, use a regex (and variable specification) that is configured externally.

Alternative B: Results from the net_var_name query reveals that long_description also always includes the term "Climatological". For example, "Climatological mean of monthly total precipitation". This is no more robust or technical than the net_var_name test.

Alternative C: Refine the cell_method matching. At present, all the records matched on /Climatology/.test(net_var_name) have cell_method ending in t: mean over years. That alone is not really robust enough, but we could expand that test as follows:

Alternative D: It would be nice to use as a discriminator the case that a variable is linked to network 'PCIC Climate Variables'. However only three of the 75 variables identified as climatological above match this condition. I think it's a no-go unless the database is in error and can (i.e., should) be fixed so that all climatological variables are linked to that network. I see an obvious objection to that: We evidently at present use the network link to identify what network the climatology is computed from (presumably, it is computed from observational variables in that network). Linking all climo vars to the climatology network would destroy that information. Sounds like a bad idea. Which raises the question: What does the climatology network actually signify?

rod-glover commented 1 year ago

After a little thought, I like Alternative C: refine the cell_method matching. It is the most technically correct. It only (hah) depends on cell_method being maintained correctly, and it does look at present as if that is the case.

I also like the idea of making this configurable, so that we don't have to make a new release every time we discover that our discrimination is not what we want.

rod-glover commented 1 year ago

Proceeding with Alternative C.

We have a standard applicable to discerning climatological variables from cell_method (properly, cell_methods, but someone made an error a long time back in our database definition). Ref. NetCDF Climate and Forecast (CF) Metadata Conventions, section 7.4. Climatological Statistics.

Summary: cell_method(s) for climatological variables must take one of the following 3 forms:

  1. time: method1 within years time: method2 over years
  2. time: method1 within days time: method2 over days
  3. time: method1 within days time: method2 over days time: method3 over years

We have to check whether the vars we have identified as climatological match this pattern.

Note: We have extended this to cover streamflow. The commentary is not entirely reassuring, but maybe we're good inside the CRMP type databases, and/or we can fix them. If this proves too big a task, we will fall back to simpler non-standard tests.

rod-glover commented 1 year ago

Check existing cell_method values in CRMP and Metnorth.

The 3 forms above can be expressed (horribly) as regular expressions, even omitting a few methods that I think unlikely:

  1. time: method1 within years time: method2 over years
    (t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+years\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years
  2. time: method1 within days time: method2 over days
    (t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days
  3. time: method1 within days time: method2 over days time: method3 over years
    (t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years

Let's see if we have any matches in meta_vars.

Form 1:

crmp=> select 't: minimum within years t: mean over years' ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+years\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
 ?column? 
----------
 t
(1 row)

crmp=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+years\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
 vars_id | network_id | net_var_name | cell_method 
---------+------------+--------------+-------------
(0 rows)

metnorth=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+years\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
 vars_id | network_id | net_var_name | cell_method 
---------+------------+--------------+-------------
(0 rows)

Form 2:

crmp=> select 'time: maximum within days   time: mean over days' ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days';
 ?column? 
----------
 t
(1 row)

crmp=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days';
 vars_id | network_id | net_var_name | cell_method 
---------+------------+--------------+-------------
(0 rows)

metnorth=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days';
 vars_id | network_id | net_var_name | cell_method 
---------+------------+--------------+-------------
(0 rows)

Form 3:

crmp=> select 'time: minimum within days   time: mean over days   time: mean over years' ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
 ?column? 
----------
 t
(1 row)

crmp=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
 vars_id | network_id | net_var_name | cell_method 
---------+------------+--------------+-------------
(0 rows)

metnorth=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
 vars_id | network_id | net_var_name | cell_method 
---------+------------+--------------+-------------
(0 rows)

The test queries show that these regexs are not just plain wrong. And the queries against meta_vars show there are no vars with valid climatological cell_method values.

Immediate conclusion: Argh.

Longer conclusion: The task of correcting the cell_method for each climatological variable is likely infeasible for a couple reasons: Workload and numbers.

Workload means that we'd have to correctly reformulate the cell_value for each climatological variable. This takes some expertise and likely requires the assistance of several scientists.

Numbers: There are 75 climatological variables in CRMP, 0 (?!) in Metnorth:

crmp=> select count(*) from meta_vars where net_var_name ~ 'Climat';
 count 
-------
    75

metnorth=> select count(*) from meta_vars where net_var_name ~ 'Climat';
 count 
-------
     0

Summary: We can't use Alternative C.

rod-glover commented 1 year ago

We can fall back on Alternative A, which is to test net_var_name for a match with 'Climatology'. Blindingly easy. Possibly correct.

rod-glover commented 1 year ago

At the moment, we have to use Alternative "B2", which is display_name ~ 'Climatology. Why? Because the backend at this point does not return either net_var_name or long_description. But it does return display_name, and that test matches the existing known climo vars. :sweat:

We can update the metadata backend to return more attributes.

rod-glover commented 1 year ago

The issue does not reside entirely within this repo or station-data-portal-backend (metadata service). The fundamental data is retrieved from a manual matview in the CRMP database. That matview contains climatological observation counts and is implemented through table climo_obs_count_mv, which is updated using the view climo_obs_count_v, with definition

CREATE OR REPLACE VIEW crmp.climo_obs_count_v
 AS
 SELECT count(*) AS count,
    obs_raw.history_id
   FROM obs_raw
     JOIN meta_vars USING (vars_id)
  WHERE meta_vars.cell_method::text ~ '(within|over)'::text
  GROUP BY obs_raw.history_id;

Note the faulty match for on cell_method. We will have to change this to make this table contain valid data. Then there is a hope that the frontend can provide sensible counts.

The hard way to do this is to modify PyCDS. This will come later. Right now the cheap and easy way is to manually update climo_obs_count_v as follows:

CREATE OR REPLACE VIEW crmp.climo_obs_count_v
 AS
 SELECT count(*) AS count,
    obs_raw.history_id
   FROM obs_raw
     JOIN meta_vars USING (vars_id)
  WHERE meta_vars.display_name::text ~ 'Climatology'::text
  GROUP BY obs_raw.history_id;
rod-glover commented 1 year ago

Way too many moving parts here.

rod-glover commented 1 year ago

Ideas (one) on how to reduce the complexity of this situation:

UPDATE: type attribute replaced by tags attribute which allows for multiple tags describing type of var. Otherwise the idea is the same.

rod-glover commented 1 year ago

There is a fairly serious problem with climo_obs_count_mv. It appears not to be updated in the way one would hope. The following query shows what I mean:

crmp=> with coc_live as (
    select count(obs_raw_id) as count, obs_raw.history_id
    from obs_raw join meta_vars using (vars_id) 
    where meta_vars.display_name::text ~ 'Climatology'::text
    group by obs_raw.history_id
)
select
    coc_mv.history_id as mv_hx_id, 
    coc_mv.count as mv_count, 
    coc_live.history_id as live_hx_id, 
    coc_live.count as live_count
from 
    climo_obs_count_mv coc_mv 
    full outer join coc_live on coc_mv.history_id = coc_live.history_id
where
    coc_mv.history_id is distinct from coc_live.history_id
    or coc_mv.count is distinct from coc_live.count;

 mv_hx_id | mv_count | live_hx_id | live_count 
----------+----------+------------+------------
     3550 |   326612 |       3550 |         28
     3474 |    39453 |       3474 |         24
     3475 |   297261 |       3475 |         31
     3629 |    14030 |       3629 |         21
     3420 |   172279 |       3420 |         46
     3665 |    33462 |            |           
     3465 |   181496 |       3465 |         24
     3456 |    45977 |       3456 |         15
     3404 |   292012 |       3404 |         24
     3650 |   223284 |       3650 |         48
     3415 |   306640 |       3415 |         48
     3457 |    13980 |       3457 |         22
     3570 |    40419 |       3570 |         12
     3403 |    38507 |       3403 |          9
     3605 |       17 |       3605 |         15
     3592 |    16912 |            |           
     3402 |    70091 |       3402 |         23
     3638 |     3566 |            |           
     3606 |    28516 |            |           
     3591 |        9 |            |           
     3662 |       42 |       3662 |         36
     3542 |   116700 |       3542 |         24
     3516 |    42773 |       3516 |         21
     3630 |     5574 |            |           
     3454 |    45761 |       3454 |         15
     3631 |   213875 |       3631 |         48
    14543 |        1 |            |           
     3401 |    33521 |       3401 |         21
     3546 |    84050 |            |           
    14532 |    24239 |            |           
     3458 |   334993 |       3458 |         45
(31 rows)

The CTE coc_live is the query from climo_obs_count_v, which plays a role in populating climo_obs_count_mv.

The full outer join shows us every record on which climo_obs_count_mv and coc_live differ. In every case, either

In either case, the question is, how did that get to be the case? I looked at the legacy "manual matview" machinery.

It's probably not worth digging further into the legacy manual matview machinery/code; it is complicated, not documented, and will be replaced soon.

For this particular table (there are others), there are two fixes:

I'm thinking to do the expedient fix first: truncate, etc.

rod-glover commented 1 year ago

Summary:

Details: