Closed jameshiebert closed 1 year ago
The discrimination of climatology variables is implemented in the frontend, so this issue is in the right repo.
As noted, the current discriminator is, in JS regex, /(within|over)/.test(cell_method)
. I think that was copied from the legacy station data portal.
We need a new discriminator, ideally one that is robust and technically correct.
Alternative A: The query above suggests that we can use /Climatology/.test(net_var_name)
. There are 75 such variables, and their records indicate clearly that they are climatology variables. I am not sure how robust that is, in the sense that it is highly subject to human error when naming variables, and it constrains variable names. It's not technical so much as heuristic, which may be okay and/or the best we can actually do. But let's consider other options.
Aside: We could make the discriminator a configuration value; that is, use a regex (and variable specification) that is configured externally.
Alternative B: Results from the net_var_name
query reveals that long_description
also always includes the term "Climatological". For example, "Climatological mean of monthly total precipitation". This is no more robust or technical than the net_var_name
test.
Alternative C: Refine the cell_method
matching. At present, all the records matched on /Climatology/.test(net_var_name)
have cell_method
ending in t: mean over years
. That alone is not really robust enough, but we could expand that test as follows:
mean over months
valid?)t
)? probably should leave this alone.Alternative D: It would be nice to use as a discriminator the case that a variable is linked to network 'PCIC Climate Variables'. However only three of the 75 variables identified as climatological above match this condition. I think it's a no-go unless the database is in error and can (i.e., should) be fixed so that all climatological variables are linked to that network. I see an obvious objection to that: We evidently at present use the network link to identify what network the climatology is computed from (presumably, it is computed from observational variables in that network). Linking all climo vars to the climatology network would destroy that information. Sounds like a bad idea. Which raises the question: What does the climatology network actually signify?
After a little thought, I like Alternative C: refine the cell_method
matching. It is the most technically correct. It only (hah) depends on cell_method
being maintained correctly, and it does look at present as if that is the case.
I also like the idea of making this configurable, so that we don't have to make a new release every time we discover that our discrimination is not what we want.
Proceeding with Alternative C.
We have a standard applicable to discerning climatological variables from cell_method
(properly, cell_methods
, but someone made an error a long time back in our database definition). Ref. NetCDF Climate and Forecast (CF) Metadata Conventions, section 7.4. Climatological Statistics.
Summary: cell_method(s)
for climatological variables must take one of the following 3 forms:
time: method1 within years time: method2 over years
time: method1 within days time: method2 over days
time: method1 within days time: method2 over days time: method3 over years
We have to check whether the vars we have identified as climatological match this pattern.
Note: We have extended this to cover streamflow. The commentary is not entirely reassuring, but maybe we're good inside the CRMP type databases, and/or we can fix them. If this proves too big a task, we will fall back to simpler non-standard tests.
Check existing cell_method
values in CRMP and Metnorth.
The 3 forms above can be expressed (horribly) as regular expressions, even omitting a few methods that I think unlikely:
time: method1 within years time: method2 over years
(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+years\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years
time: method1 within days time: method2 over days
(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days
time: method1 within days time: method2 over days time: method3 over years
(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years
Let's see if we have any matches in meta_vars
.
Form 1:
crmp=> select 't: minimum within years t: mean over years' ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+years\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
?column?
----------
t
(1 row)
crmp=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+years\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
vars_id | network_id | net_var_name | cell_method
---------+------------+--------------+-------------
(0 rows)
metnorth=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+years\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
vars_id | network_id | net_var_name | cell_method
---------+------------+--------------+-------------
(0 rows)
Form 2:
crmp=> select 'time: maximum within days time: mean over days' ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days';
?column?
----------
t
(1 row)
crmp=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days';
vars_id | network_id | net_var_name | cell_method
---------+------------+--------------+-------------
(0 rows)
metnorth=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days';
vars_id | network_id | net_var_name | cell_method
---------+------------+--------------+-------------
(0 rows)
Form 3:
crmp=> select 'time: minimum within days time: mean over days time: mean over years' ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
?column?
----------
t
(1 row)
crmp=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
vars_id | network_id | net_var_name | cell_method
---------+------------+--------------+-------------
(0 rows)
metnorth=> select vars_id, network_id, net_var_name, cell_method from meta_vars where cell_method ~ '(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+within\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+days\s+(t|time):\s+(sum|maximum|median|mid_range|minimum|mean|mode|root_mean_square|standard_deviation|variance)\s+over\s+years';
vars_id | network_id | net_var_name | cell_method
---------+------------+--------------+-------------
(0 rows)
The test queries show that these regexs are not just plain wrong. And the queries against meta_vars
show there are no vars with valid climatological cell_method
values.
Immediate conclusion: Argh.
Longer conclusion: The task of correcting the cell_method
for each climatological variable is likely infeasible for a couple reasons: Workload and numbers.
Workload means that we'd have to correctly reformulate the cell_value
for each climatological variable. This takes some expertise and likely requires the assistance of several scientists.
Numbers: There are 75 climatological variables in CRMP, 0 (?!) in Metnorth:
crmp=> select count(*) from meta_vars where net_var_name ~ 'Climat';
count
-------
75
metnorth=> select count(*) from meta_vars where net_var_name ~ 'Climat';
count
-------
0
Summary: We can't use Alternative C.
We can fall back on Alternative A, which is to test net_var_name
for a match with 'Climatology'. Blindingly easy. Possibly correct.
At the moment, we have to use Alternative "B2", which is display_name ~ 'Climatology
. Why? Because the backend at this point does not return either net_var_name
or long_description
. But it does return display_name
, and that test matches the existing known climo vars. :sweat:
We can update the metadata backend to return more attributes.
The issue does not reside entirely within this repo or station-data-portal-backend (metadata service). The fundamental data is retrieved from a manual matview in the CRMP database. That matview contains climatological observation counts and is implemented through table climo_obs_count_mv
, which is updated using the view climo_obs_count_v
, with definition
CREATE OR REPLACE VIEW crmp.climo_obs_count_v
AS
SELECT count(*) AS count,
obs_raw.history_id
FROM obs_raw
JOIN meta_vars USING (vars_id)
WHERE meta_vars.cell_method::text ~ '(within|over)'::text
GROUP BY obs_raw.history_id;
Note the faulty match for on cell_method. We will have to change this to make this table contain valid data. Then there is a hope that the frontend can provide sensible counts.
The hard way to do this is to modify PyCDS. This will come later. Right now the cheap and easy way is to manually update climo_obs_count_v
as follows:
CREATE OR REPLACE VIEW crmp.climo_obs_count_v
AS
SELECT count(*) AS count,
obs_raw.history_id
FROM obs_raw
JOIN meta_vars USING (vars_id)
WHERE meta_vars.display_name::text ~ 'Climatology'::text
GROUP BY obs_raw.history_id;
Way too many moving parts here.
Ideas (one) on how to reduce the complexity of this situation:
type
attribute to the responses returned by /variables
. The attribute would take values observation
or climatology
, determined in the backend. The set of values can be extended as needed. This at least moves detailed knowledge of the database into the backend. ~It's not clear that this is much of an improvement.~ On reflection, I think it is. The further back we can put the knowledge about how "climatology" vs. "observation" is distinguished, the better. The frontend should be blissfully ignorant of these things. See https://github.com/pacificclimate/station-data-portal-backend/issues/51UPDATE: type
attribute replaced by tags
attribute which allows for multiple tags describing type of var. Otherwise the idea is the same.
There is a fairly serious problem with climo_obs_count_mv
. It appears not to be updated in the way one would hope. The following query shows what I mean:
crmp=> with coc_live as (
select count(obs_raw_id) as count, obs_raw.history_id
from obs_raw join meta_vars using (vars_id)
where meta_vars.display_name::text ~ 'Climatology'::text
group by obs_raw.history_id
)
select
coc_mv.history_id as mv_hx_id,
coc_mv.count as mv_count,
coc_live.history_id as live_hx_id,
coc_live.count as live_count
from
climo_obs_count_mv coc_mv
full outer join coc_live on coc_mv.history_id = coc_live.history_id
where
coc_mv.history_id is distinct from coc_live.history_id
or coc_mv.count is distinct from coc_live.count;
mv_hx_id | mv_count | live_hx_id | live_count
----------+----------+------------+------------
3550 | 326612 | 3550 | 28
3474 | 39453 | 3474 | 24
3475 | 297261 | 3475 | 31
3629 | 14030 | 3629 | 21
3420 | 172279 | 3420 | 46
3665 | 33462 | |
3465 | 181496 | 3465 | 24
3456 | 45977 | 3456 | 15
3404 | 292012 | 3404 | 24
3650 | 223284 | 3650 | 48
3415 | 306640 | 3415 | 48
3457 | 13980 | 3457 | 22
3570 | 40419 | 3570 | 12
3403 | 38507 | 3403 | 9
3605 | 17 | 3605 | 15
3592 | 16912 | |
3402 | 70091 | 3402 | 23
3638 | 3566 | |
3606 | 28516 | |
3591 | 9 | |
3662 | 42 | 3662 | 36
3542 | 116700 | 3542 | 24
3516 | 42773 | 3516 | 21
3630 | 5574 | |
3454 | 45761 | 3454 | 15
3631 | 213875 | 3631 | 48
14543 | 1 | |
3401 | 33521 | 3401 | 21
3546 | 84050 | |
14532 | 24239 | |
3458 | 334993 | 3458 | 45
(31 rows)
The CTE coc_live
is the query from climo_obs_count_v
, which plays a role in populating climo_obs_count_mv
.
The full outer join shows us every record on which climo_obs_count_mv
and coc_live
differ. In every case, either
climo_obs_count_mv
contains a history id for which there are at present no climo observations climo_obs_count_mv
contains a history id for which the count differs, usually much much higher than the count from the live countIn either case, the question is, how did that get to be the case? I looked at the legacy "manual matview" machinery.
It's probably not worth digging further into the legacy manual matview machinery/code; it is complicated, not documented, and will be replaced soon.
For this particular table (there are others), there are two fixes:
climo_obs_count_mv
climo_obs_count_v
. matview
as follows (copied from function crmp.climo_obs_count_mv_refresh
):
UPDATE matviews
SET last_refresh = now()
WHERE mv_name = 'climo_obs_count_mv';
This will prevent the manual matview machinery from getting confused.
I'm thinking to do the expedient fix first: truncate, etc.
Summary:
Details:
variable_tags
, to the CRMP database, which provides a SSOT for classifying variables (currently, as climatology, observation). (Function soon will be added to Metnorth.)climo_obs_count_mv
and its supporting view climo_obs_count_v
. The updated view uses function variable_tags
(SSOT) to discern what observations should be counted. For details, see this Redmine ticket./variables
endpoint to include a new attribute tags
that is derived from the SSOT function variable_tags
.
The climatology filter in the PDP and the station data portal is reporting more available climatologies that are actually available. If we look in the database and check the number of climatologies available, we see that it's around 41k.
However, the BC PCDS portal is reporting around 25M
Likewise, we don't have any climatologies for the Yukon/Nortwest Territories, however the portal is reporting that we do.
I suspect that this stems from the fact that when we first added climatologies we didn't fully utilize the
cell_method
attribute, and we usedcell_method ~ '(within|over)'
as a proxy for climatological variables. Since then we have taken to usingwithin
andover
as part ofcell_method
often and I believe that all of these variables are (incorrectly) being reported as climatologies: