Closed rod-glover closed 8 months ago
Let's consider what CollapsedVariables
(collapsed_vars_mv
, hereafter CV) is for.
CrmpNetworkGeoserver
.collapsed_vars_mv_refresh
, collapsed_vars_mv_refresh_row
, which are used to maintain the legacy manual matview.pydap-extras
, CV is not used directly anywhere.pdp_util
, CV is not used directly anywhere.pdp
, CV is referred to in the documentation for how (now incorrectly; see above) an identifier is formed from the standard_name
and cell_method
columns.In turn, let's consider what CrmpNetworkGeoserver
(crmp_network_geoserver
, herafter CNG) is for.
pydap-extras
, CNG is not used directly anywhere.pdp_util
, CNG is used:
counts.py
, to obtain station_id
s -- not relevantfilters.py
, the column vars
is used:input-var
values in the query parameters. This is relevant.util.py
, the function get_stn_list
uses CNG in its queries. It accepts two parameters:sql_constraints
, which is elsewhere passed filters noted above. This can include reference to CNG.vars
.to_select
, default [CNG.network_name, CNG.native_id]
. and could potentially take any value, including CrmpNetworkGeoserver.vars
. vars
as it is because client apps (namely, Met Data Portal) depend on it. It can be removed if MDP is updated to submit different pdp
, CNG is not used directly anywhere in branches master
or pcds-only
. There is considerable mention of it in documentation, which should be removed. station-data-portal-backend
, CNG content is served by the endpoint /crmp_network_geoserver
. That endpoint is not used by station-data-portal
, and IIRC, it was used only manually for experimental and debugging purposes.Summary:
CrmpNetworkGeoserver
.pdp_util
, partially under correction.station-data-portal-backend
, for unimportant, experimental purposes.Conclusions:
CV.vars
. variable_tags
column to CV to support the correction in https://github.com/pacificclimate/pdp_util/issues/50 .standard_name
and cell_method
, forms a column that is probably completely unnecessary.~display_names
.~Getting rid of code... I like it!
Key question: How should we update CV to support the driving use case, #50?
Answer:
The problem in pdp_util is a SQLAlchemy clause of the following form:
or_(CrmpNetworkGeoserver.vars.like("%within%"), CrmpNetworkGeoserver.vars.like("%over%"))
We need to replace it with an expression over all variables for a given (station) history of the form variable_tags(Variable).contains(array(['climatology']))
.
CrmpNetworkGeoserver.vars
is reproduced directly from CollapsedVariables.vars
for each history (history_id
). Those vars columns roll up information from all variables related to a given history. That roll-up is more generally replaced by
The same argument applies to CNG.display_names
/ CV.display_names
.
Therefore our work here is simply to replace those columns in CV with an array aggregation of the relevant variable id's.
Forming an aggregate of the related variable id's is probably not the best way to do this: Use of that aggregate will always involve de-aggregating (unnest
) it and then joining to Variable. But we already have an unaggregated version of this table -- indeed it is used in the definition of CV -- called VarsPerHistory
(hereafter VPH). VPH is a simple many:many association of History.history_id
to Variable.vars_id
. So we'd be better off just using VPH in such a query.
But the existence of CV in addition to VPH implies that forming these aggregates is costly and that it is (was) worth preserving the results in a separate matview. In that case we need to replace the now-useless aggregate column vars
with a more useful one such as all_variable_tags
, which would aggregate (set/array union) values of variable_tags(Variable)
over all variables per history into a single array, which can then be checked for whether it contains a specific value (e.g., 'climatology'). The expression
or_(CrmpNetworkGeoserver.vars.like("%within%"), CrmpNetworkGeoserver.vars.like("%over%"))
would then be replaced directly by
CollapsedVariables.all_variable_tags.contains(array(['climatology']))
Under the assumption that it is still unfeasibly costly to compute the aggregates on the fly, this is the logical and lowest-effort correction.
It is worth checking that assumption, however, as the alternative -- on the fly computations -- would allow us to drop this matview altogether. @jameshiebert , do you have any thoughts about this?
A further complication -- the usage of CNG/CV in pdp_util at present seems to require a pre-aggregated form. In order to use an unaggregated version (i.e., VPH), code in pdp_util may have to be revised considerably, which would require effort and risk errors. That alone may drive the decision.
At present I'm headed for the lowest-effort version, which is to update CV with a new column all_variable_tags
, defined as follows:
with xxx as (
select
distinct
history_id,
unnest(variable_tags(meta_vars)) as a
from vars_per_history_mv natural join meta_vars
)
select
history_id,
array_agg(a order by a) all_variable_tags
from xxx
group by history_id
order by history_id
There is probably a tighter way to do this, but this at least works, and can form the basis for a tighter formulation.
Here's a variant, somewhat tighter, for CV and its usage.
CTE aggregated_vars
is a helper for formulating CV. Each of the aggregated columns in it contains an array of values ordered by vars_id
, grouped by history_id
. By itself it is an interesting query and may be useful in other contexts.
CTE collapsed_vars_v
is the updated version of CV. It collapses columns from collapsed_vars_array_v
into the final form.
all_var_tags
is collapsed by flattening the nested arrays of variable tags and selecting only the distinct elements. display_names
is collapsed by mapping it to a single string. Finally we show usage of the new CV in the select following it: Select only those rows (histories) with an associated climatology variable. That is in fact what all the fuss is about.
with aggregated_vars as (
select
history_id
, array_agg(vars_id order by vars_id) as vars_ids
, array_agg(variable_tags(meta_vars) order by vars_id) as var_tags
, array_agg(display_name order by vars_id) as display_names
from vars_per_history_mv natural join meta_vars
group by history_id
),
collapsed_vars_v as (
select
history_id
, vars_ids
, array(select distinct * from unnest(var_tags)) as all_var_tags
, array_to_string(display_names, '|') as display_names
from aggregated_vars
)
select *
from collapsed_vars_v
where 'climatology' = any(all_var_tags)
Example query results:
404 {429,430,431,432,559} {observation,climatology} "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
406 {429,430,431,432,559} {observation,climatology} "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
407 {429,430,431,559} {observation,climatology} "Precipitation Amount|Rainfall Amount|Snowfall Amount|Precipitation Climatology"
409 {429,430,431,432,559} {observation,climatology} "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
410 {429,430,431,432,559} {observation,climatology} "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
412 {429,430,431,432,559} {observation,climatology} "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
...
Executing these queries against the full CRMP database is not very time-consuming, < 1 s. It seems as if the maintenance of CV as a matview might be unnecessary, and it could be packaged as a view.
I'm also planning to define the uncollapsed query aggregated_vars
as a view, since it may have some utility in future, and hiding it as a CTE inside another query seems counterproductive.
To summarize:
aggregated_vars
as above.collapsed_vars_mv
as above.crmp_network_geoserver
as described below.display_names
has any current use, but it costs little to leave it in place for now and it could avoid some trouble.Updated query for crmp_network_geoserver
:
SELECT
...
collapsed_vars_mv.all_var_tags,
collapsed_vars_mv.display_names
FROM ...
WHERE ...
The only change is to replace collapsed_vars_mv.vars
with collapsed_vars_mv.all_var_tags
.
Migrating CRMP is a fairly big undertaking, even using the pared-down process. As a temporary expedient, prior to applying the migration(s) that implement the above changes, we could do all the things above manually. This makes me nervous, but in the service of solving a serious problem quickly, it is worth considering.
On further thought, there's no current need to establish aggregated_vars
as an independently standing view. If it turns out it would be useful, we can extract it as a view later. Less overhead.
This matview now contains rows with column
vars
with almost certainly undesirable values such asThis is probably due to an unanticipated (but valid) change in the content of the
cell_method
column for some variables, which means the query for this view/matview is wrong.At present, he conversion of each contributing
cell_method
column is expressed in SQLAlchemy byfunc.regexp_replace(Variable.cell_method, "time: ", "_", "g")
. The intention of that replacement is apparently to create a conventional programming-language identifier. The obvious generalization would be to allow for any dimension preceding a colon (so, e.g., "t:"), and to replace all spaces with underscores. That generalization would still not handle all cases, but it would handle many more; it might also produce unfeasibly long identifiers.Investigate, determine correct query, and create a migration to update the matview.