Query for materialized view `collapsed_vars_mv` is likely wrong

rod-glover commented 1 year ago

This matview now contains rows with column vars with almost certainly undesirable values such as

lwe_thickness_of_precipitation_amountt: sum within months t: mean over years, ...

This is probably due to an unanticipated (but valid) change in the content of the cell_method column for some variables, which means the query for this view/matview is wrong.

At present, the conversion of each contributing cell_method column is expressed in SQLAlchemy by func.regexp_replace(Variable.cell_method, "time: ", "_", "g"). The intention of that replacement is apparently to create a conventional programming-language identifier. The obvious generalization would be to allow for any dimension preceding a colon (so, e.g., "t:"), and to replace all spaces with underscores. That generalization would still not handle all cases, but it would handle many more; it might also produce infeasibly long identifiers.
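To make the failure concrete, here is a small Python sketch that mimics the replacement with re.sub. It assumes the identifier is simply standard_name with the converted cell_method appended, which is consistent with the bad value quoted above. The function names are hypothetical and this is not the PyCDS code.

```python
import re


def current_identifier(standard_name, cell_method):
    # Mimics regexp_replace(cell_method, 'time: ', '_', 'g'):
    # only the literal dimension name "time" is handled.
    return standard_name + re.sub(r"time: ", "_", cell_method)


def generalized_identifier(standard_name, cell_method):
    # Proposed generalization: accept any dimension name before the colon,
    # then replace all remaining spaces with underscores.
    collapsed = re.sub(r"\w+: ", "_", cell_method)
    return standard_name + collapsed.replace(" ", "_")


name = "lwe_thickness_of_precipitation_amount"
new_style = "t: sum within months t: mean over years"

unchanged = current_identifier(name, new_style)
generalized = generalized_identifier(name, new_style)
```

With a cell_method that uses "t:" instead of "time:", current_identifier leaves the cell_method untouched, reproducing the bad value above. generalized_identifier does yield an identifier, but a long one with a doubled underscore where a space precedes the second "t:", which illustrates that the generalization is still imperfect.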

Investigate, determine correct query, and create a migration to update the matview.

rod-glover commented 9 months ago

Let's consider what CollapsedVariables (collapsed_vars_mv, hereafter CV) is for.

CV is a matview, so evidently it is used to speed up otherwise infeasibly slow queries.
In PyCDS, CV is used in exactly one place, the view CrmpNetworkGeoserver.
In the CRMP database, CV is used only in the legacy functions collapsed_vars_mv_refresh, collapsed_vars_mv_refresh_row, which are used to maintain the legacy manual matview.
In pydap-extras, CV is not used directly anywhere.
In pdp_util, CV is not used directly anywhere.
In pdp, CV is referred to in the documentation for how (now incorrectly; see above) an identifier is formed from the standard_name and cell_method columns.

In turn, let's consider what CrmpNetworkGeoserver (crmp_network_geoserver, hereafter CNG) is for.

CNG is a view.
CNG is documented in PyCDS as follows: "This view is used by the PDP Geoserver backend for generating station map layers." Since the PDP Geoserver is no longer in service (its function has been replaced by markers in the Met Data Portal), we can discount this usage.
In the CRMP database, CNG is not used directly anywhere.
In pydap-extras, CNG is not used directly anywhere.
In pdp_util, CNG is used:
- in counts.py, to obtain station_ids -- not relevant
- in filters.py, the column vars is used:
- to distinguish climatological from observational variables. This is being addressed in https://github.com/pacificclimate/pdp_util/issues/50
- to match against input-var values in the query parameters. This is relevant.
- in util.py, the function get_stn_list uses CNG in its queries. It accepts two parameters:
- sql_constraints, which is elsewhere passed the filters noted above. This can include a reference to CNG.vars.
- to_select, default [CNG.network_name, CNG.native_id], which could potentially take any value, including CrmpNetworkGeoserver.vars.
- In short, we must preserve vars as it is because client apps (namely, Met Data Portal) depend on it. It can be removed if MDP is updated to submit different
In pdp, CNG is not used directly anywhere in branches master or pcds-only. There is considerable mention of it in documentation, which should be removed.
In station-data-portal-backend, CNG content is served by the endpoint /crmp_network_geoserver. That endpoint is not used by station-data-portal, and IIRC, it was used only manually for experimental and debugging purposes.

Summary:

CV is used only in the view CrmpNetworkGeoserver.
CNG is used only in two places:
- In pdp_util, partially under correction.
- In station-data-portal-backend, for unimportant, experimental purposes.

Conclusions:

We cannot change the definition of CV.vars.
We must add a variable_tags column to CV to support the correction in https://github.com/pacificclimate/pdp_util/issues/50 .
~The part of the query referred to here, where an "identifier" is formed from Variable columns standard_name and cell_method, forms a column that is probably completely unnecessary.~
~That column should be replaced simply by a collection of the Variable id's so that any desired operation (and very likely different from the identifier formation) can be performed. The same is true for the CV column display_names.~
CNG should be updated accordingly, or dropped altogether, according to how https://github.com/pacificclimate/pdp_util/issues/50 is resolved.

jameshiebert commented 9 months ago

Getting rid of code... I like it!

rod-glover commented 9 months ago

Key question: How should we update CV to support the driving use case, pdp_util issue #50 (https://github.com/pacificclimate/pdp_util/issues/50)?

Answer:

The problem in pdp_util is a SQLAlchemy clause of the following form:

or_(CrmpNetworkGeoserver.vars.like("%within%"), CrmpNetworkGeoserver.vars.like("%over%"))

We need to replace it with an expression over all variables for a given (station) history of the form variable_tags(Variable).contains(array(['climatology'])).

CrmpNetworkGeoserver.vars is reproduced directly from CollapsedVariables.vars for each history (history_id). Those vars columns roll up information from all variables related to a given history. That roll-up is more generally replaced by

a list (array) in CV/CNG of all Variable id's
plus a join of Variable against that list (that sounds tricky - research needed).

The same argument applies to CNG.display_names / CV.display_names.

Therefore our work here is simply to replace those columns in CV with an array aggregation of the relevant variable id's.

rod-glover commented 9 months ago

Forming an aggregate of the related variable id's is probably not the best way to do this: Use of that aggregate will always involve de-aggregating (unnest) it and then joining to Variable. But we already have an unaggregated version of this table -- indeed it is used in the definition of CV -- called VarsPerHistory (hereafter VPH). VPH is a simple many:many association of History.history_id to Variable.vars_id. So we'd be better off just using VPH in such a query.
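As a rough illustration of using VPH directly, the sketch below runs the join against an in-memory SQLite database with toy data. It rests on several assumptions: SQLite stands in for PostgreSQL, a single hypothetical tag column on meta_vars stands in for variable_tags(Variable) (which returns an array in the real schema), and the ids and names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Toy stand-ins for the real tables; `tag` is a hypothetical column.
    create table meta_vars (vars_id integer primary key, display_name text, tag text);
    create table vars_per_history_mv (history_id integer, vars_id integer);
    insert into meta_vars values
        (1, 'Toy Observation', 'observation'),
        (2, 'Toy Climatology', 'climatology');
    insert into vars_per_history_mv values (10, 1), (10, 2), (11, 1);
    """
)


def histories_with_tag(conn, tag):
    # Join the unaggregated association to the variables and filter by tag;
    # no aggregate column (and no unnest) is needed.
    rows = conn.execute(
        """
        select distinct vph.history_id
        from vars_per_history_mv as vph
        join meta_vars as v on v.vars_id = vph.vars_id
        where v.tag = ?
        order by vph.history_id
        """,
        (tag,),
    ).fetchall()
    return [history_id for (history_id,) in rows]


climatology_histories = histories_with_tag(conn, "climatology")
```

In this toy data only history 10 has a climatology variable, so it is the only history selected.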

But the existence of CV in addition to VPH implies that forming these aggregates is costly and that it is (was) worth preserving the results in a separate matview. In that case we need to replace the now-useless aggregate column vars with a more useful one such as all_variable_tags, which would aggregate (set/array union) values of variable_tags(Variable) over all variables per history into a single array, which can then be checked for whether it contains a specific value (e.g., 'climatology'). The expression

or_(CrmpNetworkGeoserver.vars.like("%within%"), CrmpNetworkGeoserver.vars.like("%over%"))

would then be replaced directly by

CollapsedVariables.all_variable_tags.contains(array(['climatology']))

Under the assumption that it is still unfeasibly costly to compute the aggregates on the fly, this is the logical and lowest-effort correction.
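The intent of that replacement can be sketched in plain Python, without SQLAlchemy. The function names and toy rows below are hypothetical. The first predicate mirrors the two LIKE patterns as substring tests on the vars text, and the second mirrors the array containment test on all_variable_tags.

```python
def is_climatology_by_vars(vars_text):
    # Mirrors: vars LIKE '%within%' OR vars LIKE '%over%'
    return "within" in vars_text or "over" in vars_text


def is_climatology_by_tags(all_variable_tags):
    # Mirrors: all_variable_tags contains ['climatology']
    return "climatology" in all_variable_tags


# Toy rows: the tag values are assumed, not taken from the database.
toy_rows = [
    {
        "vars": "lwe_thickness_of_precipitation_amountt: sum within months t: mean over years",
        "all_variable_tags": ["climatology"],
    },
    {
        "vars": "lwe_thickness_of_precipitation_amount_sum",
        "all_variable_tags": ["observation"],
    },
]

old_flags = [is_climatology_by_vars(row["vars"]) for row in toy_rows]
new_flags = [is_climatology_by_tags(row["all_variable_tags"]) for row in toy_rows]
```

The two predicates agree on these rows, but the tag-based one no longer depends on the wording of cell_method.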

It is worth checking that assumption, however, as the alternative -- on the fly computations -- would allow us to drop this matview altogether. @jameshiebert , do you have any thoughts about this?

rod-glover commented 9 months ago

A further complication -- the usage of CNG/CV in pdp_util at present seems to require a pre-aggregated form. In order to use an unaggregated version (i.e., VPH), code in pdp_util may have to be revised considerably, which would require effort and risk errors. That alone may drive the decision.

rod-glover commented 9 months ago

At present I'm headed for the lowest-effort version, which is to update CV with a new column all_variable_tags, defined as follows:

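-- One row per distinct (history, tag) pair.
with history_tags as (
select 
    distinct
    history_id, 
    unnest(variable_tags(meta_vars)) as tag
from vars_per_history_mv natural join meta_vars
)
-- Roll the distinct tags up into one sorted array per history.
select
    history_id,
    array_agg(tag order by tag) all_variable_tags
from history_tags
group by history_id
order by history_id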

There is probably a tighter way to do this, but this at least works, and can form the basis for a tighter formulation.
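For readers who want to check the logic outside the database, here is a pure-Python sketch of the same roll-up: take the distinct (history_id, tag) pairs, then collect the tags per history in sorted order. The function name echoes the proposed column; the input shapes and toy data are assumptions.

```python
from collections import defaultdict


def all_variable_tags(pairs, tags_of):
    """pairs: iterable of (history_id, vars_id); tags_of: vars_id -> list of tags."""
    # Equivalent of the CTE: distinct (history_id, tag) pairs after unnesting.
    distinct = {
        (history_id, tag)
        for history_id, vars_id in pairs
        for tag in tags_of[vars_id]
    }
    # Equivalent of array_agg(... order by ...) grouped by history_id.
    by_history = defaultdict(list)
    for history_id, tag in sorted(distinct):
        by_history[history_id].append(tag)
    return dict(by_history)


# Toy data with invented ids.
pairs = [(10, 1), (10, 2), (10, 3), (11, 1)]
tags_of = {1: ["observation"], 2: ["climatology"], 3: ["observation"]}
result = all_variable_tags(pairs, tags_of)
```

History 10 has two observation variables and one climatology variable, so its array holds each tag once.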

rod-glover commented 9 months ago

Here's a variant, somewhat tighter, for CV and its usage.

CTE aggregated_vars is a helper for formulating CV. Each of the aggregated columns in it contains an array of values ordered by vars_id, grouped by history_id. By itself it is an interesting query and may be useful in other contexts.

CTE collapsed_vars_v is the updated version of CV. It collapses columns from aggregated_vars into the final form.

Column all_var_tags is collapsed by flattening the nested arrays of variable tags and selecting only the distinct elements.
Column display_names is collapsed by mapping it to a single string.

Finally we show usage of the new CV in the select following it: Select only those rows (histories) with an associated climatology variable. That is in fact what all the fuss is about.

with aggregated_vars as (
    select
        history_id
        , array_agg(vars_id order by vars_id) as vars_ids
        , array_agg(variable_tags(meta_vars) order by vars_id) as var_tags
        , array_agg(display_name order by vars_id) as display_names
    from vars_per_history_mv natural join meta_vars
    group by history_id
),

collapsed_vars_v as (
    select
        history_id
        , vars_ids
        , array(select distinct * from unnest(var_tags)) as all_var_tags
        , array_to_string(display_names, '|') as display_names
    from aggregated_vars
)

select * 
from collapsed_vars_v
where 'climatology' = any(all_var_tags)

Example query results:

404 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
406 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
407 {429,430,431,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Precipitation Climatology"
409 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
410 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
412 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
...
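The two-stage shape of that query (aggregate per history, then collapse) can be mirrored in plain Python. This is only a sketch on invented toy rows, with hypothetical names. One deliberate difference: the SQL "select distinct" does not guarantee an order for all_var_tags, whereas this sketch sorts the tags so the output is deterministic.

```python
def collapse(rows):
    """rows: iterable of (history_id, vars_id, tags, display_name)."""
    # Stage 1 (aggregated_vars): per-history lists ordered by vars_id.
    grouped = {}
    for history_id, vars_id, tags, display_name in sorted(
        rows, key=lambda r: (r[0], r[1])
    ):
        entry = grouped.setdefault(
            history_id, {"vars_ids": [], "var_tags": [], "display_names": []}
        )
        entry["vars_ids"].append(vars_id)
        entry["var_tags"].append(tags)
        entry["display_names"].append(display_name)

    # Stage 2 (collapsed_vars_v): flatten and de-duplicate the tags,
    # and join the display names into a single '|'-separated string.
    collapsed = {}
    for history_id, agg in grouped.items():
        flat_tags = [tag for tags in agg["var_tags"] for tag in tags]
        collapsed[history_id] = {
            "vars_ids": agg["vars_ids"],
            "all_var_tags": sorted(set(flat_tags)),
            "display_names": "|".join(agg["display_names"]),
        }
    return collapsed


toy_rows = [
    (10, 2, ["climatology"], "Toy Climatology"),
    (10, 1, ["observation"], "Toy Observation"),
    (11, 1, ["observation"], "Toy Observation"),
]
collapsed = collapse(toy_rows)

# Equivalent of: where 'climatology' = any(all_var_tags)
with_climatology = [
    history_id
    for history_id, row in collapsed.items()
    if "climatology" in row["all_var_tags"]
]
```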

rod-glover commented 9 months ago

Executing these queries against the full CRMP database is not very time-consuming, < 1 s. It seems as if the maintenance of CV as a matview might be unnecessary, and it could be packaged as a view.

I'm also planning to define the uncollapsed query aggregated_vars as a view, since it may have some utility in future, and hiding it as a CTE inside another query seems counterproductive.

rod-glover commented 9 months ago

To summarize:

Add view aggregated_vars as above.
Update matview collapsed_vars_mv as above.
Update view crmp_network_geoserver as described below.
It's not clear that the column display_names has any current use, but it costs little to leave it in place for now and it could avoid some trouble.

Updated query for crmp_network_geoserver:

 SELECT 
    ...
    collapsed_vars_mv.all_var_tags,
    collapsed_vars_mv.display_names
   FROM ...
  WHERE ...

The only change is to replace collapsed_vars_mv.vars with collapsed_vars_mv.all_var_tags.

Migrating CRMP is a fairly big undertaking, even using the pared-down process. As a temporary expedient, prior to applying the migration(s) that implement the above changes, we could do all the things above manually. This makes me nervous, but in the service of solving a serious problem quickly, it is worth considering.

rod-glover commented 9 months ago

On further thought, there's no current need to establish aggregated_vars as an independently standing view. If it turns out it would be useful, we can extract it as a view later. Less overhead.

pacificclimate / pycds

Query for materialized view `collapsed_vars_mv` is likely wrong #180