pacificclimate / pycds

An ORM representation of the PCDS and CRMP database
GNU General Public License v3.0
0 stars 0 forks source link

Query for materialized view `collapsed_vars_mv` is likely wrong #180

Closed rod-glover closed 8 months ago

rod-glover commented 1 year ago

This matview now contains rows with column vars with almost certainly undesirable values such as

lwe_thickness_of_precipitation_amountt: sum within months t: mean over years, ...

This is probably due to an unanticipated (but valid) change in the content of the cell_method column for some variables, which means the query for this view/matview is wrong.

At present, he conversion of each contributing cell_method column is expressed in SQLAlchemy by func.regexp_replace(Variable.cell_method, "time: ", "_", "g"). The intention of that replacement is apparently to create a conventional programming-language identifier. The obvious generalization would be to allow for any dimension preceding a colon (so, e.g., "t:"), and to replace all spaces with underscores. That generalization would still not handle all cases, but it would handle many more; it might also produce unfeasibly long identifiers.

Investigate, determine correct query, and create a migration to update the matview.

rod-glover commented 9 months ago

Let's consider what CollapsedVariables (collapsed_vars_mv, hereafter CV) is for.

In turn, let's consider what CrmpNetworkGeoserver (crmp_network_geoserver, herafter CNG) is for.

Summary:

Conclusions:

jameshiebert commented 9 months ago

Getting rid of code... I like it!

rod-glover commented 9 months ago

Key question: How should we update CV to support the driving use case, #50?

Answer:

The problem in pdp_util is a SQLAlchemy clause of the following form:

or_(CrmpNetworkGeoserver.vars.like("%within%"), CrmpNetworkGeoserver.vars.like("%over%"))

We need to replace it with an expression over all variables for a given (station) history of the form variable_tags(Variable).contains(array(['climatology'])).

CrmpNetworkGeoserver.vars is reproduced directly from CollapsedVariables.vars for each history (history_id). Those vars columns roll up information from all variables related to a given history. That roll-up is more generally replaced by

The same argument applies to CNG.display_names / CV.display_names.

Therefore our work here is simply to replace those columns in CV with an array aggregation of the relevant variable id's.

rod-glover commented 9 months ago

Forming an aggregate of the related variable id's is probably not the best way to do this: Use of that aggregate will always involve de-aggregating (unnest) it and then joining to Variable. But we already have an unaggregated version of this table -- indeed it is used in the definition of CV -- called VarsPerHistory (hereafter VPH). VPH is a simple many:many association of History.history_id to Variable.vars_id. So we'd be better off just using VPH in such a query.

But the existence of CV in addition to VPH implies that forming these aggregates is costly and that it is (was) worth preserving the results in a separate matview. In that case we need to replace the now-useless aggregate column vars with a more useful one such as all_variable_tags, which would aggregate (set/array union) values of variable_tags(Variable) over all variables per history into a single array, which can then be checked for whether it contains a specific value (e.g., 'climatology'). The expression

or_(CrmpNetworkGeoserver.vars.like("%within%"), CrmpNetworkGeoserver.vars.like("%over%"))

would then be replaced directly by

CollapsedVariables.all_variable_tags.contains(array(['climatology']))

Under the assumption that it is still unfeasibly costly to compute the aggregates on the fly, this is the logical and lowest-effort correction.

It is worth checking that assumption, however, as the alternative -- on the fly computations -- would allow us to drop this matview altogether. @jameshiebert , do you have any thoughts about this?

rod-glover commented 9 months ago

A further complication -- the usage of CNG/CV in pdp_util at present seems to require a pre-aggregated form. In order to use an unaggregated version (i.e., VPH), code in pdp_util may have to be revised considerably, which would require effort and risk errors. That alone may drive the decision.

rod-glover commented 9 months ago

At present I'm headed for the lowest-effort version, which is to update CV with a new column all_variable_tags, defined as follows:

with xxx as (
select 
    distinct
    history_id, 
    unnest(variable_tags(meta_vars)) as a
from vars_per_history_mv natural join meta_vars
)
select
    history_id,
    array_agg(a order by a) all_variable_tags
from xxx
group by history_id
order by history_id

There is probably a tighter way to do this, but this at least works, and can form the basis for a tighter formulation.

rod-glover commented 9 months ago

Here's a variant, somewhat tighter, for CV and its usage.

CTE aggregated_vars is a helper for formulating CV. Each of the aggregated columns in it contains an array of values ordered by vars_id, grouped by history_id. By itself it is an interesting query and may be useful in other contexts.

CTE collapsed_vars_v is the updated version of CV. It collapses columns from collapsed_vars_array_v into the final form.

Finally we show usage of the new CV in the select following it: Select only those rows (histories) with an associated climatology variable. That is in fact what all the fuss is about.

with aggregated_vars as (
    select
        history_id
        , array_agg(vars_id order by vars_id) as vars_ids
        , array_agg(variable_tags(meta_vars) order by vars_id) as var_tags
        , array_agg(display_name order by vars_id) as display_names
    from vars_per_history_mv natural join meta_vars
    group by history_id
),

collapsed_vars_v as (
    select
        history_id
        , vars_ids
        , array(select distinct * from unnest(var_tags)) as all_var_tags
        , array_to_string(display_names, '|') as display_names
    from aggregated_vars
)

select * 
from collapsed_vars_v
where 'climatology' = any(all_var_tags)

Example query results:

404 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
406 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
407 {429,430,431,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Precipitation Climatology"
409 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
410 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
412 {429,430,431,432,559}   {observation,climatology}   "Precipitation Amount|Rainfall Amount|Snowfall Amount|Surface Snow Depth (Point)|Precipitation Climatology"
...
rod-glover commented 9 months ago

Executing these queries against the full CRMP database is not very time-consuming, < 1 s. It seems as if the maintenance of CV as a matview might be unnecessary, and it could be packaged as a view.

I'm also planning to define the uncollapsed query aggregated_vars as a view, since it may have some utility in future, and hiding it as a CTE inside another query seems counterproductive.

rod-glover commented 9 months ago

To summarize:

Updated query for crmp_network_geoserver:

 SELECT 
    ...
    collapsed_vars_mv.all_var_tags,
    collapsed_vars_mv.display_names
   FROM ...
  WHERE ...

The only change is to replace collapsed_vars_mv.vars with collapsed_vars_mv.all_var_tags.

Migrating CRMP is a fairly big undertaking, even using the pared-down process. As a temporary expedient, prior to applying the migration(s) that implement the above changes, we could do all the things above manually. This makes me nervous, but in the service of solving a serious problem quickly, it is worth considering.

rod-glover commented 9 months ago

On further thought, there's no current need to establish aggregated_vars as an independently standing view. If it turns out it would be useful, we can extract it as a view later. Less overhead.