bigquery: Consider flattening data structures

kernelci / kcidb

kernelci.org common database tools

GNU General Public License v2.0


bigquery: Consider flattening data structures #212

Open spbnick opened 3 years ago

spbnick commented 3 years ago

Right now we're using RECORD types in our BigQuery schema, and we also use arrays. However, that doesn't work well with all BigQuery features. For example, materialized views don't support structure (RECORD) arguments in ANY_VALUE aggregations, and arrays, which don't support equality operations, don't work with UNION DISTINCT.

Consider flattening our data structures into JSON fields in BigQuery, similarly to what we do in SQLite. We can still extract the data from them server-side with JSON functions (although not so efficiently).
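To illustrate the idea, here is a minimal sketch of storing a nested value as a JSON string and still reading parts of it server-side. It uses Python's built-in `sqlite3` module as a stand-in, and assumes the SQLite build has its JSON functions enabled. The table layout, field names, and URLs are hypothetical, not the actual kcidb schema.

```python
import json
import sqlite3

# Hypothetical nested row: a build with a list of attached files.
build = {
    "id": "origin:1",
    "output_files": [
        {"name": "kernel.log", "url": "https://example.com/kernel.log"},
        {"name": "config", "url": "https://example.com/config"},
    ],
}

conn = sqlite3.connect(":memory:")
# The nested list is kept in a plain TEXT column holding JSON.
conn.execute("CREATE TABLE builds (id TEXT PRIMARY KEY, output_files TEXT)")
conn.execute(
    "INSERT INTO builds VALUES (?, ?)",
    (build["id"], json.dumps(build["output_files"])),
)

# Server-side extraction: the database parses the JSON on every call,
# which is the efficiency cost mentioned above.
first_name = conn.execute(
    "SELECT json_extract(output_files, '$[0].name') FROM builds"
).fetchone()[0]

# Client-side unpacking restores the original structure.
stored = conn.execute("SELECT output_files FROM builds").fetchone()[0]
round_tripped = json.loads(stored)
```

BigQuery has its own JSON functions for the same purpose, so the query text would differ there, but the shape of the approach would be similar.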

Forked off #183.

spbnick commented 2 years ago

We need to find out whether we should actually do this. For example, we might find that a flattened structure would make some (existing or near-future) queries difficult or impossible. Or we might find, after flattening the structure, that other issues still prevent the use of materialized views and UNION DISTINCT. Or, finally, we might find that materialized views don't bring much of a performance benefit and that we can survive without UNION DISTINCT, making the whole effort not worth it. We need to answer these questions (and hunt for more pros and cons) before we seriously start implementing this.

One argument for this change, at least to some extent, could be the lack of feature parity between BigQuery and e.g. PostgreSQL with regard to structures and arrays in table columns. If we find out we need PostgreSQL support (#209) after all, having flattened fields in both might make the whole implementation simpler (due to more uniformity). We shouldn't forget about the SQLite driver either, which is very low on features too, so maybe that alone would be enough of a reason.

spbnick commented 2 years ago

If we figure out we need this after all, here are the possible steps for implementing this:

- Change RECORD and ARRAY column types to STRING holding JSON.
- Change BigQuery driver to work with the new types: packing/unpacking JSON on loading/querying, including the OO queries. Make use of BigQuery JSON functions/queries where appropriate.
- Switch to creating materialized views for de-duplication in the BigQuery driver.
- Switch from using UNION to using UNION DISTINCT in BigQuery driver queries.
- Implement safe and automatic upgrade of the old schema to the new one, on deployment (pending #262), if possible.
- Port Grafana dashboards to the new schema.
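The packing/unpacking step and the de-duplication it enables could look roughly like the sketch below. This is only an illustration under assumptions: the helpers `pack`, `flatten_row`, and `unflatten_row` are hypothetical names, the sample object is made up, and SQLite's `UNION` (which removes duplicates) stands in for BigQuery's `UNION DISTINCT`. One point it highlights is that string comparison only de-duplicates correctly if the JSON is serialized canonically, e.g. with sorted keys.

```python
import json
import sqlite3


def pack(value):
    """Serialize a nested value (dict or list) to a canonical JSON string."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def flatten_row(row):
    """Return a copy of row with dict/list fields replaced by JSON strings."""
    return {
        key: pack(value) if isinstance(value, (dict, list)) else value
        for key, value in row.items()
    }


def unflatten_row(row, json_fields):
    """Return a copy of row with the named JSON string fields decoded."""
    return {
        key: json.loads(value) if key in json_fields and value is not None else value
        for key, value in row.items()
    }


# Two copies of the same hypothetical object, with keys in different order.
a = {"id": "origin:1", "misc": {"arch": "x86_64", "jobs": 4}}
b = {"id": "origin:1", "misc": {"jobs": 4, "arch": "x86_64"}}

flat_a, flat_b = flatten_row(a), flatten_row(b)

# With plain strings in the columns, a duplicate-removing union works,
# because the database only has to compare text values.
conn = sqlite3.connect(":memory:")
rows = conn.execute(
    "SELECT ? AS id, ? AS misc UNION SELECT ?, ?",
    (flat_a["id"], flat_a["misc"], flat_b["id"], flat_b["misc"]),
).fetchall()
```

Here `rows` holds a single row, since both copies serialize to the same string. Without `sort_keys=True` the two `misc` values would differ as text and both rows would survive the union.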