snowplow-incubator / snowplow-bigquery-loader

Loads Snowplow enriched events into Google BigQuery
21 stars 15 forks source link

Support GEOGRAPHY types #7

Open miike opened 6 years ago

miike commented 6 years ago

Given that BigQuery now has some geography support does it make sense to convert the latitude / longitude columns to a single GEOGRAPHY column of type ST_GEOGPOINT?

chuwy commented 6 years ago

I guess BigQuery added this type very recently (it is also in beta status), because I noticed it just last week. I definitely believe we do need to support it.

Just converting geo_latitude/geo_longitude fields in enriched event into new GEOGRAPHY column would not be a generic enough solution (although this is what we do for Elasticsearch) - we want to support this type across many other schemas (com.snowplowanalytics.snowplow/geolocation_context, com.clearbit.enrichment/person, com.mandrill/message_opened and many other schemas from Iglu Central contain geo information). Also, this sounds like a common enough task - many DBs can support different kinds of custom types.

I see it as some kind of "Loader enrichment" which transforms complex JSON types into DB's native, i.e. we configure that in iglu:com.acme/event/jsonschema/1-*-*:#property/lat is latitude and iglu:com.acme/event/jsonschema/1-*-*:#property/long is longitude and Loader (BigQuery or other) transforms it into native X:Y. But main questions is where should we encode this? I don't want to encode this in JSON Schema because:

On the other hard we can add it to configuration of loader, but we're mixing concerns here: analysts should not ask DevOps to change configuration every time they've added new schema with geo data. Also, we don't want to restart the job just to reload this minor piece of configuration.

Last option is to try to make Loader smart enough to figure out these columns by itself:

  1. Schema contains both "lat"/"lon" or "latitude"/"longitude" properties
  2. Both have type number
  3. Both have appropriate constraints (minimum/maximum)

So far I prefer this option, its main disadvantage is that it is slightly implicit and magical. What if 3rd party schema encodes geo data with some completely unexpected properties, such as lt/ln? What's the chance that we'll turn into GEOGRAPHY something that is not geo data really?

miike commented 6 years ago

Agreed - it feels awkward putting logic that is essentially transformations into a JSON schema but you've raised the excellent point that haven't this hard coded in to the codebase makes this tricky to maintain from an analyst point of view.

The magic option sounds desirable but it still feels like it needs some sort of high level configuration so whatever type inferences we are making don't incorrectly cast objects (whether that's GEOGRAPHY or another type in the future).

Does having a system that allows UDFs make sense in terms of striking a balance between flexibility and configuration? e.g., something like the Javascript enrichment for transformation - though it wouldn't necessarily need to be Javascript.