Open miike opened 6 years ago
I guess BigQuery added this type very recently (it is also in beta status), because I noticed it just last week. I definitely believe we do need to support it.
Just converting geo_latitude
/geo_longitude
fields in enriched event into new GEOGRAPHY
column would not be a generic enough solution (although this is what we do for Elasticsearch) - we want to support this type across many other schemas (com.snowplowanalytics.snowplow/geolocation_context
, com.clearbit.enrichment/person
, com.mandrill/message_opened
and many other schemas from Iglu Central contain geo information). Also, this sounds like a common enough task - many DBs can support different kinds of custom types.
I see it as some kind of "Loader enrichment" which transforms complex JSON types into DB's native, i.e. we configure that in iglu:com.acme/event/jsonschema/1-*-*:#property/lat
is latitude and iglu:com.acme/event/jsonschema/1-*-*:#property/long
is longitude and Loader (BigQuery or other) transforms it into native X:Y
. But main questions is where should we encode this? I don't want to encode this in JSON Schema because:
On the other hard we can add it to configuration of loader, but we're mixing concerns here: analysts should not ask DevOps to change configuration every time they've added new schema with geo data. Also, we don't want to restart the job just to reload this minor piece of configuration.
Last option is to try to make Loader smart enough to figure out these columns by itself:
number
minimum
/maximum
)So far I prefer this option, its main disadvantage is that it is slightly implicit and magical. What if 3rd party schema encodes geo data with some completely unexpected properties, such as lt
/ln
? What's the chance that we'll turn into GEOGRAPHY
something that is not geo data really?
Agreed - it feels awkward putting logic that is essentially transformations into a JSON schema but you've raised the excellent point that haven't this hard coded in to the codebase makes this tricky to maintain from an analyst point of view.
The magic option sounds desirable but it still feels like it needs some sort of high level configuration so whatever type inferences we are making don't incorrectly cast objects (whether that's GEOGRAPHY or another type in the future).
Does having a system that allows UDFs make sense in terms of striking a balance between flexibility and configuration? e.g., something like the Javascript enrichment for transformation - though it wouldn't necessarily need to be Javascript.
Given that BigQuery now has some geography support does it make sense to convert the latitude / longitude columns to a single
GEOGRAPHY
column of type ST_GEOGPOINT?