Closed pboothe closed 4 years ago
I think this is a more fundamental problem. There are three operations that we need to work:

1. Unmarshaling the JSON data into a Go struct.
2. Inferring the BigQuery schema from a Go struct.
3. Loading data from a Go struct into BigQuery.
In the case of JSON data that encodes a `map[string]string`, we need an "Unmarshal" target struct that includes a `map[string]string` type. While `bigquery.InferSchema` normally fails for map types, we can add a field tag that tells bigquery to ignore that field when inferring the schema. We can also "fix" the schema returned by `bigquery.InferSchema` so that it includes equivalent fields for the map data as an array, e.g. `[]NameValue`. This could address the first two operations.
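Fixing the schema also means converting the map data itself at insert time. A minimal stdlib-only sketch of that conversion (the `NameValue` shape is from this thread; the `toNameValues` helper is an invented name, not a bigquery API):

```go
package main

import (
	"fmt"
	"sort"
)

// NameValue is the repeated-record stand-in for one map entry.
type NameValue struct {
	Name  string
	Value string
}

// toNameValues is a hypothetical helper that flattens a map into the
// schema-friendly slice form. Keys are sorted so output is deterministic.
func toNameValues(m map[string]string) []NameValue {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := make([]NameValue, 0, len(m))
	for _, k := range keys {
		out = append(out, NameValue{Name: k, Value: m[k]})
	}
	return out
}

func main() {
	fmt.Println(toNameValues(map[string]string{"b": "2", "a": "1"}))
	// prints [{a 1} {b 2}]
}
```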
However, the third operation needs a source struct for loading data into BigQuery that has the same fields as the schema.
As proposed for the NDTResult struct, we would need to do one of the following:

1. Maintain two parallel structs: one with `map[string]string` for JSON, one with `[]NameValue` for BigQuery.
2. Maintain both representations in a single struct, using field tags to select the right field for each encoding:

```go
type ArchivalData struct {
	...
	ClientMetadataJSON map[string]string `bigquery:"-" json:"ClientMetadata"`
	ClientMetadataBQ   []NameValue       `bigquery:"ClientMetadata" json:"-"`
}
```

3. Sidestep the problem entirely: no `map` types allowed.
4. Create our own version of `InferSchema` that automatically translates (for example) `map` types to `[]NameValue` types. And, create our own `InferRow` (or similar name) that outputs a dynamic `bigquery.Value` map from the native struct type that's compatible with the schema. `InferRow` replaces the third operation type.

So, 2) is a least-bad version of 1), but it's still a little awkward.
3) may not be possible in all cases so would probably be a temporary solution, but it would make many things simpler until we really can't do that.
4) requires new code that in large part would duplicate the existing bq inference logic, and the `InferRow` logic could have performance implications.
I think we want to take this two steps further, and use it to generate documentation too. It should include a short description (e.g. 50 chars) and a long description (a full paragraph). It also needs some minimal mechanism to manage scope and version skew: per-item tags to include or exclude in certain contexts (e.g. columns that are added for some experiments but not others), and a way to manage changes in definitions (e.g. if there is a bug fix in an instrument stored in the raw data, it should be possible to document both versions, even if the name does not change).
The master needs to be in something like YAML....
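For concreteness, a hypothetical sketch of one entry in such a YAML master file. Every field name and tag here is invented to illustrate the short/long description and scope/version ideas, not an existing format:

```yaml
# Hypothetical master schema-doc entry; all names and tags are illustrative.
- name: ClientMetadata
  short: "Client-reported name/value metadata"   # ~50 chars
  long: >
    A full paragraph describing the field, its provenance, units, and
    any caveats about interpretation.
  include: [experiment-a, experiment-b]   # scope tags: who emits this column
  versions:
    - since: "2019-01-01"
      note: "Original definition."
    - since: "2019-06-01"
      note: "Instrument bug fix; same name, changed meaning."
```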
Yes, we do want to be able to automatically generate the "faithful" schema documentation from source annotations. Very good reminder.
Quick thoughts on maps: Do we already have JSON-encoded maps in our data? Looks like the Go BQ API doesn't support maps: "InferSchema returns an error if any of the examined fields is of type uint, uint64, uintptr, map, interface, complex64, complex128, func, or chan. Future versions may handle these cases without error." I can't find any BQ docs discussing a map type, so guessing BQ doesn't have such a thing.
Can we just avoid `map[string]string`, and instead use a slice of `struct{ Key, Value string }`? It means a little more work if we actually need a map in the code, but it avoids messy parts of bigquery.
Also, when we have key:value pairs, I think we should try to handle all standard keys as actual named columns, and only handle unrecognized keys in a generic manner.
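A sketch of that split, with a hypothetical `fromMap` helper and invented key names: standard keys become real named columns, and only unrecognized keys fall through to a generic name/value list:

```go
package main

import "fmt"

// NameValue is the generic name/value pair form discussed in this thread.
type NameValue struct{ Name, Value string }

// ClientMeta is a hypothetical example struct: standard keys get actual
// named columns; anything unrecognized lands in Extra.
type ClientMeta struct {
	OSName    string
	OSVersion string
	Extra     []NameValue
}

// fromMap routes known keys to named fields and the rest to Extra.
func fromMap(m map[string]string) ClientMeta {
	var c ClientMeta
	for k, v := range m {
		switch k {
		case "os_name":
			c.OSName = v
		case "os_version":
			c.OSVersion = v
		default:
			c.Extra = append(c.Extra, NameValue{k, v})
		}
	}
	return c
}

func main() {
	c := fromMap(map[string]string{"os_name": "linux", "favorite_color": "blue"})
	fmt.Println(c.OSName, len(c.Extra))
	// prints: linux 1
}
```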
Regarding schema inference: I have so far been unsuccessful in creating a Go struct from a reflection-based schema, though I haven't inquired on go-nuts, so maybe it is possible.
For the NDT web100 snapshots, I wrote code that generates `map[string]Value` directly from the snapshot data. This was efficient, but resulted in ugly code that is hard to work with, so I'm inclined to avoid that as much as possible for M-Lab 2.0.
I initially hoped to do moderately flexible schema inference, but based on my experience in Q2, and the comments in this issue, I don't think that is practical either. I think we are just going to end up with different parser versions that will have to evolve as the raw data and BQ schemas evolve over time. So perhaps we should put our effort into working out how to make that evolution (and co-existence of different variants) as easy as possible.
BTW, I've only been thinking about Go structs -> BQ inserts. I have no idea how to go about supporting BQ rows -> Go structs, and I suspect it is a bad idea. I think once data is in BQ, we should only access it using SQL. So, I don't understand the parts of Peter's initial description that talk about dead data. BQ will happily accept rows that are missing schema elements. It only chokes if you give it structs with fields that don't exist in the schema, and we can already handle that with field tags in the Go structs.
Don't add mirror structs. Instead, make the schema generator better.
Eventually, the schema generator will need to be smart enough to know to generate fields for dead data. I suggest that we adopt a solution that looks like:
To drop the bad1 and bad2 fields from a struct, change them from normal typed fields into annotated zero-size placeholder fields. That way the name stays there, but it is changed to being a zero-size data element. The schema parser will need to be upgraded to support the relevant annotation.
(this came out of a discussion with @stephen-soltesz , but all flames should be directed at me and not him, as he didn't actually care that much about this issue, while I care a lot)