m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

Don't add mirror structs #719

Closed. pboothe closed this issue 4 years ago

pboothe commented 5 years ago

Don't add mirror structs. Instead, make the schema generator better.

Eventually, the schema generator will need to be smart enough to generate fields for dead data. I suggest we adopt a solution that looks like this:

To drop the bad1 and bad2 fields from the struct below, change from:

type Data struct {
  id string
  bad1, bad2 [1000000]byte
}

to

type Data struct {
  id string
  bad1 struct{} `schematype:"[1000000]byte"`
  bad2 struct{} `schematype:"[1000000]byte"`
}

That way the field name stays in place, but it becomes a zero-size data element. The schema parser will need to be upgraded to support the relevant annotation.
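
A minimal sketch of what that parser upgrade could look like, assuming the annotation is written in the conventional key:"value" struct-tag form so the reflect package can read it. The helper name and output format below are made up for illustration; this is not the repo's actual generator code.

package main

import (
  "fmt"
  "reflect"
)

type Data struct {
  id   string
  bad1 struct{} `schematype:"[1000000]byte"`
  bad2 struct{} `schematype:"[1000000]byte"`
}

// describeFields walks the struct and reports the type each field should
// contribute to the schema, preferring the schematype annotation over the
// zero-size placeholder type when one is present.
func describeFields(v interface{}) {
  t := reflect.TypeOf(v)
  for i := 0; i < t.NumField(); i++ {
    f := t.Field(i)
    typ := f.Type.String()
    if override, ok := f.Tag.Lookup("schematype"); ok {
      typ = override
    }
    fmt.Printf("%s -> %s\n", f.Name, typ)
  }
}

func main() {
  describeFields(Data{})
  // Prints:
  //   id -> string
  //   bad1 -> [1000000]byte
  //   bad2 -> [1000000]byte
}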

(This came out of a discussion with @stephen-soltesz, but all flames should be directed at me and not him; he didn't actually care that much about this issue, while I care a lot.)

stephen-soltesz commented 5 years ago

I think this is a more fundamental problem. There are three operations that we need to make work:

  1. Loading archival data into a struct -- i.e. parsing serialized JSON into a struct with all required fields
  2. Inferring a BigQuery schema from a struct
  3. Loading data into BigQuery from a struct

In the case of JSON data that encodes a map[string]string, we need an "Unmarshal" target struct that includes a map[string]string type. While bigquery.InferSchema normally fails for map types, we can add a field tag that tells bigquery to ignore that field when inferring the schema. We can also "fix" the schema returned by bigquery.InferSchema so that it includes equivalent fields for the map data as an array, e.g. []NameValue. This could address the first two operations.
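
As a concrete sketch of those first two operations: ArchivalData below is a cut-down stand-in, not the real struct, but the bigquery:"-" tag and the schema patch use only the public cloud.google.com/go/bigquery API.

package sketch

import "cloud.google.com/go/bigquery"

// NameValue is the repeated-record equivalent of one map entry.
type NameValue struct {
  Name  string
  Value string
}

// ArchivalData is a cut-down stand-in: the map field is the JSON "Unmarshal"
// target, and the bigquery:"-" tag hides it from InferSchema so that
// inference does not fail on the map type.
type ArchivalData struct {
  UUID           string
  ClientMetadata map[string]string `bigquery:"-"`
}

// inferWithMetadata infers the schema from the struct, then splices in a
// repeated NameValue record in place of the hidden map field.
func inferWithMetadata() (bigquery.Schema, error) {
  schema, err := bigquery.InferSchema(ArchivalData{})
  if err != nil {
    return nil, err
  }
  nv, err := bigquery.InferSchema(NameValue{})
  if err != nil {
    return nil, err
  }
  return append(schema, &bigquery.FieldSchema{
    Name:     "ClientMetadata",
    Repeated: true,
    Type:     bigquery.RecordFieldType,
    Schema:   nv,
  }), nil
}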

However, the third operation needs a source struct for loading data into BigQuery that has the same fields as the schema.

As proposed for the NDTResult struct, we would need to do one of the following:

  1. Create full mirror structs to capture the two metadata field types: map[string]string for JSON, []NameValue for BigQuery.
  2. Create "mirror fields" in the native struct, one used only for JSON, the other used only for BigQuery, where the parser copies data from one to the other. For example:
    type ArchivalData struct {
      ...
      ClientMetadataJSON map[string]string `bigquery:"-" json:"ClientMetadata"`
      ClientMetadataBQ   []NameValue       `bigquery:"ClientMetadata" json:"-"`
    }
  3. Create the native struct so that it is compatible with bigquery.InferSchema from the beginning, i.e. no map types allowed.
  4. Create our own InferSchema that automatically translates (for example) map types to []NameValue types. And, create our own InferRow (or similar name) that outputs a dynamic bigquery.Value map from the native struct type that's compatible with the schema. InferRow replaces the third operation above.

So, 2) is a least-bad version of 1), but it's still a little awkward. 3) may not be possible in all cases, so it would probably be a temporary solution, but it would make many things simpler until we really can't do it anymore. 4) requires new code that would in large part duplicate the existing bq inference logic, and the InferRow logic could have performance implications.
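
For what a version of 4) could look like without writing a full InferRow, here is a sketch using the bigquery.ValueSaver interface (same cut-down ArchivalData as above, repeated so the fragment stands alone; illustrative only, not code from this repo). The hand-written field list in Save is exactly the part a reflection-based InferRow would generate, and the per-row translation is where the performance cost would show up.

package sketch

import "cloud.google.com/go/bigquery"

type ArchivalData struct {
  UUID           string
  ClientMetadata map[string]string `bigquery:"-"`
}

// Save implements bigquery.ValueSaver: the map field is converted into
// repeated Name/Value records at upload time, so the row matches the patched
// schema and no mirror struct or mirror field is needed.
func (d *ArchivalData) Save() (map[string]bigquery.Value, string, error) {
  meta := make([]bigquery.Value, 0, len(d.ClientMetadata))
  for name, value := range d.ClientMetadata {
    meta = append(meta, map[string]bigquery.Value{"Name": name, "Value": value})
  }
  return map[string]bigquery.Value{
    "UUID":           d.UUID,
    "ClientMetadata": meta,
  }, "", nil // insertID left empty in this sketch
}

An Inserter.Put call then accepts *ArchivalData directly, with no second struct definition to keep in sync.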

mattmathis commented 5 years ago

I think we want to take this two steps further and use it to generate documentation too. The annotations should include a short description (e.g. 50 chars) and a long description (a full paragraph) for each field. They also need some minimal mechanism to manage scope and version skew: per-item tags to include or exclude fields in certain contexts (e.g. columns that are added for some experiments but not others) and to manage any changes in definitions (e.g. if there is a bug fix in an instrument stored in the raw data, it should be possible to document both versions, even if the name does not change).

The master copy needs to be in something like YAML...
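
Whatever format the master copy ends up in, here is a rough illustration of per-field annotations as Go struct tags that a generator could read; the tag keys (desc, longdesc, experiments) are hypothetical, not an agreed convention, and the same walk could just as easily emit YAML.

package sketch

import (
  "fmt"
  "reflect"
)

// Example carries hypothetical documentation tags on an illustrative field.
type Example struct {
  MinRTT int64 `desc:"Minimum RTT in microseconds" longdesc:"Full paragraph: definition, units, and any version-specific caveats." experiments:"exp1,exp2"`
}

// printDocs shows how a generator could read the annotations and render them
// as documentation (or YAML, or BigQuery field descriptions).
func printDocs(v interface{}) {
  t := reflect.TypeOf(v)
  for i := 0; i < t.NumField(); i++ {
    f := t.Field(i)
    fmt.Printf("%s (%s): %s\n  %s\n",
      f.Name, f.Tag.Get("experiments"), f.Tag.Get("desc"), f.Tag.Get("longdesc"))
  }
}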

stephen-soltesz commented 5 years ago

Yes, we do want to be able to automatically generate the "faithful" schema documentation from source annotations. Very good reminder.

gfr10598 commented 5 years ago

Quick thoughts on maps: Do we already have JSON-encoded maps in our data? It looks like the Go BQ API doesn't support maps: "InferSchema returns an error if any of the examined fields is of type uint, uint64, uintptr, map, interface, complex64, complex128, func, or chan. Future versions may handle these cases without error." I can't find any BQ docs discussing a map type, so I'm guessing BQ doesn't have such a thing.

Can we just avoid map[string]string, and instead use a slice of struct{ Key, Value string }? It means a little more work if we actually need a map in the code, but it avoids the messy parts of bigquery.

Also, when we have key:value pairs, I think we should try to handle all standard keys as actual named columns, and only handle unrecognized keys in a generic manner.
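
A small sketch of that split, assuming the metadata arrives as a map and with invented key names: recognized keys are promoted to named columns, and everything else falls through to a generic NameValue list.

package sketch

// NameValue holds one unrecognized metadata entry.
type NameValue struct {
  Name  string
  Value string
}

// ClientMeta promotes well-known keys to real columns and keeps the rest
// in a generic overflow list.
type ClientMeta struct {
  OSName        string      // promoted from the invented "os_name" key
  ClientLibrary string      // promoted from the invented "client_library" key
  Other         []NameValue // everything we don't recognize
}

func splitMeta(raw map[string]string) ClientMeta {
  var out ClientMeta
  for k, v := range raw {
    switch k {
    case "os_name":
      out.OSName = v
    case "client_library":
      out.ClientLibrary = v
    default:
      out.Other = append(out.Other, NameValue{Name: k, Value: v})
    }
  }
  return out
}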

gfr10598 commented 5 years ago

Regarding schema inference: I have so far been unsuccessful in creating a Go struct from a reflection-based schema, though I haven't inquired on go-nuts, so maybe it is possible.

For the NDT web100 snapshots, I wrote code that generates map[string]Value directly from the snapshot data. This was efficient, but it resulted in ugly code that is hard to work with, so I'm inclined to avoid that as much as possible for M-Lab 2.0.

I initially hoped to do moderately flexible schema inference, but based on my experience in Q2, and the comments in this issue, I don't think that is practical either. I think we are just going to end up with different parser versions that will have to evolve as the raw data and BQ schemas evolve over time. So perhaps we should put our effort into working out how to make that evolution (and co-existence of different variants) as easy as possible.

BTW, I've only been thinking about Go structs -> BQ inserts. I have no idea how to go about supporting BQ rows -> Go structs, and I suspect it is a bad idea. I think once data is in BQ, we should only access it using SQL. So, I don't understand the parts of Peter's initial description that talk about dead data. BQ will happily accept rows that are missing schema elements. It only chokes if you give it structs with fields that don't exist in the schema, and we can already handle that with field tags in the Go structs.