googleapis / google-cloud-go

Google Cloud Client Libraries for Go.
https://cloud.google.com/go/docs/reference
Apache License 2.0
3.72k stars 1.27k forks source link

bigquery: support schema inference of nested structs #238

Closed nightlyone closed 7 years ago

nightlyone commented 8 years ago

As requested by @mcgreevy I excluded support of nested structs from #213 and leave it open for discussion and possibly later implementation.

schmohlio commented 7 years ago

I have pretty simple, clean solution for creating Schemas from nested map[string]interface{} objects that created are from Json.Unmarshal. It includes "merging" and coalescing nested repeated structs, which I didn't see accounted for in the patch. Switches are used without reflection bc unmarshal has limited number of types.

Let me know if interested in something and I can PR.

jba commented 7 years ago

We're definitely interested. Unfortunately we don't accept github PRs; we use Gerrit. See CONTRIBUTING.md.

Let me know if that's too much trouble and we'll find another way.

schmohlio commented 7 years ago

no problem, noticed the contributing.md

jba commented 7 years ago

I've finally started to devote some serious attention to this.

We now have a fields package that correctly enumerates the exported fields of a struct according to the Go rules, including embedding. So correct treatment of embedding now falls out as a consequence of using that package. See https://code-review.googlesource.com/9613.

The fields package can be configured to parse tags, so tags can easily be added. But I'd like to see a design doc/issue. The tags need to work across inference, loading and saving. Just supporting an alternative name with a tag, like bigquery:"othername", would be trivial, but I would want to make sure that had value. It clearly does in JSON because many JSON objects use lower-case keys, but I don't know if any bigquery users care. Especially since loading a row into a struct will use case-insensitive comparison of column name to field name.

@schmohlio @nightlyone I understand now that for both of you, the goal is to infer a schema from JSON, not a Go struct. Supporting Go structs via reflection is in the tradition of other Go packages like encoding/json and datastore, so I feel that it belongs in this package. However, I don't think that inferring a schema from a JSON object fits in that category, useful though it may be. And the function can easily be provided outside of this package. So I'd like to keep it that way. @schmohlio, I recommend you make your package production-ready and available via go get, and we will be happy to point interested folk to it.

@mcgreevy @zombiezen your thoughts?

schmohlio commented 7 years ago

@jba makes sense to me!

nightlyone commented 7 years ago

@jba My goal has been to infer a BigQuery schema from the same struct, that has been used to write out a JSON stream. This infered schema should then be used to read in this data from GCS via a BigQuery Load job.

jba commented 7 years ago

@nightlyone I see. That actually seems like a nice idea. I'm glad that the forthcoming embedding addition aligns BigQuery and encoding/json more.

If you only use "json" tags for case, then you shouldn't need "bigquery" tags, because column names are case-insensitive. If we did add tags, I doubt very much we'd honor "json" tags, so at a minimum you'd have redundant tags. Maybe you could tell me exactly how tags would help you (or point me to the comment where you've already said that, if I missed it).

nightlyone commented 7 years ago

Adding "bigquery" tags to those structs is no issue and honoring "json" tags would not be needed. What would be useful is the ignorance tag "-" like in JSON. That all would allow us to use only one structure instead of 2.