googleapis / google-cloud-go

Google Cloud Client Libraries for Go.
https://cloud.google.com/go/docs/reference
Apache License 2.0
3.72k stars 1.27k forks source link

bigquery: add a function to generate a schema from a struct. #213

Closed nightlyone closed 8 years ago

nightlyone commented 8 years ago

It would be great, if the BigQuery schema could be automatically derived via reflection from a provided example struct.

NOTE: There is already a TODO from @mcgreevy for it in the code, but I add it here so it is not forgotten and can be worked on by interested parties.

nightlyone commented 8 years ago

My first stab at this is here: https://code-review.googlesource.com/#/c/4220

mcgreevy commented 8 years ago

Thanks.

FYI I've started reviewing your change, but have not finished the first pass. I'll get back to you with comments tomorrow.

nightlyone commented 8 years ago

Some notes on design decisions I made in the first step and why.

I currently use this kind of functionality to infer a table schema from structs as it would be represented in JSON.

My goal here is: load jobs can generate the schema automatically for JSON blobs stored on Google Cloud Storage.

I plan to support non-required fields using a struct tag syntax similar to https://godoc.org/encoding/json#Marshal but using "bigquery" as the tag.

The support for TextMarshaler has been added to support elements bigquery and JSON cannot represent properly at the moment (e.g. complex numbers) but still have a useful type in Go. The fmt.Stringer interface might need to be supported, too.

I left out pointer (e.g. *T) and map support (map[K]T) as well as drilling into interfaces, as this would be too complicated for a first step.

I included MustInferSchema because that is the only useful way to use it at the moment, because the information passed in will never change at runtime and thus any failure can be seen running any test in this or starting a program using it. So I would at least included it as an example of best practice.

mcgreevy commented 8 years ago

Your plan of using struct tags to indicate non-required fields sounds reasonable, although it would need to be well documented that this "required" struct tag was only used for table creation operations, and not e.g. when actually loading data (e.g. if a user of this library were loading data into an existing table, and one of the fields was optional according to the server, but required according to the Go struct tag, I would would expect the struct tag to be ignored -- i.e. no client side validation).

I can imagine something like TextMarshaler support being useful, but it's premature at the moment. I'd like to see support for basic types first, then support streaming inserts of these same structs, before supporting more esoteric types. In the meantime, clients will need to do some data conversions, but I don't think that will be too onerous.

Regarding MustInferSchema: there are multiple ways that users may want to handle errors when inferring schemas; panicking is only one of them. If users want something like MustInferSchema, they can write it themselves: not everything that is potentially useful needs to go in the library.

nightlyone commented 8 years ago

A more open and active discussion of the default for the "required" attribute is needed.

nightlyone commented 8 years ago

Open sub-issues to finish this:

mcgreevy commented 8 years ago

@nightlyone : I've created a doc to discuss required/optional:

https://docs.google.com/document/d/1B3yg3HHQrntTw_rD1M94MhqREgg-MahrXfoiOyK_3b4/edit?usp=sharing

Please review and feel free to leave comments.

I'd like to also hear from @dsymonds on this.

nightlyone commented 8 years ago

@mcgreevy additionally to the comments left in the document, I think our use cases diverge a lot and it shows in the suggested solutions. I want to auto-create schema for JSON-blobs from the very same Go structures that generated these blobs.

Your goal seems to be codecs for query reading, table data reading and streaming writes. Let's please create those codecs in a later step and not assume their existence. It seems like a premature discussion to me.

They have also completely different APIs at the moment: table data and query reading return slices, streaming writes need maps. Conversion between both creates a lot of garbage, when we start taking the "Big" in BigQuery seriously :smile: So there is more creative work to do there.

I am going to need both in a different project and later anyway. So I would like to solve those problems in a later step and just avoid making them more difficult instead of assuming anything here.

nightlyone commented 8 years ago

@mcgreevy could you submit https://code-review.googlesource.com/#/c/4220 for me now? I don't have the permissions to do it.

jba commented 8 years ago

What's the status of this issue? From what I can tell, schema inference is implemented except for optional fields, and embedded structs. The latter is a separate issue (#238), although it mistakenly refers to nested structs, which are implemented.

So I think we can close this. If there is interest in schema inference of optional fields, we can open another issue.

@mcgreevy what do you think?

nightlyone commented 8 years ago

@jba I am all for closing it. We derived additional work already. But then I would suggest to open one for optional fields.

jba commented 8 years ago

@nightlyone I will close, and let you open since you understand the issues more than me.