Open ar1nd4m opened 1 year ago
The better solution here is for the backend to fully support consumption of the well-known protobuf types, rather than transforming them. This is tracked internally as 193064992; I'll check to see if there's been any progress here.
Feedback from the backend team is that it would be good to understand the destination use cases in more detail here. You indicate that you want `google.protobuf.Duration` to map as a string. Is the BigQuery schema actually a STRING, or is it something like an INTERVAL? Similarly, it would be good to know how the `google.protobuf.Timestamp` types are consumed. You seem to indicate TIMESTAMP, but are there other mappings in use?
Hi, I replied on the internal ticket as well, but https://protobuf.dev/programming-guides/proto3/#json is the specification that describes this behavior.
The BigQuery schema is actually a STRING (not an INTERVAL) for both `google.protobuf.Timestamp` and `google.protobuf.Duration`.
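For concreteness, here's a minimal, untested Go sketch (using the standard `protojson` package) of the string forms that spec prescribes for these two types; the quoted values in the comments, minus the JSON quotes, are what end up in those STRING columns:

```go
package main

import (
	"fmt"
	"time"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/durationpb"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	// Per the proto3 JSON spec, Duration serializes as decimal seconds with an
	// "s" suffix, and Timestamp as an RFC 3339 string.
	d, _ := protojson.Marshal(durationpb.New(3*time.Second + 500*time.Millisecond))
	fmt.Println(string(d)) // "3.500s"

	t, _ := protojson.Marshal(timestamppb.New(time.Date(2023, 1, 2, 3, 4, 5, 0, time.UTC)))
	fmt.Println(string(t)) // "2023-01-02T03:04:05Z"
}
```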
Well-known protobuf support would not solve this particular issue, as we need this to be compatible with whatever https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema generates - and it generates a STRING for both of these types, with the expectation that the data is encoded per the above standard.
Arguably protoc-gen-bq-schema is also in need of updates here. A `google.protobuf.Timestamp` should likely produce a schema with a TIMESTAMP, and a `google.protobuf.Duration` should likely map to INTERVAL. Regardless, what you're highlighting is the need to customize the descriptor in some fashion according to user-specific needs.
The `adapt.NormalizeDescriptor()` function consumes a proto descriptor and outputs the normalized DescriptorProto message. You can certainly manipulate that output by walking the fields and making additional changes, as in the sketch below.
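For illustration, an untested sketch of that kind of post-processing; the Duration-to-STRING rewrite rule here is hypothetical, just to show the shape of the walk:

```go
// Package schemaoverride: hypothetical post-processing of
// adapt.NormalizeDescriptor output.
package schemaoverride

import (
	"strings"

	"cloud.google.com/go/bigquery/storage/managedwriter/adapt"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/descriptorpb"
)

// NormalizeWithOverrides normalizes md, then rewrites any field that
// referenced google.protobuf.Duration into a plain string field, to line up
// with what protoc-gen-bq-schema emits. The rule is illustrative, not official.
func NormalizeWithOverrides(md protoreflect.MessageDescriptor) (*descriptorpb.DescriptorProto, error) {
	dp, err := adapt.NormalizeDescriptor(md)
	if err != nil {
		return nil, err
	}
	rewriteDurations(dp)
	return dp, nil
}

// rewriteDurations walks the message tree depth-first, converting Duration
// message fields into TYPE_STRING scalars.
func rewriteDurations(dp *descriptorpb.DescriptorProto) {
	for _, f := range dp.GetField() {
		if f.GetType() == descriptorpb.FieldDescriptorProto_TYPE_MESSAGE &&
			strings.HasSuffix(f.GetTypeName(), "Duration") {
			f.Type = descriptorpb.FieldDescriptorProto_TYPE_STRING.Enum()
			f.TypeName = nil
		}
	}
	for _, nested := range dp.GetNestedType() {
		rewriteDurations(nested)
	}
}
```

Note this leaves the now-unreferenced Duration entry in `nested_type`, which should be harmless, and the rows you serialize must then actually carry the string encoding in that field.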
The part that's not clear to me is how you prepare the serialized messages. Are you simply serializing messages defined with this large source proto descriptor? If so, I don't understand how you're dealing with these type changes. Or are you building a dynamic descriptor from the normalized DescriptorProto and rewriting the message dynamically?
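If it's the latter, an untested sketch of that dynamic path might look like the following; the `proto2` syntax choice, the synthetic file name, and the stand-in use of `timestamppb.Timestamp` are all assumptions, not prescribed API usage:

```go
package main

import (
	"fmt"

	"cloud.google.com/go/bigquery/storage/managedwriter/adapt"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protodesc"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/descriptorpb"
	"google.golang.org/protobuf/types/dynamicpb"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	// Normalize some message descriptor; timestamppb.Timestamp is only a
	// stand-in for the real (large) source message here.
	normalized, err := adapt.NormalizeDescriptor((&timestamppb.Timestamp{}).ProtoReflect().Descriptor())
	if err != nil {
		panic(err)
	}

	// Wrap the normalized DescriptorProto in a synthetic file so protodesc can
	// resolve it, then build a dynamic message type from it.
	fdp := &descriptorpb.FileDescriptorProto{
		Name:        proto.String("dynamic.proto"), // hypothetical file name
		Syntax:      proto.String("proto2"),        // assumption; adjust to your descriptor
		MessageType: []*descriptorpb.DescriptorProto{normalized},
	}
	fd, err := protodesc.NewFile(fdp, nil)
	if err != nil {
		panic(err)
	}
	msg := dynamicpb.NewMessage(fd.Messages().Get(0))

	// Populate fields reflectively, then marshal the bytes for the append stream.
	msg.Set(msg.Descriptor().Fields().ByName("seconds"), protoreflect.ValueOfInt64(42))
	payload, err := proto.Marshal(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("serialized %d bytes\n", len(payload))
}
```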
Regarding the Duration-to-STRING conversion, I'm wondering how meaningful it can be for queries. Maybe protoc-gen-bq-schema needs to be fixed?
Out of curiosity, what is the state of this today, more than a year later? It looks like we get a schema with TIMESTAMP from `google.protobuf.Timestamp`.
And on a different note, how experimental are we talking here, since that line of code was written 3 years ago? https://github.com/googleapis/google-cloud-go/blob/3d6154588da598092e5cffe6c1c5770998037fe1/bigquery/storage/managedwriter/adapt/doc.go#L18
**Is your feature request related to a problem? Please describe.**
It seems that when a protobuf is used to generate a BigQuery schema using https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema, the generated schema is incompatible with the managedwriter stream that can be created via `adapt.NormalizeDescriptor()`. This makes `managedwriter` less useful for writing to a BQ table created using `protoc-gen-bq-schema`.

This is because there is currently a difference in how proto fields are converted to BigQuery fields in `protoc-gen-bq-schema` vs. in `managedwriter`: `protoc-gen-bq-schema` converts `google.protobuf.Duration` to `STRING` (https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema/blob/master/pkg/converter/convert.go#L20), however `managedwriter`'s `protoconversion` only handles a subset of the well-known types and doesn't handle Duration or Timestamp properly (https://github.com/googleapis/google-cloud-go/blob/main/bigquery/storage/managedwriter/adapt/protoconversion.go#L89).

**Describe the solution you'd like**
What is needed here is to bring feature parity between the generated BQ schema and the `protoconversion.go` module, which means adding/changing support for the following translations:

- `google.protobuf.Duration` → `STRING`
- `google.protobuf.Timestamp` → `STRING`

**Describe alternatives you've considered**
I do not know of any other ideas.

**Additional context**
We generate a very large BQ schema (impossible to generate by hand) from a very well-used proto. This makes it very difficult to write without specifying the schema or dropping many of the fields.