bigquery/storage/managedwriter: managedwriter should be compatible with protoc-gen-bq-schema

ar1nd4m commented 1 year ago

Is your feature request related to a problem? Please describe. It seems that when a protobuf is used to generate a BigQuery schema using https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema, the schema generated is incompatible with the managedwriter stream that can be created via adapt.NormalizeSchema().

This makes managedwriter less useful in being able to write to a BQ table created using protoc-gen-bq-schema.

This is because there is currently a difference in how proto fields are converted to bigquery fields in protoc-gen-bq-schema vs in managedwriter.

protoc-gen-bq-schema converts google.protobuf.Duration to STRING (https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema/blob/master/pkg/converter/convert.go#L20), however managedwriter/protoconversion only handles a subset and doesn't handle Duration or Timestamp properly (https://github.com/googleapis/google-cloud-go/blob/main/bigquery/storage/managedwriter/adapt/protoconversion.go#L89)

Describe the solution you'd like What is needed here is to bring feature parity between the BQ schema and the protoconversion.go module, which means adding/changing support for the following translations:

        ".google.protobuf.Duration"  -->   "STRING",
        ".google.protobuf.Timestamp" -->   "TIMESTAMP",

Describe alternatives you've considered I do not know of any other ideas.

Additional context We generate a very very large BQ schema (impossible to generate by hand) from a very well-used proto. This makes it very difficult to write without specifying the schema or dropping many of the fields.

shollyman commented 1 year ago

The better solution here is for the backend to fully support consumption of the well known protobuf types, rather than transforming. This is tracked internally as 193064992, I'll check to see if there's been any progress here.

shollyman commented 1 year ago

Feedback from the backend team is that it would be good is to understand the destination use cases in more detail here. You indicate that you want google.protobuf.Duration to map as a string. Is the BigQuery schema actually a STRING, or is it something like an INTERVAL?

Similarly, it would be good to know how the google.protobuf.Timestamp types are consumed. You seem to indicate TIMESTAMP, but are there other mappings in use?

ar1nd4m commented 1 year ago

Hi, I replied on the internal ticket as well, but https://protobuf.dev/programming-guides/proto3/#json is the specification that describes this behavior.

The BigQuery schema is actually a STRING (not an INTERVAL) for both the types google.protobuf.Timestamp and google.protobuf.Duration.

well-known protobuf support would not solve this particular issue, as we need this to be compatible with whatever https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema generates - and it generates a STRING for both these types, with the expectation that the data is encoded as per the above standard.

shollyman commented 1 year ago

Arguably protoc-gen-bq-schema is also in need of updates here. A google.protobuf.Timestamp should likely produce a schema with a TIMESTAMP, and a google.protobuf.Duration should likely map to INTERVAL. Regardless, what you're highlighting is the need to customize the descriptor in some fashion according to user-specific needs.

The adapt.NormalizeDescriptor() function consumes a proto descriptor and outputs the normalized DescriptorProto message. You can certainly manipulate that output by walking the fields and making additional changes.

The part that's not clear to me is how you prepare the serialized messages. Are you simply serializing messages defined with this large source proto descriptor? I don't understand how you're dealing with these type changes in this case, if so. Or are you building a dynamic descriptor from the normalized DescriptorProto and rewriting the message dynamically?

yirutang commented 1 year ago

About Duration to String conversion, I am wondering how it can be meaningful for queries? Maybe the protoc-gen-bq-schema needs to be fixed?

fredrikaverpil commented 1 week ago

Out of curiosity, what is the state of this today, more than a year later. It looks like we get a schema with TIMESTAMP from google.protobuf.Timestamp.

And on a different note, how experimental are we talking here, since that line of code was written 3 years ago? https://github.com/googleapis/google-cloud-go/blob/3d6154588da598092e5cffe6c1c5770998037fe1/bigquery/storage/managedwriter/adapt/doc.go#L18

googleapis / google-cloud-go

bigquery/storage/managedwriter: managedwriter should be compatible with protoc-gen-bq-schema #7766