protocolbuffers / protobuf

Protocol Buffers - Google's data interchange format
http://protobuf.dev
Other
65.74k stars 15.51k forks source link

protoc should not populate `json_name` in descriptor if no `json_name` was specified #5587

Open zellyn opened 5 years ago

zellyn commented 5 years ago

What version of protobuf and what language are you using? Version: v3.6.0 Language: any

What operating system (Linux, Windows, ...) and version? n/a

What runtime / compiler are you using (e.g., python version or gcc version) n/a

What did you do? Steps to reproduce the behavior:

test.proto:

syntax = "proto3";

package testing;

message Animal {
  string name = 1;
  bool has_hooves = 2;
  int32 birth_year = 3 [json_name = "year_of_birth"];
}

Command:

protoc -o /dev/stdout test.proto | protoc --decode=google.protobuf.FileDescriptorSet descriptor.proto

Output:

file {
  name: "test.proto"
  package: "testing"
  message_type {
    name: "Animal"
    field {
      name: "name"
      number: 1
      label: LABEL_OPTIONAL
      type: TYPE_STRING
      json_name: "name"
    }
    field {
      name: "has_hooves"
      number: 2
      label: LABEL_OPTIONAL
      type: TYPE_BOOL
      json_name: "hasHooves"
    }
    field {
      name: "birth_year"
      number: 3
      label: LABEL_OPTIONAL
      type: TYPE_INT32
      json_name: "year_of_birth"
    }
  }
  syntax: "proto3"
}

What did you expect to see It should be possible to tell whether json_name was specified in the .proto file. In particular, there should be a way to distinguish that json_name was specified for birth_year, and not specified for has_hooves.

What did you see instead? It is impossible to tell whether json_name was specified in the .proto file, because protoc invents a json_name for each field, regardless.

Anything else

  1. This makes it impossible to create a json serializer that prefers the vanilla protobuf field names (with underscores, by convention), but lets individual fields be overridden. Instead, the default is the weird java-ish mangled version of field names. I'm fine with that being the default, but other options should be possible. For instance, in Go's jsonpb, there is an OrigName option. But it's all-or-nothing. If you opt out of the java-ish names, you also opt out of being able to override a name.
  2. This bloats the serialized descriptor. The language runtimes can manufacture default json field names just fine.
  3. This makes it less possible to recreate something close to the original .proto file from the descriptor.
ObsidianMinor commented 5 years ago

The one thing this issue fails to address is how this works in a language neutral way. If two language runtimes fail to make the same default json field name because of a quirk with one runtime then they can't work together. Sure that might be a bug with the quirked runtime, but the whole issue could just be avoided if the runtime was given the same json name as the other runtimes via protoc.

On top of that, calculating a json name takes time. It may not be a long time, but it is time and memory. It's not time and memory to calculate it ahead of time and reference it as a constant value (depending on the runtime).

The last point doesn't seem like much of an issue. You should be distributing .proto files to anyone that needs them. The only people I know that can't get .proto files distributed to them are people writing Steam bots and Pokemon Go bots. Feel free to list any legitimate reasons people should be able to more easily reverse engineer a .proto file from a descriptor.

zellyn commented 5 years ago

The biggest complaint is that it makes it impossible to tell whether a user specified a json_name or not. I suppose it would be possible to add an additional bool explicit_json_name field or something.

We have a lot of code still on protobuf 3.0.0, and when we try to upgrade to a newer protobuf release, all the json field names change. We can use the OrigName option in jsonpb, but that will completely disallow any fieldname customization. What we really need is a PreferOrigName option that still allows explicit field renaming. But the fact that protoc is synthesizing stuff into the descriptor precludes the possibility of implementing such an option, either in upstream golang/protobuf or even in our own custom serialization code.

As far as the time burden of calculating json names… in compiled runtimes that often happens at generation time (it does in Go, currently), and even in interpreted languages, it's surely highly cacheable. Besides, they all have to have the code/machinery to be able to handle descriptors that lack json_name anyway, so it's already there.

It's true that recreating a .proto file from the descriptor is uncommon (although they do/can include all kinds of comment and line number information in the SourceCodeInfo message, which seems built for that purpose); I guess I was trying to articulate the fact that it “feels wrong” to introduce extra synthetic information into the descriptor.

TeBoring commented 5 years ago

https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/compiler/command_line_interface.cc#L1972 I think we can have a command line option to decide whether to include json name in the output. @acozzette how do you think?

zellyn commented 5 years ago

Yeah, I think a commandline option is perfect. It should always reflect the json_name if it was specified in the .proto file, but the commandline option could turn off the synthesis of googleCase defaults.

zellyn commented 5 years ago

Any updates on this? Would sending a PR help?

acozzette commented 5 years ago

Sorry this slipped off my radar. Usually we try to avoid new options but maybe it is justified in this case. I'm guessing if we just removed json_name unconditionally then things might break. @haon4 What do you think?

ghost commented 5 years ago

I don't think we should change the current behavior as users are already relying on it. I am also not a fan of adding more command line option, as this doesn't seem like a common feature that other people are asking for, and I really don't like to add more switch for one use-case.

Can this problem be solved at the application level rather than at the protobuf level? Does it really matter if the json_name was explicitly specified or not? If you make the lowerCamelCase transformation on the field name and that matches the json_name option, can you assume that the json_name is not specified? Even if the json_name is explicitly specified in that case, does it matter?

zellyn commented 5 years ago

I thought I described why earlier, but no, it is not solvable at an application level, because the way protoc currently populates json_name for all fields makes it impossible to tell whether any given field actually had a json_name property.

A commandline flag would be perfect.

zellyn commented 5 years ago

btw, if y'all don't add a commandline option, we will be forced to fork protoc internally, and add one 😞 Either that or specify a json_name on every single existing field that includes underscores (also 😞). I'm happy to rough together a PR adding a commandline option, if that would help 🙂.

ghost commented 5 years ago

Fair enough. If you can put together a PR for this command line option, it may help. Point taken that this can reduce the size bloat on the serialized descriptor, and most languages should already handle the fallback in case json_name is not there, so I think that may be something we would want to enable by default in the long term. I see there are still many places that are making the assumption about json_name being there though, but having the command line option may enable us to slowly knock them out.

bufdev commented 5 years ago

I think removing json_name from returned FileDescriptorSets is unfortunately a breaking change at this point - external plugins could totally be relying on that to remove the need to compute json_name themselves. @jhump

Another option would be to add a field to google.protobuf.FieldDescriptorProto that is something like i.e. optional bool json_name_explicitly_set = 11; that protoc could populate, and plugins could make decisions based on this. If this was done, the builtin plugins should handle this field as well re: #6175. This seems dirty, but may actually be cleaner than a protoc command-line flag, and it is kind of how Protobuf is intended to be used (you can't change the json_name field, but you can iterate on your message definition by adding new fields).

ghost commented 5 years ago

Yea, I think having something like optional bool json_name_explicitly_set = 11; would be quite nice, actually. I would be happy with that.

bufdev commented 5 years ago

On second thought, I think it has to be the negative for backwards compatibility reasons, unless I'm confusing myself, ie optional bool json_name_NOT_explicitly_set = 11;. If you don't have this, then you can't tell the difference between explicitly_set == false being because of using an old version of protoc, or because it was actually not set. I might be confusing myself though in double-negatives.

The work as I see it:

Note that for backwards compatibility, when running a plugin (both the builtins and external plugins), you need to do the following with this (in pseudo-code):

if get(jsonNameField) != "" {
  // if this is false, it could be due to not being set at all with older versions of protoc,
  // in which case we have to act as if the jsonName was explicitly set
  if get(jsonNameNotExplicitlySetField) { 
    actAsIfNotSet()
  } else {
    actAsIfSet()
  }
}
jhump commented 5 years ago

I think it has to be the negative for backwards compatibility reasons

Remember that descriptor.proto is proto2 syntax. So it could just use a default value:

optional bool json_name_configured = 11 [default=true];
bufdev commented 5 years ago

Good point. One issue though is that makes the definition a tad weird - even if no FileDescriptorSet provider (ie protoc et al) would produce this, what does it mean when json_name == “” && json_name_configured == true?

jhump commented 5 years ago
string foo = 1 [json_name=""];

:P

(Related: https://github.com/protocolbuffers/protobuf/issues/5682)

bufdev commented 5 years ago

Lol

zellyn commented 5 years ago

I'm fine with json_name_explicitly_set with a default of false. If you're in this situation, you can probably just start using a new version of protoc. Anything I can help do to push this forward?

bufdev commented 5 years ago

I think it needs to be json_name_not_explicitly_set, this does matter for being able to really reason about the value of the field, see above

blacktooth commented 2 years ago

@elharo Do you know if someone is still working on this? Can we expect this config to be part of future releases?

I am willing to contribute if this is not being worked on.

elharo commented 2 years ago

I don't think anyone is working on this. However, reading through this now I confess I don't see a use case for this feature. That is, why does this matter? Until I can achieve clarity on that I don't think I'd be likely to approve any such PR. Perhaps someone else sees something here I don't.

My thinking (not dispositive) is as follows:

  1. Every field has a json_name which is fully determinable by reading the .proto file.
  2. When protoc emits a descriptor as JSON, it includes the json_name.
  3. The json_name in the .proto can be calculated or explicitly specified, but it doesn't matter which nor should any processing depend on that syntax detail.

I'm not sure about the "fully" in step 1. There are command line options that change this, but that doesn't change my conclusion. This feature simply doesn't seem to be needed, likely encourages bad code and non-interoperable protos, and is unlikely to be worth the added complexity in code.

Metaphorically it's like trying to change the byte code emitted by javac so that we can tell whether it was indented with spaces or tabs.

jhump commented 2 years ago

Metaphorically it's like trying to change the byte code emitted by javac so that we can tell whether it was indented with spaces or tabs.

@elharo, I think the fact that an artifact of this is that protoc itself produces different code based on whether the input is the original source vs. a file descriptor set means this is also a consistency issue. And that's a problem because it means that strategies used by build tools could result in different (non-deterministic) output. That case is described more thoroughly in #6175 (which was closed as a duplicate of this issue).

blacktooth commented 2 years ago

We have built parsers that can read a file descriptor set and re-create the original .proto schema. This feature enables users to auto-discover schemas during stream processing and register them in the schema registry. Since the file descriptor set generation is different for the same .proto schema, it results in different schemas getting generated. This results in inconsistencies and non-deterministic behavior as called out by @jhump.

The Protobuf code base also acknowledges it as a bug and a work around was introduced to deal with it.

Since this change will be non-backward compatible, I recommend that we add the boolean flag json_name_explicitly_set (or a better name) as suggested previously.

Reference

Generating proto schema from FileDescriptor

elharo commented 2 years ago

@blacktooth "Since the file descriptor set generation is different for the same .proto schema, it results in different schemas getting generated." What exactly do you mean by different here? Are some fields present in one and not the other? Are the fields renumbered? Are the fields reordered? Is the white space changed? Are comments added or removed?

I don't expect the results would be byte-per-byte identical, and some of these changes are more serious than others.

elharo commented 2 years ago

For future reference, relevant code seems to be in protobuf/compiler/command_line_interface.cc void CommandLineInterface::GetTransitiveDependencies

rdesgroppes commented 1 year ago

Any progress on this?

fowles commented 1 year ago

I don't think anyone is working on this right now

rdesgroppes commented 1 year ago

Sad. #6175 was closed as a dup of this one, but summarizes the problem pretty well:

protoc: code gen is different between using source files and using descriptor set

github-actions[bot] commented 6 months ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

rdesgroppes commented 6 months ago

/keepopen

zellyn commented 4 months ago

Coming back to this after a long time. Reading the comments here and on related issues, it seems there is a loose consensus among external tools that if the json_name in the protobuf descriptor matches the auto-generated one, then the user/.proto-file probably didn't actually specify a json_name explicitly.

What if we simply codified that?

Here's my proposal:

If a field in the .proto file explicitly specifies a json_name, and that name is the same as the one that would have been auto-generated for the field, then protoc should add the json_name_explicitly_set attribute to that field.

This would allow existing tooling to be mostly unchanged, and would provide a solution in the rare cases when the heuristic adopted by external tooling goes wrong: re-generate your proto descriptor with a more recent version of protoc, and check for json_name_explicitly_set.

github-actions[bot] commented 1 month ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago. This issue will be closed and archived after 14 additional days without activity.

rdesgroppes commented 1 month ago

/keepopen - mind above proposal