golang / protobuf

Go support for Google's protocol buffers
BSD 3-Clause "New" or "Revised" License
9.64k stars 1.58k forks source link

feature request: embed raw protobuf instead of putting the data in hexadecimal #1398

Open karelbilek opened 2 years ago

karelbilek commented 2 years ago

Is your feature request related to a problem? Please describe. When doing code review or git diff, the changes in hexadecimal raw protobuf are big and take a lot of screen space.

I also think (but not that sure) that the additional encoding takes more space in git repository for no reason.

The file is binary, it should be saved as binary when go now allows it since 1.16.

If the files were saved as binary and just embedded with //go:embed, it would simplify the files.

However, this would limit the result to go 1.16; so it would need to still be optional.

Describe the solution you'd like I have already made PR here earlier this month, which was not commented so far

https://go-review.googlesource.com/c/protobuf/+/369634

Additional context I probably shouldn't have made a PR without asking here first, but I did. The PR seems to be working when I tested it.

I'm not sure how to get the filename at one point without doing path.Split, but that's an implementation detail.

puellanivis commented 2 years ago

I do not think that protobuf is ready to write off supporting all go versions before 1.16. Even making it optional is not particularly a good idea, because a variety of go versions could all be pointing at the same generated code, meaning various public protobufs wouldn’t be able to benefit from this, and a new public protobuf that uses the feature could isolate a lot of users.

It does take up more space in the git repo, a quick calculation a single byte becomes two bytes, so it is easily a 2× overhead of space. However, I have yet to see any situation where this would be a primary concern. We’re not really living in the days of squeezing bytes of storage. I think most people have an overabundance of persistant storage now, which is exactly how github can even operate.

karelbilek commented 2 years ago

Even making it optional is not particularly a good idea, because a variety of go versions could all be pointing at the same generated code, meaning various public protobufs wouldn’t be able to benefit from this, and a new public protobuf that uses the feature could isolate a lot of users.

As a code maintainer and code owner, I should be able to make that decision.

I think the go:embed is a nice feature and protoc-gen-go should be able to use it, as the generated code is just nicer. (I am not aware of implementation details of go:embed, so not sure if there are some other positives/negatives there.)

But it's just my opinion, in the end, you are the maintainers of this repo :)

leighst-anchorage commented 1 year ago

Another reason to do this is it would eliminate merge conflicts which are otherwise unavoidable.

In a large monorepo without something like bazel, committing generated files is very nice because it reduces impact on people who are not making changes to protobuf files (who would otherwise have to regenerate frequently).

Making it optional seems reasonable? In the case you mention, unless im mistaken, the provider / protobuf author would simply choose not to use it. In our case where we control all clients, it wouldnt be an issue.

karelbilek commented 1 year ago

Well, you would have the merge conflicts anyway, just with raw binary files rather than the go code. That wouldn't change.

The only difference is that it won't show up in git diff etc as it does now, and would be more clean.

If you don't commit generated files, nothing would change...

leighst-anchorage commented 1 year ago

I thought the hexadecimal rawDesc field was just a hex encoded copy of the input .proto file?

If that's the case could we not just embed the .proto file directly?

Or am I mistaken?

neild commented 1 year ago

The rawDesc is the wire-encoded FileDescriptorProto.

leighst-anchorage commented 1 year ago

I see. My mistake - thanks.

dsnet commented 1 year ago

The logic for formatting the rawDesc simply chunks it on 16B boundaries. A single byte insertion/removal at the start will change all subsequent bytes afterwards. If we want to reduce the amount of line diffs, we could insert newlines based on a rolling hash of the content. That way, the chunking is content-based, rather than position based.

karelbilek commented 1 year ago

@dsnet but the content is gzipped, so it's hard to tell what changed

dsnet commented 1 year ago

It used to be in the past, but not anymore. The space savings from performing gzip was not worth the cost in extra initialization time decompressing the descriptors.

The deprecated Descriptor methods return a gzipped descriptor, but does so lazily. In fact, it simply slaps the gzip headers on the raw descriptors and does no compression what-so-ever.