hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.86k stars 1.95k forks source link

`make proto` generates incompatible protobufs with buf v0.41.0 #10293

Open schmichael opened 3 years ago

schmichael commented 3 years ago

Nomad version

Nomad v1.1.0-dev

Issue

Using the wrong version of buf to compile protobufs causes an agent panic.

As of Nomad v1.1.0-dev we need to stick to buf ~v0.30.1~ v0.36.0 which is now installed by make bootstrap.

Reproduction steps

  1. Install buf v0.41.0 and run make proto
  2. Try to use an external task driver plugin

Expected Result

It works. Our wire protocol should always be forward/backward compatible.

Actual Result

The Nomad agent panics with:

panic: protobuf tag not enough fields in Timestamp.state:

goroutine 350 [running]:
github.com/golang/protobuf/proto.(*unmarshalInfo).computeUnmarshalInfo(0xc000142640)
        go/src/github.com/hashicorp/nomad/vendor/github.com/golang/protobuf/proto/table_unmarshal.go:332 +0x1777
github.com/golang/protobuf/proto.(*unmarshalInfo).unmarshal(0xc000142640, 0xc0009c6280, 0xc000850154, 0xc, 0x12, 0x0, 0x2c003b8)
        go/src/github.com/hashicorp/nomad/vendor/github.com/golang/protobuf/proto/table_unmarshal.go:136 +0xf45
github.com/golang/protobuf/proto.makeUnmarshalMessagePtr.func1(0xc000850154, 0x12, 0x13, 0xc0001279b0, 0x2, 0xc000507300, 0x2, 0xc000a90320, 0x198, 0x8)
        go/src/github.com/hashicorp/nomad/vendor/github.com/golang/protobuf/proto/table_unmarshal.go:1646 +0x12b
github.com/golang/protobuf/proto.(*unmarshalInfo).unmarshal(0xc000142500, 0xc0001279a0, 0xc000850152, 0x14, 0x14, 0x0, 0x2c003b8)
        go/src/github.com/hashicorp/nomad/vendor/github.com/golang/protobuf/proto/table_unmarshal.go:173 +0x88e
github.com/golang/protobuf/proto.makeUnmarshalMessagePtr.func1(0xc000850152, 0x14, 0x15, 0xc0006a36b0, 0x2, 0xc00007e9c0, 0x756ea17d84c242, 0x8013eed091f202a3, 0x0, 0xc00007e888)
        go/src/github.com/hashicorp/nomad/vendor/github.com/golang/protobuf/proto/table_unmarshal.go:1646 +0x12b
github.com/golang/protobuf/proto.(*unmarshalInfo).unmarshal(0xc0001423c0, 0xc0006a36b0, 0xc000850150, 0x16, 0x16, 0x40f36c, 0x3c05390)
        go/src/github.com/hashicorp/nomad/vendor/github.com/golang/protobuf/proto/table_unmarshal.go:173 +0x88e
github.com/golang/protobuf/proto.(*InternalMessageInfo).Unmarshal(0x3bd0240, 0x2bc0dc8, 0xc0006a36b0, 0xc000850150, 0x16, 0x16, 0x1, 0x7ff3f8284bb8)
        go/src/github.com/hashicorp/nomad/vendor/github.com/golang/protobuf/proto/table_unmarshal.go:63 +0x66
github.com/hashicorp/nomad/plugins/drivers/proto.(*TaskStatsResponse).XXX_Unmarshal(0xc0006a36b0, 0xc000850150, 0x16, 0x16, 0xc0006a36b0, 0x2705101)
        go/src/github.com/hashicorp/nomad/plugins/drivers/proto/driver.pb.go:1139 +0x65
github.com/golang/protobuf/proto.(*Buffer).Unmarshal(0xc000626f38, 0x2bc0dc8, 0xc0006a36b0, 0x0, 0x0)
        go/src/github.com/hashicorp/nomad/vendor/github.com/golang/protobuf/proto/decode.go:399 +0xa3
google.golang.org/grpc/encoding/proto.codec.Unmarshal(0xc000850150, 0x16, 0x16, 0x25a6fe0, 0xc0006a36b0, 0x2, 0x2)
        go/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/encoding/proto/proto.go:93 +0x130
google.golang.org/grpc.recv(0xc000710fc0, 0x7ff3f8378d90, 0x3c05458, 0xc0004b5b00, 0x0, 0x0, 0x25a6fe0, 0xc0006a36b0, 0x7fffffff, 0x0, ...)
        go/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/rpc_util.go:711 +0x118
google.golang.org/grpc.(*csAttempt).recvMsg(0xc0004b3880, 0x25a6fe0, 0xc0006a36b0, 0x0, 0xc011f76016e2d671, 0x35d3bec09)
        go/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:885 +0xee
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc0004b3880, 0x8, 0x2823a2a)
        go/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:736 +0x46
google.golang.org/grpc.(*clientStream).withRetry(0xc0000d0fc0, 0xc0004d9e80, 0xc0004d9e50, 0x30, 0x28)
        go/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:594 +0x9f
google.golang.org/grpc.(*clientStream).RecvMsg(0xc0000d0fc0, 0x25a6fe0, 0xc0006a36b0, 0xc00035ab08, 0xc0008cff7a)
        go/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:735 +0x105
github.com/hashicorp/nomad/plugins/drivers/proto.(*driverTaskStatsClient).Recv(0xc000632e10, 0xc0008cff20, 0x4df569, 0xc0005eaa08)
        go/src/github.com/hashicorp/nomad/plugins/drivers/proto/driver.pb.go:4048 +0x62
github.com/hashicorp/nomad/plugins/drivers.(*driverPluginClient).handleStats(0xc00049e840, 0x2bcdf78, 0xc0008785c0, 0xc000403560, 0x2be6e58, 0xc000632e10)
        go/src/github.com/hashicorp/nomad/plugins/drivers/client.go:292 +0x7d
created by github.com/hashicorp/nomad/plugins/drivers.(*driverPluginClient).TaskStats
        go/src/github.com/hashicorp/nomad/plugins/drivers/client.go:284 +0x3c5

Generating protos with buf v0.41.0

tgross commented 3 years ago

So I tried to figure out what might be happening here from source changes, but what makes that particularly challenging is that buf is not in any way developed in the open. The commit history is mostly just them syncing back and forth from their internal repo with giant squashed changes.

I'm going to propose that we remove buf and go back to using protoc directly. The entire value proposition for buf is that it's supposed to help us avoid backwards incompatible changes, yet they broke our build and don't make it easy for us to figure out what's gone wrong with it.

tgross commented 3 years ago

After an internal discussion we decided to hold off on tearing it out until we could at least file an issue with upstream, and it turns out that it's apparently not a simple round-tripping problem with buf. It looks like the gRPC library is in the mix here... still trying to build a minimal repro.

tgross commented 3 years ago

Unfortunately I was able to eliminate gRPC from the mix and the problem can be reproduced entirely with buf alone. I've built a minimal reproduction here https://github.com/tgross/bufbreaks that generates a protobuf with one version and reads it in with another.

I've opened https://github.com/bufbuild/buf/issues/315 with upstream to get their thoughts.

$ make test
go build -trimpath -tags old -o bin/old .
go build -trimpath -tags new -o bin/new .
./bin/old write
./bin/new read
panic: protobuf tag not enough fields in Timestamp.state:

goroutine 1 [running]:
github.com/golang/protobuf/proto.(*unmarshalInfo).computeUnmarshalInfo(0xc0000ce460)
        github.com/golang/protobuf@v1.5.0/proto/table_unmarshal.go:332 +0x1777
github.com/golang/protobuf/proto.(*unmarshalInfo).unmarshal(0xc0000ce460, 0xc0000b2240, 0xc0000ee00d, 0xb, 0x1f3, 0xc0000ee004, 0x7)
        github.com/golang/protobuf@v1.5.0/proto/table_unmarshal.go:136 +0xf45
github.com/golang/protobuf/proto.makeUnmarshalMessagePtr.func1(0xc0000ee00d, 0xb, 0x1f4, 0xc0000b2210, 0x2, 0xc0000ee00b, 0xd, 0x1f5, 0x0, 0x0)
        github.com/golang/protobuf@v1.5.0/proto/table_unmarshal.go:1646 +0x12b
github.com/golang/protobuf/proto.(*unmarshalInfo).unmarshal(0xc0000ce3c0, 0xc0000b2200, 0xc0000ee002, 0x16, 0x1fe, 0x0, 0x12aa320)
        github.com/golang/protobuf@v1.5.0/proto/table_unmarshal.go:173 +0x88e
github.com/golang/protobuf/proto.makeUnmarshalMessagePtr.func1(0xc0000ee002, 0x16, 0x1ff, 0xc00009c870, 0x2, 0xc0000c5d68, 0x756ea1d8d90c87, 0x203000, 0x203000, 0x203000)
        github.com/golang/protobuf@v1.5.0/proto/table_unmarshal.go:1646 +0x12b
github.com/golang/protobuf/proto.(*unmarshalInfo).unmarshal(0xc0000ce320, 0xc00009c870, 0xc0000ee000, 0x18, 0x200, 0x100a06c, 0x141c920)
        github.com/golang/protobuf@v1.5.0/proto/table_unmarshal.go:173 +0x88e
github.com/golang/protobuf/proto.(*InternalMessageInfo).Unmarshal(0x13eea20, 0x12a7108, 0xc00009c870, 0xc0000ee000, 0x18, 0x200, 0x4a31901, 0x4a39058)
        github.com/golang/protobuf@v1.5.0/proto/table_unmarshal.go:63 +0x66
main.(*TaskStatsResponse).XXX_Unmarshal(0xc00009c870, 0xc0000ee000, 0x18, 0x200, 0xc00009c870, 0xc0000c5e01)
        bufbreaks/structs.new.pb.go:41 +0x65
github.com/golang/protobuf/proto.Unmarshal(0xc0000ee000, 0x18, 0x200, 0x12a7108, 0xc00009c870, 0xc0000c5f40, 0x11f8f7a)
        github.com/golang/protobuf@v1.5.0/proto/decode.go:337 +0x1aa
main.deserialize(0xc0000ee000, 0x18, 0x200)
        bufbreaks/serde.new.go:33 +0x85
main.main()
        bufbreaks/main.go:34 +0xcb
make: *** [test] Error 2
bufdev commented 3 years ago

Just to reiterate the points in https://github.com/bufbuild/buf/issues/315 here:

On the point of not being developed in the open, we'd argue that we are as much as possible, we have a backend monorepo for both our public and private code, and yes do sync it across, but obviously there's nuance.

bufdev commented 3 years ago

Providing some more detail here from 315:

Nothing has changed about the the structure of google.protobuf.Timestamp, except Google did introduce a change to the go_package value in v3.14.0, updating it to point to google.golang.org/protobuf/types/known/timestamppb, which is a breaking change. This shouldn't affect your above code, however - this has nothing to do with serialization or the wire compatibility of Timestamp.

Looking through https://github.com/tgross/bufbreaks/blob/trunk/serde.new.go, it appears you are using the old github.com/golang/protobuf/proto with the new Timestamp type, which could cause a problem, you should be using the new https://pkg.go.dev/google.golang.org/protobuf/proto#Marshal with new types.

As to buf not breaking here, we actually agree - even though this is just changing the go_package value, i.e. where the generated code for Timestamp is expected to be, we detect this as a breaking change:

buf breaking --against v3.13.0 v3.14.0
v3.14.0/google/protobuf/any.proto:36:1:File option "go_package" changed from "github.com/golang/protobuf/ptypes/any" to "google.golang.org/protobuf/types/known/anypb".
v3.14.0/google/protobuf/api.proto:43:1:File option "go_package" changed from "google.golang.org/genproto/protobuf/api;api" to "google.golang.org/protobuf/types/known/apipb".
v3.14.0/google/protobuf/compiler/plugin.proto:53:1:File option "go_package" changed from "github.com/golang/protobuf/protoc-gen-go/plugin;plugin_go" to "google.golang.org/protobuf/types/pluginpb".
v3.14.0/google/protobuf/descriptor.proto:44:1:File option "go_package" changed from "github.com/golang/protobuf/protoc-gen-go/descriptor;descriptor" to "google.golang.org/protobuf/types/descriptorpb".
v3.14.0/google/protobuf/duration.proto:37:1:File option "go_package" changed from "github.com/golang/protobuf/ptypes/duration" to "google.golang.org/protobuf/types/known/durationpb".
v3.14.0/google/protobuf/empty.proto:36:1:File option "go_package" changed from "github.com/golang/protobuf/ptypes/empty" to "google.golang.org/protobuf/types/known/emptypb".
v3.14.0/google/protobuf/field_mask.proto:40:1:File option "go_package" changed from "google.golang.org/genproto/protobuf/field_mask;field_mask" to "google.golang.org/protobuf/types/known/fieldmaskpb".
v3.14.0/google/protobuf/source_context.proto:40:1:File option "go_package" changed from "google.golang.org/genproto/protobuf/source_context;source_context" to "google.golang.org/protobuf/types/known/sourcecontextpb".
v3.14.0/google/protobuf/struct.proto:37:1:File option "go_package" changed from "github.com/golang/protobuf/ptypes/struct;structpb" to "google.golang.org/protobuf/types/known/structpb".
v3.14.0/google/protobuf/timestamp.proto:37:1:File option "go_package" changed from "github.com/golang/protobuf/ptypes/timestamp" to "google.golang.org/protobuf/types/known/timestamppb".
v3.14.0/google/protobuf/type.proto:44:1:File option "go_package" changed from "google.golang.org/genproto/protobuf/ptype;ptype" to "google.golang.org/protobuf/types/known/typepb".
v3.14.0/google/protobuf/wrappers.proto:47:1:File option "go_package" changed from "github.com/golang/protobuf/ptypes/wrappers" to "google.golang.org/protobuf/types/known/wrapperspb".

We don't think a change should have been made to the Well-Known Types, ever, but are hands of kind of tied here. The Protobuf ecosystem relies on the WKTs being built-in to the compiler (although protoc relies on them being at ../include relative to the binary, which is even more broken), and Google made this upgrade. If there was a wire incompatibility (which there is not here, your problem is something else), and you use protoc in a similar way, it will break as well. But given that users use the WKTs extensively, and not upgrading would be even more of a "breaking" change for the majority of users given how the Protobuf ecosystem works, we have no choice but to choose the lesser of two evils.

One way to solve this, if you are concerned with WKT breakages, is to vendor the WKTs yourself:

# This is actually the best way, not kidding
$ mkdir -p tmp && \
  cd ./tmp && \
  curl -sSL https://github.com/protocolbuffers/protobuf/releases/download/v3.15.8/protoc-3.15.8-linux-x86_64.zip -o protoc.zip && \
  unzip protoc.zip >/dev/null && \
  mv ./include/google ~/your/repo/third_party/proto/google

And then add third_party/proto as a root:

# buf.yaml
version: v1beta1
build:
  roots:
    - third_party/proto

If you have the WKTs included yourself, both buf and protoc will ignore their builtins and use these instead.

Of note though, WKTs have never had a wire-incompatible change in over a decade.

bufdev commented 3 years ago

I'm trying to hunt down your issue, and one thing is that you are not versioning protoc-gen-go in https://github.com/tgross/bufbreaks/blob/trunk/Makefile - I am assuming you are using the old protoc-gen-go?

bufdev commented 3 years ago

I've posted an overview of the problem at https://github.com/tgross/bufbreaks/pull/1

bufdev commented 3 years ago

Another option at https://github.com/tgross/bufbreaks/pull/2

dsnet commented 3 years ago

I'm not sure the changed go_package option is necessarily the root cause. At least the stack trace https://github.com/hashicorp/nomad/issues/10293#issuecomment-832801234 suggest that incompatible versions due to vendoring.

I suspect the following is happening:

The fix seems to be upgrade the vendored version of github.com/golang/protobuf to v1.4.x or later.

dsnet commented 3 years ago

I should further note that this breakage isn't the fault of buf as it can occur when using protoc-gen-go directly.

tgross commented 3 years ago

Thanks @bufdev I think we can take it from here.

@schmichael I'm going to punt on this for now as we obviously don't need to make any changes for 1.1.0. But there's also a small pile of developer documentation issues this implies around third-party task driver plugins. Just because the wire protocol is compatible doesn't mean that folks trying to reuse marshal/unmarshal code from the exec task driver (which is a typical starting point for folks despite the existence of https://github.com/hashicorp/nomad-skeleton-driver-plugin) aren't going to have a bad time if they aren't using the exact same set of dependencies.