jhump / protoreflect

Reflection (Rich Descriptors) for Go Protocol Buffers
Apache License 2.0
1.33k stars 170 forks source link

desc/builder: feature request: auto-de-duplicate builders and already-built descriptors in transitive graph #617

Closed kevin-break closed 2 months ago

kevin-break commented 2 months ago

Hey, jhump:

First let me say this is a incredible complement to standard Go API v2, and I enjoyed a lot and thank you!

I want to construct a compact transitive closure from a "start" message, "compact" meaning removing all non-relevant symbols, even they appears in the original .proto file, which is always good for storage / transfer over network.

With the following code snippet, it complained

Received unexpected error:
    descriptors have cyclic dependency:...

I suspect when I tried to AddMessage (to the FileBuilder), as wrapped from protoreflect.MessageDescriptor, it drags the import path, and for some reason, some mis-registration happens in your package, but I don't know too much details to tell / fix.

Can you please help? Thanks.

--- Code snippet ---

// TransitiveClosure returns a compact set of related protobuf message,
// as the transitive closure of the desired protobuf message.
// All the symbols are packaged into a single FileDescriptorSet proto,
// to be consistent with common usage of delivering such dependency over network.
func TransitiveClosure[T proto.Message]() (*descriptorpb.FileDescriptorSet, error) {
    var zero T
    md := zero.ProtoReflect().Descriptor()

    // A map from the file name to the index (within fbs) for each FileBuilder instance.
    builderIdx := make(map[string]int)
    var fbs []*builder.FileBuilder

    knownSymbols := make(map[protoreflect.FullName]bool)
    queue := []protoreflect.Descriptor{md}
    for len(queue) > 0 {
        head := queue[0]
        queue = queue[1:]
        if knownSymbols[head.FullName()] {
            continue
        }
        if head.IsPlaceholder() { // Skip any missing files in the dependency graph.
            continue
        }

        pf := head.ParentFile()
        idx, found := builderIdx[pf.Path()]
        if !found { // We need another FileBuilder for a new .proto file in this transitive closure.
            idx = len(fbs)

            fbs = append(fbs, prepareFileBuilder(pf.Path(), string(pf.Package())))
            builderIdx[pf.Path()] = idx
        }
        knownSymbols[head.FullName()] = true

        switch descriptor := head.(type) {
        case protoreflect.MessageDescriptor:
            jmd, err := desc.WrapMessage(descriptor)
            if err != nil {
                return nil, err
            }
            mb, err := builder.FromMessage(jmd)
            if err != nil {
                return nil, err
            }
            fbs[idx].AddMessage(mb)

            fields := descriptor.Fields()
            for idx, len := 0, fields.Len(); idx < len; idx++ {
                switch f := fields.Get(idx); f.Kind() {
                case protoreflect.MessageKind:
                    if f.IsMap() {
                        if mv := f.MapValue(); mv.Kind() == protoreflect.MessageKind {
                            queue = append(queue, mv.Message())
                        }
                        continue
                    }
                    queue = append(queue, f.Message())
                case protoreflect.EnumKind:
                    queue = append(queue, f.Enum())
                }
            }
        case protoreflect.EnumDescriptor:
            jed, err := desc.WrapEnum(descriptor)
            if err != nil {
                return nil, err
            }
            eb, err := builder.FromEnum(jed)
            if err != nil {
                return nil, err
            }
            fbs[idx].AddEnum(eb)
        default:
            panic("unhandled?")
        }
    }

    var fds []*desc.FileDescriptor
    for idx := len(fbs) - 1; idx >= 0; idx-- {
        built, err := fbs[idx].Build()
        if err != nil {
            return nil, err
        }
        fds = append(fds, built)
    }
    return desc.ToFileDescriptorSet(fds...), nil
}

func prepareFileBuilder(path, packageName string) *builder.FileBuilder {
    fb := builder.NewFile(path)
    fb.SetProto3(true)
    fb.SetPackageName(packageName)
    return fb
}
jhump commented 2 months ago

I am pretty sure the problem is that you are mixing builders with already-built descriptors. When you do builder.FromMessage(jmd), you get a message builder that references already-built file descriptors for all of its dependencies. And then later, you separately add those same dependencies. The typical issue with this would be a case where you have multiple files with the same name -- the copy that comes from your builder plus the reference to the already-built file. But when a message field refers to other messages in the same file, it can also cause import issues where your file builder incorrectly is trying to import the already-built copy of the file with the same path.

I'm afraid, for what you're doing, you'd need to re-create each message descriptor so that the field's refer to other message builders, not to already-built descriptors.

I'll leave this issue open as a feature request: we could possibly add a new field to BuilderOptions that allows lenience. The safest case would be, if there are both builders and already-built descriptors referenced for the same element (identified by fully-qualified name), then always use the builder. But I think this could be a bit tricky to implement, and there are other edge cases that wouldn't be clearly handled (such as cases where the transitive graph includes more than one already-built descriptor for the same element but no builders, or cases where the graph includes multiple builders for the same element).

kevin-break commented 2 months ago

Thanks, jhump, yes, that confirms my suspicion.

Anyway, I've moved on and instead of building I simply stripped down un-necessary symbols from the original descriptor messages (not the descriptors themselves, as there they are interfaces and offer no Remove API), and it served my purposes.

Besides your proposed improvement (I agree it can be a bit tricky, but still useful to some extent), do you have better ideas on how to retrieve the minimal transitive enclosure (for a given symbol)? I would expect the usage is not so much limited.

jhump commented 2 months ago

I implemented something like that in the buf CLI code: https://github.com/bufbuild/buf/blob/v1.32.0/private/bufpkg/bufimage/bufimageutil/bufimageutil.go#L177 In particular, your sample code doesn't handle custom options, Any messages, or source code info.

That link above basically just uses descriptor protos, not full-blown descriptors. It creates its own minimal index of elements, just enough to do the trimming (which is hopefully cheaper than wrapping everything in full-blown descriptors, but admittedly not yet benchmarked...).

jhump commented 2 months ago

Note that link is for informational purposes only. You might fork that code and adapt it to your needs. It is not usable as is -- if you write a program that tries to import that package, it will panic during initialization. (Everything under github.com/bufbuild/buf/private is that way -- it's meant to be an "internal" package, but it's not named internal so that we can use it from some of our own stuff outside this repo.)