golang / protobuf

Go support for Google's protocol buffers
BSD 3-Clause "New" or "Revised" License
9.74k stars 1.58k forks source link

customizible encoder to hide/redact sensitive information #1541

Open ofen opened 1 year ago

ofen commented 1 year ago

Is your feature request related to a problem? Please describe. We need ability to hide sensitive information during pb to json encoding.

Describe the solution you'd like Customizible encoder that will allow to work with options (see https://go.dev/blog/protobuf-apiv2).

Describe alternatives you've considered Make most of internal parts importable to minimize copy-paste.

Additional context None

lfolger commented 1 year ago

Hi ofen,

Sorry for the late reply.

I think to design a generic solution for such a problem is a large task and requires a lot of design work to make sure it works properly for all cases. Unfortunately, I don't think we have the capacity to design and implement such a solution at the moment.

It might be possible for you to implement the Redact function in the linked proposal yourself and use it in your code base in a wrapper around the protojson.Marshal.

func marshalToJson(m proto.Message) ([]byte, err) {
  // m = proto.Clone(m) // in case you cannot modify `m`
  redact(m)
  return protojson.Marshal(m)
}

If you do not want to modify the message, you might be able to proto.Clone() it before redacting the information.

I'm not sure I understand what you mean by the alternatives you described. To which internal parts would you need access to implement this?

ofen commented 1 year ago

Hi, sorry for late reply.

Protobuf redaction with following marshaling performs poorly due to double iteration over nested elements.

Internal part that could be used for redaction functionality, but lacks public interface to modify it's behaviour: https://github.com/protocolbuffers/protobuf-go/blob/b8fc7706010499f46982c883add4351b12e30c0b/encoding/protojson/encode.go#L259

lfolger commented 1 year ago

I agree that the double iteration approach is more expensive than if there were hooks directly in the encoder. If performance is critical to you, json might not be the best choice for you anyway (although I don't know your complete setup and the constraints you are working with).

Do you have an idea how large the performance overhead is/would be? Did you do an experiment to find out by how much the performance could be improved if hooks were available? You could do such an experiment by modifying the package locally and running some benchmarks.

While double iteration can be cause performance issues, it is generally fairly fast. Are the messages you are dealing with deeply nested and have many fields? Are there other reasons that make the iteration expensive?

All that said, I want to be honest and I don't think we have the capacity to design this API extension in the near future. However, you are welcome to contribute such a feature. Either post your ideas here or provide a proof of concept change with the extensions you would like to see and we can evaluate this during the review.

Olex1313 commented 1 year ago

@lfolger, Hello, I've came across this issue with same questions as the @ofen. My proposal is to extract the visitFunction and make it default visitFunction for MarshalOptions. So it will look like this:

func WriteJsonField(fd protoreflect.FieldDescriptor, v protoreflect.Value) bool {
        // Current logic from line 228
}

type MarshalOptions struct {
    pragma.NoUnkeyedLiterals
        // all-the-current-options fileds
        VisitFn VisitField
}

// Sample usage
func writeAndClear(fd protoreflect.FieldDescriptor, v protoreflect.Value) bool {
    clear(fd, v)
    return WriteJsonField(fd, v)
}
opt := protojson.MarshalOptions{VisitFn: writeAndClear}

If it looks good enough I can make a PR with the implementation.

However if it is too open realisation or too customisable, I can figure out something else

puellanivis commented 1 year ago

Would writeAndClear even work without mutating the original? Or otherwise, wouldn’t writeAndClear clear the value and then output it? So “clearAndWrite`?

Olex1313 commented 1 year ago

The clearAndWrite sounds better, didn't think about it :)

About the mutating the original object, I think it is up to library user to mutate the values so general usage would look like

clonedMessage := myMessage.clone()
opt := protojson.MarshalOptions{VisitFn: clearAndWrite}

logger.logStructured(opt.Marshall(clonedMessage)
lfolger commented 1 year ago

@Olex1313 this would require us to exposse the encoder because it is used in the code 228+.

This is a quite large change and I don't think we can do this right now without sufficient evidence that it has a significant impact.

Again, if you can share some pprof profiles or benchmarks that show the difference, it would help us to evaluate such a change.

Olex1313 commented 1 year ago

@lfolger, you are right, I'm writing clone-based solution right now, I will provide benchmarks when I finish to see if it is really that necessary, but for repeated fields or maps I think it would be great to have a cheap masking option

yonesko commented 12 months ago

@Olex1313 have you started the project? мне тоже нужно)

Olex1313 commented 12 months ago

@ofen, sorry for late answer, been quiet busy. Sample gist with proto clearup: https://gist.github.com/Olex1313/88ff52b1f4cad1af15c6733b4553b4cb As for benchmarks I'll post it today evening, after switching from work to personal laptop, right now can say that clearing takes the same time as the marshalling, where for both the most time-consuming is the iterating over proto fields

Mahes2 commented 9 months ago

Hi @Olex1313 do you mind to share the progress?

Recently i'm looking for a solution of this issue. I also try to do some research and update. If you guys don't mind i can try to update, but need your help to review it.

My solution is that:

  1. adding HideSensitiveMessage bool into MarshalOption struct
  2. updating the RangeFields into:

    order.RangeFields(fields, order.IndexNameFieldOrder, func(fd protoreflect.FieldDescriptor, v protoreflect.Value) bool {
        name := fd.JSONName()
        if e.opts.UseProtoNames {
            name = fd.TextName()
        }
    
        if fd.IsSensitiveField() && e.opts.HideSensitiveMessage {
            return true
        }
    
        if err = e.WriteName(name); err != nil {
            return false
        }
        if err = e.marshalValue(v, fd); err != nil {
            return false
        }
        return true
    })
  3. this step in on_progress, adding new tag is_sensitive_field for the proto file

please let me know if this update is not suitable or can have issue. cc @puellanivis

Olex1313 commented 9 months ago

@Mahes2, I can not share the benchmarks (did it for work), but as the @puellanivis said it's very complex design issue in protobuff serialization, and this doesn't go well with the purpose of Proto to JSON, because it works only in one way, but obviously you can't convert json->proto afterwards.

I've shared a gist https://gist.github.com/Olex1313/88ff52b1f4cad1af15c6733b4553b4cb, it contains sample approach to clearing the message for logs/etc.

For implementation in https://github.com/golang/protobuf I think it should be a third-party library, but not the part of it

Mahes2 commented 9 months ago

I see. but is it ok if i give contribution to add new tag and using the approach i have given above?

For implementation in https://github.com/golang/protobuf I think it should be a third-party library, but not the part of it

why it should be a 3rd party library?

Olex1313 commented 9 months ago

@Mahes2

I see. but is it ok if i give contribution to add new tag and using the approach i have given above?

It's up to maintainers, if they are ok with such approach I would like to help w code/review

why it should be a 3rd party library?

As @lfolger said upwards, designing custom deserialization api is out of context of encoder api, because it provides a simple way of ser/des of proto objects. Hiding sensible data is not the single feature that should be provided, and api should be extendable for more than that. My personal opinion that if you really have to hide data in serialized protos you can even use a encoding/json package with json:"-" annotation, or clear proto message and there are more ways how it can be implemented. The main question is should this really be part of golang/protobuf package, because it is kinda out of context and more specific usage.

ofen commented 9 months ago

I think implementing redaction logic inside protojson serialiser is bad idea, but it still good idea to have some internal parts accessible/public to let developers build their custom logic upon it.

puellanivis commented 9 months ago

I see. but is it ok if i give contribution to add new tag and using the approach i have given above?

Unfortunately, this is—as I said in your first issue—not going to be accepted. This package needs to stick fairly strictly to the protobuf standard, and this feature is definitely something that we don’t want to walk into. (Even adding the json tag itself is a well-known mis-step of the package, because it was just so easy to add them, and get JSON, right? And once protobuf actually established a proper JSON mapping, it was entirely incompatible with the new JSON mapping.)

Mahes2 commented 9 months ago

Hi @puellanivis can you share me any doc or blog that has explaination about the protobuf standard? because i still don't know which standard that you are talking about.

Also, is the proper JSON mapping already in progress?

Finally, i thank you guys for giving the tips on how to handle the sensitive message. I did a little update with @Olex1313's suggestion and add it to my libs.

I have made a PR and it's also included with the benchmark Really appreciate it if you guys @puellanivis @Olex1313 @ofen @lfolger can help me review it as well. thank you

https://github.com/Mahes2/go-libs/pull/9/files

puellanivis commented 9 months ago

https://protobuf.dev/programming-guides/proto3/ is the proto3 standard.

The proper JSON mapping is already complete, and done. It was done for v1 API with jsonpb and for the v2 API in protojson.