golang / protobuf

Go support for Google's protocol buffers
BSD 3-Clause "New" or "Revised" License

Random 0xfeff / ZERO WIDTH NO-BREAK SPACE being added to returned string values #1623

Open agruetz opened 6 days ago

agruetz commented 6 days ago

This may be expected behavior; I am not sure, and I cannot find any information about it. In some cases my protobuf strings are returned with a ZERO WIDTH NO-BREAK SPACE (U+FEFF).

`message Work {
  Identifer id = 1;
  WorkType work_type = 2;
  string command = 3 [
    (google.api.field_behavior) = OPTIONAL,
    (grpc.gateway.protoc_gen_openapiv2.options.openapiv2_field) = {
      title: "Work Request Command"
      description: "Command to perform the work request."
    }
  ];
}

node, err := a.srvClient.client.GetWork(ctx, &npb.GetWorkReq{
    Hardware: &npb.HardwareInfo{MacAddr: a.cfg.host.macAddr, IpAddr: a.cfg.host.ipAddrs},
})
if err != nil {
    return err
}

for _, cmd := range node.Work {
    switch cmd.WorkType {
    case npb.WorkType_INSTALL:
        if a.cfg.agent.mode == dev {
            // TODO: fix the stutter (has to be fixed in the proto file)
            err = a.installPrimary(cmd.Id.Id, primaryDev)
            if err != nil {
                return err
            }
        } else {
            err = a.installPrimary(cmd.Id.Id, primary)
            if err != nil {
                return err
            }
        }
    case npb.WorkType_EXEC:
        cmdWithArgs := strings.Split(strings.TrimSpace(cmd.Command), " ")

        err = a.execCmd(cmd.Id.Id, cmdWithArgs[0], cmdWithArgs[1:]...)
        if err != nil {
            return err
        }
    default:
        //LOG BAD CMD TYPE
        //TODO LOG
        return fmt.Errorf("unknown work type: %s", cmd.WorkType)
    }
}`

I would expect cmd.Command not to contain a random ZERO WIDTH NO-BREAK SPACE.

Any insights would be appreciated.

puellanivis commented 6 days ago

You haven’t provided any error messages or examples of the text containing a ZWNBSP or where in the string.

However, if this is happening at the start of your string, then it is probably a byte-order mark (BOM): the text starts with U+FEFF, and since the byte-swapped value U+FFFE is defined as a noncharacter, a reader can tell UTF-16LE from UTF-16BE by the order of those two bytes. This is especially likely if the data is being pulled from lines of a Windows text file, like a .BAT file, since Windows editors are known to add a BOM to files saved as Unicode.
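To make the byte-order-mark idea concrete, here is a minimal Go sketch of how a leading U+FEFF appears in the common encodings. The function name `detectBOM` and its return values are made up for illustration and are not part of any library. It only checks UTF-8 and UTF-16, so a UTF-32LE mark would be reported as UTF-16LE.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectBOM reports which byte-order mark, if any, data starts with.
// Hypothetical helper for illustration only.
func detectBOM(data []byte) string {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8" // U+FEFF encoded as UTF-8
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "UTF-16BE" // U+FEFF, high byte first
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "UTF-16LE" // U+FEFF, low byte first
	default:
		return "" // no BOM recognized
	}
}

func main() {
	// A Go string is UTF-8, so a leading U+FEFF occupies three bytes.
	cmd := "\uFEFFls -la"
	fmt.Printf("% x\n", []byte(cmd)) // ef bb bf 6c 73 20 2d 6c 61
	fmt.Println(detectBOM([]byte(cmd))) // UTF-8
}
```

Since a Go `string` field decoded from protobuf holds UTF-8, the three bytes `ef bb bf` at the start are what to look for on the client side.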

agruetz commented 6 days ago

Sorry for the missing information. The data comes from a MySQL SELECT query. Essentially, there is a gRPC API server that node.GetWork calls, and the server returns this data.

I have confirmed that it is not being added inside the server. It is being added somewhere in the encoding, the transfer across the wire, or the subsequent decode on the client side.

Yes, I have been able to work around it by specifically stripping the 0xfeff character from the string, but it seems odd that it is there in the first place.
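For reference, a workaround of this kind could look like the sketch below. The helper name `stripBOM` and the sample command are assumptions for illustration, not the code from the actual agent. It also shows why the existing `strings.TrimSpace` call does not help: U+FEFF is a format character, not whitespace, so `TrimSpace` leaves it in place.

```go
package main

import (
	"fmt"
	"strings"
)

// bom is U+FEFF, ZERO WIDTH NO-BREAK SPACE, used as a byte-order mark.
const bom = "\uFEFF"

// stripBOM removes a single leading byte-order mark, if present.
// Hypothetical helper for illustration only.
func stripBOM(s string) string {
	return strings.TrimPrefix(s, bom)
}

func main() {
	raw := "\uFEFFls -la /tmp" // sample value, assumed for the example

	// TrimSpace does not treat U+FEFF as whitespace, so it stays.
	fmt.Println(strings.HasPrefix(strings.TrimSpace(raw), bom)) // true

	// Strip the BOM first, then split as before.
	cmdWithArgs := strings.Split(strings.TrimSpace(stripBOM(raw)), " ")
	fmt.Printf("%q\n", cmdWithArgs) // ["ls" "-la" "/tmp"]
}
```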

I also agree that this is likely the result of Byte-Order Marking because it is at the very start of the string.

What I find most odd is that it only happens sometimes; it is not every string. It almost feels as if it is being used as padding during the encode/decode for the wire transfer but is not being stripped off properly in all cases.

I am happy to provide more detail, code, or debug output; I just was not sure what would be helpful, or whether this was some known, expected behavior I was not aware of.

puellanivis commented 6 days ago

Protobuf doesn’t typically use any padding let alone 0xfeff specifically.
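As a rough illustration of why padding is unlikely, here is a hand-rolled sketch of how a `string` field is laid out on the wire: a tag varint (field number shifted left by 3, OR'd with wire type 2), a length varint, and then the raw UTF-8 bytes. This is not the library's encoder, and `encodeStringField` is a made-up name. It assumes Go 1.19 or later for `binary.AppendUvarint`.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeStringField hand-encodes one length-delimited protobuf field.
// Hypothetical helper for illustration; real code should use the
// generated types and proto.Marshal.
func encodeStringField(fieldNum int, s string) []byte {
	// Tag: field number << 3 | wire type 2 (length-delimited).
	buf := binary.AppendUvarint(nil, uint64(fieldNum)<<3|2)
	// Length of the payload in bytes.
	buf = binary.AppendUvarint(buf, uint64(len(s)))
	// The string bytes are copied as-is, with no padding.
	return append(buf, s...)
}

func main() {
	// Field 3 matches `string command = 3` in the Work message.
	fmt.Printf("% x\n", encodeStringField(3, "ls"))       // 1a 02 6c 73
	fmt.Printf("% x\n", encodeStringField(3, "\uFEFFls")) // 1a 05 ef bb bf 6c 73
}
```

In this layout the bytes `ef bb bf` only show up in the payload when they were already part of the string handed to the encoder.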

Have you tried looking at the raw MySQL query values directly? Maybe someone is copy-pasting in from a Windows text file somewhere? It can be a notoriously difficult character to notice because it’s zero-width, and thus might not seem to show up normally.
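One way to make the character visible on the Go side is to print values with the `%+q` verb, which escapes every non-ASCII rune. A small sketch, where `reveal` is a made-up helper name and the sample value is assumed:

```go
package main

import "fmt"

// reveal returns s as a quoted, ASCII-only Go string literal, so that
// invisible runes such as U+FEFF appear as escapes.
// Hypothetical helper for illustration only.
func reveal(s string) string {
	return fmt.Sprintf("%+q", s)
}

func main() {
	fmt.Println(reveal("\uFEFFls -la")) // "\ufeffls -la"
	fmt.Println(reveal("ls -la"))       // "ls -la"
}
```

Logging the value this way on the server right after the MySQL query, and again on the client after `GetWork` returns, would show at which step the character first appears.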

Could you share a short copy of an encoded Work message that triggers the issue? I maybe wouldn’t jump straight to copy-pasting an excerpt of the MySQL data for that message here, but it probably wouldn’t hurt either.