protocolbuffers / protobuf

Protocol Buffers - Google's data interchange format
http://protobuf.dev
Other
65.39k stars 15.46k forks source link

Proto2 protobuf cannot handle enums with value higher than 31 during serialization in Python #10950

Closed tinix0 closed 1 year ago

tinix0 commented 1 year ago

What version of protobuf and what language are you using? Version: protoc 3.21.9 Language: Python - protobuf 4.21.9

OS: Windows 11

Interpreter: Python 3.11.0

What did you do?

  1. Create proto2 protobuf that contains Enum with values larger than 31
  2. Compile using protoc
  3. Instantiate the protobuf in code and set the enum field to value larger than 31
  4. Serialize the protobuf to string
  5. Deserialize the protobuf

What did you expect to see The deserialized protobuf has the enum field set to the value larger than 31

What did you see instead? The deserialized protobuf has the field clear (the serialized form does not contain the value at all).

Anything else we should know about your project / environment I can provide minimal reproduction example if desired, but it was pretty simple to reproduce. The issue was also reproduceable on protobuf lib versions as low as 4.21.3 and with grpcio-tools as low as 1.49.1. Lastly, switching the syntax to proto3 does fix the issue, however it is currently not viable for us.

acozzette commented 1 year ago

Thank you for the bug report. This sounds like a potentially serious bug. Would you mind providing the reproducible example?

haberman commented 1 year ago

Yes a repro would be very helpful here. I was not able to reproduce it:

// test.proto
syntax = "proto2";

enum E {
  ONE = 1;
  THIRTY_ONE = 31;
  THIRTY_TWO = 32; 
  ONE_HUNDRED = 100; 
}       

message M {
  optional E e = 1;
} 
# test.py
from test_pb2 import M, E
import unittest

class TestEnumValueSerialize(unittest.TestCase):
    def do_test_val(self, val):
        m1 = M(e = val)
        serialized = m1.SerializeToString()
        m2 = M()
        m2.ParseFromString(serialized)
        self.assertEqual(val, m2.e)

    def test_enum_values(self):
        self.do_test_val(E.ONE)
        self.do_test_val(E.THIRTY_ONE)
        self.do_test_val(E.THIRTY_TWO)
        self.do_test_val(E.ONE_HUNDRED)

if __name__ == '__main__':
    unittest.main()

Output:

$ venv/bin/python test.py --verbose
test_enum_values (__main__.TestEnumValueSerialize) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.000s

OK
tinix0 commented 1 year ago

Hello, here is the repro with output. Note that I have also seen the issue with holes in the numeric ranges, but the total number of values was still larger than 31.

// test.proto
syntax = "proto2";
package Test;

message Person {

    enum PhoneType {
      blaa = 0;
      HOME = 1;
      WORK = 2;
      a = 3;
      b = 4;
      c = 5;
      d = 6;
      e = 7;
      f = 8;
      g = 9;
      h = 10;
      i = 11;
      j = 12;
      k = 13;
      l = 14;
      m = 15;
      n = 16;
      o = 17;
      p = 18;
      q = 19;
      r = 20;
      s = 21;
      t = 22;
      u = 23;
      v = 24;
      w = 25;
      x = 26;
      y = 27;
      z = 28;
      aa = 29;
      ab = 30;
      ac = 31;
      ad = 32;
      ae = 33;

    }

    optional PhoneType TaskType = 1;
    optional string Name = 2;
  }
# main.py
import test_pb2

person = test_pb2.Person()
person.TaskType = 32
person.Name = "Name1"
serialized = person.SerializeToString()
personDe = test_pb2.Person()
print("a", person)
print("---")
personDe.ParseFromString(serialized)
print("b", personDe)
print("---")

person2 = test_pb2.Person()
person2.TaskType = 31
person2.Name = "Name2"
serialized2 = person2.SerializeToString()
personDe2 = test_pb2.Person()
print("a", person2)
print("---")
personDe2.ParseFromString(serialized2)
print("b", personDe2)

Ouput:

a TaskType: ad
Name: "Name1"

b Name: "Name1"

a TaskType: ac
Name: "Name2"

b TaskType: ac
Name: "Name2"
haberman commented 1 year ago

This must be a Windows-specific bug. When I try your repro on Linux, I get:

a TaskType: ad
Name: "Name1"

---
b TaskType: ad
Name: "Name1"

---
a TaskType: ac
Name: "Name2"

---
b TaskType: ac
Name: "Name2"

I suspect this is related to long being only 32-bits on Windows. I will investigate further.

ericsalo commented 1 year ago

I also cannot reproduce the bug on (64-bit) Linux.

tinix0 commented 1 year ago

I can also confirm that I cannot reproduce it inside WSL. It does indeed seem to be Windows specific (both machines I tried this on were either Windows 11 or Windows 10).

haberman commented 1 year ago

Can you print out len(serialized)? I suspect it will be non-zero, but want to verify.

(I don't have quick access to a Windows machine to verify).

tinix0 commented 1 year ago

len(serialized): 9 len(serialized2): 9 serialized in base64: 'CCASBU5hbWUx' serialized2 in base64: 'CB8SBU5hbWUy'

haberman commented 1 year ago

Great, thanks. It appears to be serializing correctly. When we parse it though, the decoder thinks the value (32) is not in the enum for some reason. In proto2, when you parse an unrecognized enum value, it goes into the unknown field set, which is what appears to be happening here.

The only question now is why the decoder thinks that 32 is an unknown value for this enum.

haberman commented 1 year ago

This is the code performing the check. The bug is likely here or nearby, but I can't see it: https://github.com/protocolbuffers/upb/blob/f45eeec625a1b5b4514683aba4b38d1864f0e211/upb/decode.c#L410-L440

In particular, this check looks correct to me. We cast 1ULL to unsigned long long, so it should be 64-bit even on Windows, which has 32-bit longs: https://github.com/protocolbuffers/upb/blob/f45eeec625a1b5b4514683aba4b38d1864f0e211/upb/decode.c#L437

tinix0 commented 1 year ago

Another helpful thing I can point out is that this is an regression introduced in 4.21.x. Protobuf 3.20.3 deserializes the message correctly.

haberman commented 1 year ago

Another helpful thing I can point out is that this is an regression introduced in 4.21.x.

That makes perfect sense, as 4.21.x introduced a completely rewritten native extension. More info here: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

ericsalo commented 1 year ago

I would be mildly surprised if the bug turned out to be in decode.c since that code has not changed much at all. I'm wondering whether this may be related to the mini descriptors.

haberman commented 1 year ago

It's true that decode.c has not changed much, but we also have much less usage on Windows than other platforms.

Keep in mind that we're releasing from the 21.x branch, which has not gotten any of the recent changes around MiniDescriptors: https://github.com/protocolbuffers/upb/tree/21.x/upb

haberman commented 1 year ago

I have a root cause and a fix for the bug. The problem was in the UPB_LIKELY() macro, which uses __builtin_expect(). The arguments to __buitin_expect() are long, so on Windows this truncates the arguments to 32 bits.

The fix is to cast the argument to bool:

--- a/upb/port_def.inc
+++ b/upb/port_def.inc
@@ -96,8 +96,8 @@

 /* Hints to the compiler about likely/unlikely branches. */
 #if defined (__GNUC__) || defined(__clang__)
-#define UPB_LIKELY(x) __builtin_expect((x),1)
-#define UPB_UNLIKELY(x) __builtin_expect((x),0)
+#define UPB_LIKELY(x) __builtin_expect((bool)(x),1)
+#define UPB_UNLIKELY(x) __builtin_expect((bool)(x),0)
 #else
 #define UPB_LIKELY(x) (x)
 #define UPB_UNLIKELY(x) (x)
ericsalo commented 1 year ago

A fix for this has just been pushed so I am closing.