ubjson / universal-binary-json

Community workspace for the Universal Binary JSON Specification.

Unclear how to determine difference between 1 and 4 byte length types; 4-byte lengths should allow for lengths much longer (2**32-1 instead of 2**31-1) #35

Closed micolous closed 11 years ago

micolous commented 11 years ago

From the documentation, it's unclear how a parser is supposed to determine the difference between a 1 and 4 byte length type. The documentation states that valid ranges for 1-byte values are 0..254.

Additionally, the documentation states that the range of 4-byte lengths is 0..2,147,483,647, where it should probably be 0..4,294,967,295? Or maybe even 0..16,777,215 (a uint24 following a first byte of 0xFF)?

micolous commented 11 years ago

Oh, I see later in the documentation that it appears to be defined by the case of the type character. However, this isn't specified clearly under the "Data Format" header.

The issues about lengths still stand, though.
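For illustration, here is a sketch (in Python) of how a case-sensitive marker scheme might be parsed; the marker convention and struct formats are assumptions for the example, not taken from the draft:

```python
import struct

def read_length(buf):
    """Sketch: a lowercase type marker carries a 1-byte length and an
    uppercase one a 4-byte length, both big-endian and signed (an
    assumption for illustration, not the spec's reference code)."""
    marker = buf[0:1]
    if marker.islower():
        # 1-byte signed length follows the marker
        return struct.unpack('>b', buf[1:2])[0], 2  # (length, bytes consumed)
    # 4-byte signed length follows the marker
    return struct.unpack('>i', buf[1:5])[0], 5
```

Note the lengths come out signed either way, which is exactly the range question raised above.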

ghost commented 11 years ago

You are looking at the front page. Go to the Type Reference for the latest draft (Draft 9). Very confusing.

ghost commented 11 years ago

@infradig is correct -- I've rewritten the Type Ref and subpages, but not the front page.

micolous commented 11 years ago

OK, but it hasn't addressed my question about the 4-byte lengths. Shouldn't this allow up to 2**32-1 bytes of data (instead of 2**31-1)? Or is this "reserved" in case a length expansion is required?

ghost commented 11 years ago

Lengths are signed, though only positive values are used/acceptable.

kxepal commented 11 years ago

Prologue: @thebuzzmedia, I was told about you! (:

OK, but it hasn't addressed my question about the 4-byte lengths. Shouldn't this allow up to 2**32-1 bytes of data (instead of 2**31-1)? Or is this "reserved" in case a length expansion is required?

That's because the main idea was to have a single logic for processing integer markers. Since all integer markers are signed, a length is automatically a signed value too. Otherwise a driver would have to support two different logics for integer markers: one for lengths and one for values. While this looks weird at first sight, UBJSON cares not only about the simplicity of the format, but also about the simplicity of the code that implements it.
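A minimal sketch of this "single logic" idea, using illustrative markers (i/I/l/L for int8/16/32/64 — an assumption for the example, not necessarily the draft's exact marker set):

```python
import struct

# The same signed-integer reader is reused whether the number is a
# standalone value or the length of a string.
INT_MARKERS = {b'i': ('>b', 1), b'I': ('>h', 2), b'l': ('>i', 4), b'L': ('>q', 8)}

def read_int(buf, pos):
    fmt, size = INT_MARKERS[buf[pos:pos + 1]]
    value = struct.unpack_from(fmt, buf, pos + 1)[0]
    return value, pos + 1 + size

def read_string(buf, pos):
    # The length is decoded by the exact same routine as any value...
    length, pos = read_int(buf, pos)
    if length < 0:
        # ...so rejecting negative lengths is the parser's job, not the codec's.
        raise ValueError('negative length: %d' % length)
    return buf[pos:pos + length].decode('utf-8'), pos + length
```

The upside is one codec for all numbers; the downside, as discussed below, is that a length can decode as negative and must be rejected explicitly.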

micolous commented 11 years ago

That's because the main idea was to have a single logic for processing integer markers. Since all integer markers are signed, a length is automatically a signed value too.

Should there be support for unsigned datatypes in #34? I would argue that if you're trying to compact data, storing it only in signed form limits your range, and that it's not any harder to handle an unsigned datatype than a signed one.

ghost commented 11 years ago

@micolous Can you clarify "not any harder"? The issue is introducing all those new markers that symbolize unsigned versions of the same numeric types.

I realize this is an optimization, but not for the 80% case.

micolous commented 10 years ago

Lengths are signed, though only positive values are used/acceptable.

wat. really?

Can you clarify "not any harder"? The issue is introducing all those new markers that symbolize unsigned versions of the same numeric types.

Borrowing format codes from Python's struct library:

if input[0] == 'H':  # uint16
  yield struct.unpack('!H', input[1:3])[0]
  input = input[3:]
elif input[0] == 'h':  # int16
  yield struct.unpack('!h', input[1:3])[0]
  input = input[3:]
# ...
else:
  # unknown value type
  raise ValueError('unknown value type %r' % input[0])

It just adds another case to the chain.

kxepal commented 10 years ago

Hey, @micolous

Lengths are signed, though only positive values are used/acceptable.

wat. really?

Yes, really. Since you're a Python guy, take a look at the simpleubjson implementation. Compare the number decoding with the string decoding: it's the same, apart from nicer exception messages and my speed hunting. In the common case you can replace all the code on lines 150-169 with a single next_tlv call. If length markers were unsigned, you couldn't do this.

micolous commented 10 years ago

Ah that's a bit better than my shoddy code above. :)

I can see how that represents an optimisation, but I don't understand why you would want a string that has a length of less than 0 bytes. .NET does the same thing (and not just for strings), and it seems broken.

Or it may be good to allow both (and even allow unsigned values for string lengths, because they actually make more sense than having just signed ones), and have some catch-all assertion that the length of a string is at least 0 bytes.

ghost commented 10 years ago

@micolous It is an (intentionally) leaky abstraction of how integer values are defined.

As @kxepal pointed out, we want the code simple (reading and writing); the side effect of which is that all numeric types, regardless of where they are used, are read/written the same way... even when it is a length of a datatype.

The alternative, defining slightly different numeric reading/writing rules and code IF the number is a length as opposed to a value itself... sucks, and leads to mental complexity in understanding the spec as well as complexity in the code.

We actually used to have this. Draft 8 corrected it.

micolous commented 10 years ago

You've got a danger point though -- the format allows you to have less-than-zero length array/string types. This is bad. It is a specification that can lead a programmer to make mistakes.

For example, that Python implementation above, line 169 says:

return tag, length, self.read(length)

Where length is defined by unpacking the binary length value before it.

So what happens if I have a string with a length of -1? The documentation for file.read() says:

When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory. Otherwise, at most size bytes are read and returned.

So the effect of this is when deserialising UBJ-format data from a file, it will then read the entire file until EOF, and fail to read the rest of the data serialised in the file. Some other implementations of file-like objects in Python may have different behaviours.
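The hazard can be reproduced with an in-memory stream (a sketch, not code from any of the implementations discussed):

```python
import io
import struct

# A 4-byte length of -1 passed straight to read() swallows the rest of
# the stream instead of failing.
payload = struct.pack('>i', -1) + b'everything after the bad length'
stream = io.BytesIO(payload)

length = struct.unpack('>i', stream.read(4))[0]  # decodes to -1
swallowed = stream.read(length)  # negative size: read() consumes to EOF
```

Here `swallowed` ends up holding every remaining byte, and anything serialised after the corrupt length is lost.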

This problem isn't unique to Python -- other languages give special meanings to negative parameters, and often different meanings to different negative values.

It doesn't appear that the Java implementation has this problem, as there's a specific assertion and exception raised in this case.

It could be that another way to solve this is to just make everything unsigned, and have a special datatype for negative numbers. That way the array and string handling code doesn't need to care whether lengths are signed or not.

On the encoding of numbers, as this specification is essentially trying to compete in the same space as protobuf (which has many language implementations), it would be good to have a look at how they decided to encode numbers.

They have varints which essentially encode the entire number as base128, using the most compact encoding they can. They use a zig-zag algorithm to encode signed datatypes using the least significant bit of the number (rather than most significant bit), such that a negative value can always be packed.

This means that instead of having an if-else-if chain looking at the length and trying to decide on the best way to encode it, you flip that sign bit around, then look at how many leading 0-bits you can trim from the start of the packed integer. You then encode the remaining bits in base-128, using the MSB of each byte to indicate whether more bytes follow. You can even end up with things like 3-, 5- or 27-byte integers, which the traditional fixed-width datatypes do not allow, and it always uses the smallest amount of space.
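The scheme described above can be sketched in a few lines (this mirrors protobuf's approach; nothing here is part of any UBJSON draft):

```python
def zigzag64(n):
    # Interleave signed values onto unsigned ones: 0,-1,1,-2,2 -> 0,1,2,3,4,
    # so small magnitudes (positive or negative) get short encodings.
    return (n << 1) ^ (n >> 63)

def encode_varint(u):
    # Base-128 encoding; the MSB of each byte flags "more bytes follow".
    out = bytearray()
    while True:
        byte = u & 0x7f
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)
```

For example, -1 zig-zags to 1 and encodes in a single byte, while a fixed-width signed encoding would need the full width to represent it.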

ghost commented 10 years ago

@micolous I appreciate you digging into the details like that; definitely an interesting side effect in the Python case.

The spec clarifies that negative values for lengths are illegal; it is my expectation that the different implementations would enforce that requirement (like the Java impl does).

In the cases where they don't, hitting an invalid value would have undefined/broken behavior. I felt this was a reasonable assumption and the win here is ease of reading/writing.

I have looked at varints before (many times) and I personally dislike the complexity quite a bit which is why I didn't bring it over to ubjson; also Smile already implements the zig-zag encoding if that is what you wanted: http://wiki.fasterxml.com/SmileFormatSpec#ZigZag_encoding_for_VInts

This means that instead of having a if-else if chain looking at the lengths and trying to decide on the best way to encode it

We don't currently have this issue; the signed value is just written out, unless I missed another point you were making?

kxepal commented 10 years ago

@micolous

You've got a danger point though -- the format allows you to have less-than-zero length array/string types. This is bad. It is a specification that can lead a programmer to make mistakes.

Nobody's perfect. Write unit tests. Be sure of your code's logic (:

For example, that Python implementation above, line 169 says:

return tag, length, self.read(length)

Where length is defined by unpacking the binary length value before it.

So what happens if I have a string with a length of -1?

In my case it's impossible, because all length-unpacking operations use unsigned logic.

Everything below is about solving theoretical mistakes that a programmer may or may not make. With any language and any data format you can shoot yourself in the foot, but that's why we're programmers - we know how to avoid such mistakes (:

micolous commented 10 years ago

We don't currently have this issue; the signed value is just written out, unless I missed another point you were making?

Yup, I'm picking on the Python implementation again, as Python has only two integer types (int and long), where int is a sint32 or sint64 depending on the processor, and long is arbitrary-precision (very, very long)... This gets more interesting with things like decimal, which allows fixed- and floating-point numbers based on the sort of maths you would learn in school.

So it has an if-elif chain to attempt to map integers to the smallest value it can automatically. This isn't so much the case in languages like C and Java where you have more rigid integer types, but higher level scripting languages will need to compact numbers as much as they can.
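A sketch of such an if-elif chain (the marker letters are illustrative assumptions, not taken from the draft):

```python
import struct

def pack_smallest_int(n):
    # Pick the most compact fixed-width signed marker that can hold n;
    # this is the chain a dynamic language ends up writing, since its
    # native int type carries no width information.
    if -2**7 <= n < 2**7:
        return b'i' + struct.pack('>b', n)
    if -2**15 <= n < 2**15:
        return b'I' + struct.pack('>h', n)
    if -2**31 <= n < 2**31:
        return b'l' + struct.pack('>i', n)
    if -2**63 <= n < 2**63:
        return b'L' + struct.pack('>q', n)
    raise OverflowError('no fixed-width marker fits %d' % n)
```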

This will also affect JavaScript, where there is only a 64-bit signed floating-point number type, plus multiple third-party implementations of fixed-precision numbers. Perl has an architecture-dependent integer type, which is upgraded to floating point. PHP attempts to copy Perl's behaviour, but fails miserably.

If I understand correctly, the target audience of UBJ is in web application land. Like it or not, that involves a lot of JavaScript, which means automatic resizing is a consideration, if data is being sent from the browser back to the server.

In my case it's impossible due to all length unpacking operations are uses unsigned logic.

Sorry, then I shouldn't pick on you for that! Instead I should pick on the fact that it's different logic from what the spec and the Java implementation want you to do, and exactly what @thebuzzmedia wants to avoid -- different code for handling sequence lengths from the code handling regular integers!

There's also no check to make sure that the amount of data got back from that file.read call is correct, because there's no guarantee that you'll get that amount of data back, eg:

>>> f = open('/etc/passwd', 'rb')
>>> len(f.read(2**24))
1789

There's a whole bunch of lower-level conditions that change this (such as EOF, or a socket not having any more data in its buffer), but the language of the Python documentation for file.read() is pretty clear: "at most size bytes are read and returned".
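A common guard against short reads is a loop that keeps calling read() until the full count arrives, failing loudly at EOF (a sketch, not code from simpleubjson or any other implementation mentioned here):

```python
import io

def read_exactly(f, size):
    # Loop until `size` bytes arrive; a short read just means "call
    # read() again", while an empty read means the stream is truncated.
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = f.read(remaining)
        if not chunk:
            raise EOFError('stream truncated, %d bytes missing' % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)
```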

Anything below is about solving theoretical programmers mistake that he can do or can not. With any language and any data format you may hit your leg by yourself, but that's why we're programmers - we know how to avoid such mistakes (:

I agree that we can make mistakes regardless. However the point of a specification is to try and not leave things open to interpretation, not leave things undefined, and make it as easy as possible to do it right. There should be only one way that it is logically possible to implement the specification, and that is the right way.

Unfortunately the intent of the Java implementation (which becomes reference) about re-using integer parsing code in the sequence-type parser is lost in the specification.

kxepal commented 10 years ago

I agree that we can make mistakes regardless. However the point of a specification is to try and not leave things open to interpretation, not leave things undefined, and make it as easy as possible to do it right. There should be only one way that it is logically possible to implement the specification, and that is the right way.

If the specification clearly defines something, it isn't open to free interpretation anymore (hey, Riyad! looks like the negative string size point is missing). Anyway, don't you think rejecting a negative string length follows logically anyway? I don't think it can be interpreted the opposite way -- as writing N bytes instead of reading the same amount (:

There's also no check to make sure that the amount of data got back from that file.read call is correct, because there's no guarantee that you'll get that amount of data back

Sure. Also, you can't be sure your UBJ data isn't corrupted or forged. As you noted, in simpleubjson I don't verify string lengths - it's a rare case where you can forge one without breaking the parser, but it's still an error. However, this sits on the edge between specification and implementation. I feel we need to define some guidelines and standardise ubjson library behaviour for the edge cases.

micolous commented 10 years ago

Anyway, don't you think that negative string length is logically accepted?

What? I don't understand. The Python implementation reads them all as unsigned values, by your own admission. As much as I point out that it's wrong per the specification, I believe it's a good thing to do.

However one of the desired outcomes is to be able to share that code for parsing integers with parsing sequence type lengths, in which case it needs to be able to:

Sure. Also, you may not be sure that your UBJ data is not corrupted or forged.

Data authenticity is totally out of the scope here. There are reasons for reading from a file-like object where it would return a number of bytes that is not the amount requested, and this does not indicate a forgery -- it could be something as simple as a congested network link. It does mean that the .read method needs to block in order for the parser to function correctly.

However, in the case of hitting EOF, you would want to know that the file is possibly truncated. It's knowable whether the file is truncated, so why not report that and fail loudly, instead of covering it up when we can check?