msgpack / msgpack-php

msgpack.org[PHP]
BSD 3-Clause "New" or "Revised" License
775 stars 120 forks source link

Packing/unpacking the bin format family #72

Open rybakit opened 8 years ago

rybakit commented 8 years ago

Packing/unpacking bin data fails for me:

var_dump(bin2hex(msgpack_pack("\x80")));

Expected: string(6) "c40180" Actual: string(4) "a180"

var_dump(bin2hex(msgpack_unpack("\xc4\x01\x80")));

Expected: string(2) "80" Actual: PHP Warning: [msgpack] (php_msgpack_unserialize) Parse error in ...

$ pecl list

INSTALLED PACKAGES, CHANNEL PECL.PHP.NET:
=========================================
PACKAGE VERSION STATE
msgpack 0.5.6   beta
$ php -v

PHP 5.5.9-1ubuntu4.14 (cli) (built: Oct 28 2015 01:34:46)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
    with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies
    with Xdebug v2.2.3, Copyright (c) 2002-2013, by Derick Rethans
Sean-Der commented 8 years ago

Hey @rybakit!

Good news first! Using msgpack at 105961a93599a5a12f2e33eaebe7b25ac69f1984 var_dump(bin2hex(msgpack_unpack("\xc4\x01\x80"))); gives me string(2) "80"

Bad news is that we are just packing your data as string/raw this. Even though this is labled raw it was renamed to str in the new spec (but has the same leading bits)? So the a1(101) you are seeing is just a one byte string!

All I did was diagnose stuff though, didn't give you any answers. I think everything is working correctly? Feel free to close if that answer is good, or I would be surprised if I was completely confused :) (I read this through about 5 times...)

rybakit commented 8 years ago

Hi @Sean-Der!

According to the spec:

String extending Raw type represents a UTF-8 string Binary extending Raw type represents a byte array

So I would expect "\x80" to be packed as a byte array (bin 8), not a string. I've also tested it with a python library which returns c40180, not a180.

Sean-Der commented 8 years ago

@rybakit

Right now we don't look at the string we are packing, so "\x80" is just handled like every other string. I will look at the Python library and see how they implemented packing (There isn't a dedicated binary type, so maybe they walk the string and decide based on its contents?)

Is this behavior breaking PHP <-> Python for you right now?

rybakit commented 8 years ago

There isn't a dedicated binary type, so maybe they walk the string and decide based on its contents?

They check if the string is Unicode before packing it:

...
elif isinstance(obj, unicode):
    _pack_string(obj, fp)
elif isinstance(obj, str):
    _pack_binary(obj, fp)
...

Is this behavior breaking PHP <-> Python for you right now?

I'm not affected by this right now, but I saw these tickets (#36, #53) were closed and assumed that the bin family is fully supported. I also wrote a pure php implementation of the msgpack protocol and tried to run my tests against this pecl extension and found that they fail on bin and ext formats.

Sean-Der commented 8 years ago

I really appreciate you making msgpack-php better just because! Sure needs some people who care about quality.. as I mess things up!

Let me look at what I can do with this! The only part that is rough with this is that check will increase runtime. All I have is a char * + int, so I will have to see what the right way to handle that.

thanks (also I owe you a merge and a bugfix already I should have time this week)

rybakit commented 8 years ago

@Sean-Der Maybe this can be used to detect utf8 strings? Also, you may consider adding a flag to enable/disable utf8/bin auto-detection. That will help to keep encoding fast for the cases when the string type is known beforehand.