Headers keys are always raw bytes and never strings when running on python 3

Gyllsdorff commented 9 years ago

This might be intentional but in that case it should probably be mentioned somewhere in the documentation.

Right now we have to remember to put a b before the string key every time we try to access a header value while running python 3.

message.properties.get('headers').get(b'foo').decode()

The same thing applies to the message.properties values. They are always raw bytes and never Strings even if the field description is short_str

The problem seems to be in pamqp decode.py:370, _maybe_utf8().

When _maybe_utf8 is called while running python 3 it will always return bytes and never str or unicode. Since _maybe_utf8 is used while decoding table keys it also causes the header keys to be bytes.

gmr commented 9 years ago

I can try and coerce bytes into a str, which would leave the key as bytes if the key is utf-8 encoded. Thoughts? I don't like having the mix there though. The key change for Python3 was to use bytes everywhere, which seems like the "right thing" to do. But since I'm not 100% in Python3 land yet, I'm not opinionated on what's right and wrong there.

Gyllsdorff commented 9 years ago

Well, the big change in python 3 is that content is either text or data while in python 2 you had str which was 8-bit data which could be interpreted as a string and unicode which was a series of codepoints.

In python 3 a str is a sequence of Unicode code points that might have been decoded from a series of bytes and bytes are a sequence of single bytes. This means that the b'' syntax only works for characters which unicode codepoint is < 128

# Works
b'foo'

# Does not work
b'foo ö'

The AMQP specification never really mentions any encodings, the only releveant part is on page 35. It looks like short strings should always be UTF-8 encoded.

AMQP strings are variable length and represented by an integer length followed by zero or more octets of
data. AMQP defines two native string types:
- Short strings, stored as an 8-bit unsigned integer length followed by zero or more octets of data. Short
strings can carry up to 255 octets of UTF-8 data, but may not contain binary zero octets.
- Long strings, stored as a 32-bit unsigned integer length followed by zero or more octets of data. Long
strings can contain any data

Unfortunatly it does not mention the encoding of long strings or header keys/values but if the headers are not valid UTF-8 then RabbitMQs "Header" routing seems to break.

There have been a few discussion on the rabbitmq mailing list Ascii text for strings? and Unicode Characters In AMQP Headers? and the consesus appears to be that Strings should be utf-8 encoded or ASCII.

My advice is that if a field claims to be short_str then try to decode it to a string using utf-8 encoding. If it claims to be bytes then do nothing to it.

The RabbitMQs java client uses Map<String, Object> as the header container.

gmr commented 9 years ago

Your understanding matches mine, and the spec is clear re short-strings and field table keys.

For field table keys, they are always decoded as utf-8 and will always be bytes.

For field table values, RabbitMQ only supports long-strings and thus there is no decoded/encoding done except if a str value is passed in, in Python3, in which case I try and convert it to bytes using ascii as the codec, as to be as unopinionated as to what the bytes are as possible.

All that being said, aside from the new behavior in pamqp 1.6.0 with header field values not applying any logic to them, I do not think there is any action to take on this ticket. Do you disagree?

Gyllsdorff commented 9 years ago

I understand. It's to bad that the specifications never specifies that encoding.

Unfortunately, this is a critical issue for us.

We have several producers that sends in dictionaries/tables in the headers which we sometimes transforms into JSON and sends the data to REST endpoints. The 'bytes as dictionary keys' means that pythons json.dumps crashes.

This also means that if we copy a dictionary header from the incoming message to a outgoing message body rabbitpy will crash when it tries to auto-serialize it. Since both the keys and the values of the dictionary are bytes we have to rebuild the entire datastructure just to decode the keys.

I will run a few a few tests on our Java, nodejs, ruby and php rabbitmq clients and see how they handle the header keys.

You can close this if you feel there is nothing more we can do about this issue.

gmr commented 5 years ago

I believe this issue is addressed with newer versions of pamqp.

gmr / rabbitpy

Headers keys are always raw bytes and never strings when running on python 3 #51