python / cpython

The Python programming language
https://www.python.org

Wire protocol encoding for the JSON module #64036

Open ncoghlan opened 10 years ago

ncoghlan commented 10 years ago
BPO 19837
Nosy @warsaw, @terryjreedy, @gpshead, @ncoghlan, @pitrou, @vstinner, @kdwyer, @ezio-melotti, @merwok, @socketpair, @vadmium, @serhiy-storchaka, @jleedev

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'library']
title = 'Wire protocol encoding for the JSON module'
updated_at =
user = 'https://github.com/ncoghlan'
```
bugs.python.org fields:
```python
activity =
actor = 'Clay Gerrard'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'ncoghlan'
dependencies = []
files = []
hgrepos = []
issue_num = 19837
keywords = []
message_count = 26.0
messages = ['204764', '204765', '204799', '204805', '204811', '204864', '204873', '204904', '204939', '204960', '204963', '204976', '204978', '205023', '205415', '205416', '205530', '205531', '271700', '271701', '271775', '271776', '271778', '272726', '275617', '289203']
nosy_count = 16.0
nosy_names = ['barry', 'terry.reedy', 'gregory.p.smith', 'chrism', 'ncoghlan', 'pitrou', 'vstinner', 'kdwyer', 'ezio.melotti', 'eric.araujo', 'cvrebert', 'socketpair', 'martin.panter', 'serhiy.storchaka', 'jleedev', 'Clay Gerrard']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue19837'
versions = ['Python 3.5']
```

ncoghlan commented 10 years ago

In the Python 3 transition, we had to make a choice regarding whether we treated the JSON module as a text transform (with load[s] reading Unicode code points and dump[s] producing them), or as a text encoding (with load[s] reading binary sequences and dump[s] producing them).

To minimise the changes to the module API, the decision was made to treat it as a text transform, with the text encoding handled externally.

This API design decision doesn't appear to have worked out that well in the web development context, since JSON is typically encountered as a UTF-8 encoded wire protocol, not as already decoded text.

It also makes the module inconsistent with most of the other modules that offer "dumps" APIs, as those *are* specifically about wire protocols (Python 3.4):

>>> import json, marshal, pickle, plistlib, xmlrpc.client
>>> json.dumps('hello')
'"hello"'
>>> marshal.dumps('hello')
b'\xda\x05hello'
>>> pickle.dumps('hello')
b'\x80\x03X\x05\x00\x00\x00helloq\x00.'
>>> plistlib.dumps('hello')
b'<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n<plist version="1.0">\n<string>hello</string>\n</plist>\n'

The only other module with a dumps function that (like the json module) returns a string is the XML-RPC client module:

>>> xmlrpc.client.dumps(('hello',))
'<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'

And that's nonsensical, since that XML-RPC API *accepts an encoding argument*, which it now silently ignores:

>>> xmlrpc.client.dumps(('hello',), encoding='utf-8')
'<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'
>>> xmlrpc.client.dumps(('hello',), encoding='utf-16')
'<params>\n<param>\n<value><string>hello</string></value>\n</param>\n</params>\n'

I now believe that an "encoding" parameter should have been added to the json.dump API in the Py3k transition (defaulting to UTF-8), allowing all of the dump/load APIs in the standard library to be consistently about converting to and from a binary wire protocol.

Unfortunately, I don't have a solution to offer at this point (since backwards compatibility concerns rule out the simple solution of just changing the return type). I just wanted to get it on record as a problem (and internal inconsistency within the standard library for dump/load protocols) with the current API.
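
For illustration (my example, not part of the original report), the asymmetry shows up as soon as JSON output has to cross a bytes boundary: unlike the marshal/pickle/plistlib calls above, the json calls need an explicit encode/decode step.

```python
import json

payload = {"hello": "world"}

# json.dumps() returns str, so bytes-oriented code (sockets, binary
# files, WSGI) needs an explicit encode step that the other wire
# protocol modules never require:
data = json.dumps(payload).encode('utf-8')

# ...and the inverse on the receiving side:
obj = json.loads(data.decode('utf-8'))
assert obj == payload
```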

ncoghlan commented 10 years ago

The other simple solution would be to add <name>b variants of the affected APIs. That's a bit ugly though, especially since it still has the problem of making it difficult to write correct cross-version code (although that problem is likely to exist regardless).

pitrou commented 10 years ago

Still, JSON itself is not a wire protocol; HTTP is. http://www.json.org states that "JSON is a text format" and the grammar description talks about "UNICODE characters", not bytes. The ECMA spec states that "JSON text is a sequence of Unicode code points".

RFC 4627 is a bit more affirmative, though, and says that "JSON text SHALL be encoded in Unicode [sic]. The default encoding is UTF-8".

Related issues:

The other simple solution would be to add <name>b variants of the affected APIs.

"dumpb" is not very pretty and can easily be misread as "dumb" :-) "dump_bytes" looks better to me.

serhiy-storchaka commented 10 years ago

I propose closing this issue as a duplicate of bpo-10976.

ncoghlan commented 10 years ago

Not sure yet if we should merge the two issues, although they're the serialisation and deserialisation sides of the same problem.

Haskell seems to have gone with the approach of a separate "jsonb" API for the case where you want the wire protocol behaviour; such a solution may work for us as well.

pitrou commented 10 years ago

I'm -1 for a new module doing almost the same thing. Let's add distinct APIs in the existing json module.

ncoghlan commented 10 years ago

The problem with adding new APIs with different names to the JSON module is that it breaks symmetry with other wire protocols. The quartet of module level load, loads, dump and dumps functions has become a de facto standard API for wire protocols.

If it wasn't for that API convention, the status quo would be substantially less annoying (and confusing) than it currently is.

The advantage of a separate "jsonb" module is that it becomes easy to say "json is the text transform that dumps and loads from a Unicode string, jsonb is the wire protocol that dumps and loads a UTF encoded byte sequence".

Backporting as simplejsonb would also work in a straightforward fashion (since one PyPI package can include multiple top level Python modules).

The same approach would also extend to fixing the xmlrpc module to handle the encoding step properly (if anyone was so inclined).

pitrou commented 10 years ago

The problem with adding new APIs with different names to the JSON module is that it breaks symmetry with other wire protocols. The quartet of module level load, loads, dump and dumps functions has become a de facto standard API for wire protocols.

Breaking symmetry is far less silly than having a second module doing almost the same thing, though.

The advantage of a separate "jsonb" module is that it becomes easy to say "json is the text transform that dumps and loads from a Unicode string, jsonb is the wire protocol that dumps and loads a UTF encoded byte sequence".

This is a terribly lousy design.

serhiy-storchaka commented 10 years ago

I agree that adding a new module is a very bad idea.

I think that reviving the encoding parameter is the least wrong way. json.dumps() should return bytes when the encoding argument is specified and str otherwise. json.dump() should write binary data when the encoding argument is specified and text otherwise. This is not a perfect design, but it has precedents in the XML modules.
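
As a rough sketch of that behaviour (my illustration of the proposal, not an actual patch), a polymorphic dumps() might look like this:

```python
import json

def dumps(obj, encoding=None, **kwargs):
    # Hypothetical wrapper: str by default, bytes when an encoding
    # argument is supplied (mirroring the proposal above).
    text = json.dumps(obj, **kwargs)
    return text.encode(encoding) if encoding is not None else text

assert dumps({"a": 1}) == '{"a": 1}'
assert dumps({"a": 1}, encoding='utf-8') == b'{"a": 1}'
```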

ncoghlan commented 10 years ago

Changing return type based on argument *values* is still a bad idea in general.

It also makes it hard to plug the API in to generic code that is designed to work with any dump/load based serialisation protocol.

MvL suggested a json.bytes submodule (rather than a separate top level module) in the other issue and that sounds reasonable to me, especially since json is already implemented as a package.

pitrou commented 10 years ago

MvL suggested a json.bytes submodule (rather than a separate top level module) in the other issue and that sounds reasonable to me, especially since json is already implemented as a package.

I don't really find it reasonable to add a phantom module entirely for the purpose of exposing an API more similar to the Python 2 one. I don't think this design pattern has ever been used before.

If we add a json_bytes method, it will be simple enough for folks to add the appropriate rules in their compat module (and/or for six to expose it).

ncoghlan commented 10 years ago

The parallel API would have to be:

json.dump_bytes
json.dumps_bytes
json.load_bytes
json.loads_bytes

That is hardly an improvement over:

json.bytes.dump
json.bytes.dumps
json.bytes.load
json.bytes.loads

It doesn't need to be documented as a completely separate module, it can just be a subsection in the json module docs with a reference to the relevant RFC.

The confusion is inherent in the way the RFC was written; this is just an expedient way to resolve it: the json module implements the standard, the bytes submodule implements the RFC.

"Namespaces are a honking great idea; let's do more of those"

pitrou commented 10 years ago

The parallel API would have to be:

json.dump_bytes
json.dumps_bytes
json.load_bytes
json.loads_bytes

No, only one function, dump_bytes(), is needed, and it would return a bytes object ("dumps" already means "dump string"). loads() can be polymorphic without creating a new function.

I don't think the functions taking file objects are used often enough to warrant a second API to deal with binary files.

It doesn't need to be documented as a completely separate module, it can just be a subsection in the json module docs with a reference to the relevant RFC.

It's still completely weird and unusual.

"Namespaces are a honking great idea; let's do more of those"

And also "flat is better than nested".

Especially when you're proposing that one API be at level N, and the other, closely related API be at level N+1.

serhiy-storchaka commented 10 years ago

Changing return type based on argument *values* is still a bad idea in general.

However, load() and loads() do this. ;)

It also makes it hard to plug the API in to generic code that is designed to work with any dump/load based serialisation protocol.

For dumps() it will be simple -- lambda x: json.dumps(x, encoding='utf-8'). For loads() it will be even simpler -- loads() will accept both strings and bytes.

Note that dumps() with the encoding parameter will be more 2.x compatible than the current implementation. This will help in writing compatible code.

terryjreedy commented 10 years ago

Changing return type based on argument *values* is still a bad idea in general.

I understand the proposal to be changing the return type based on argument *presence*. It strikes me as a convenient abbreviation for making a separate encoding call, and definitely (specifically?) less bad than a separate module or separate functions.

pitrou commented 10 years ago

To give another data point: returning a different type based on argument value is also what the open() function does, more or less.

(that said, I would slightly favour a separate dump_bytes(), myself)

gpshead commented 10 years ago

upstream simplejson (of which json is an earlier snapshot) has an encoding parameter on its dump and dumps methods. Let's NOT break compatibility with that API.

Our users use these modules interchangeably today, upgrading from stdlib json to simplejson when they need more features or speed without having to change their code.

simplejson's dumps(encoding=) parameter tells the module which encoding to use when decoding bytes objects found within the data structure (whereas Python 3.3's builtin json module, being older, doesn't even support that use case and raises a TypeError when bytes are encountered within the structure being serialized).

http://simplejson.readthedocs.org/en/latest/

A json.dump_bytes() function implemented as:

def dump_bytes(*args, **kwargs):
  return dumps(*args, **kwargs).encode('utf-8')

makes some sense... but it is really trivial for anyone to write that .encode(...) themselves.

A dump_bytes_to_file method that acts like dump() and calls .encode('utf-8') on all strs before passing them to the write call is also doable... but it seems easier to just let people use an existing io wrapper to do that for them, as they already are.
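
For illustration, that existing-io-wrapper approach might look like this (my example): layering io.TextIOWrapper over a binary stream lets the unmodified json.dump() produce UTF-8 bytes.

```python
import io
import json

buf = io.BytesIO()
# The text wrapper handles the str -> UTF-8 bytes step, so json.dump()
# itself never has to know about encodings.
text = io.TextIOWrapper(buf, encoding='utf-8')
json.dump({"hello": "world"}, text)
text.flush()  # push the encoded output through to the binary buffer

assert buf.getvalue() == b'{"hello": "world"}'
```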

As for load/loads, it is easy to allow them to accept bytes as input and assume they come UTF-8 encoded. simplejson already does this; json does not.

gpshead commented 10 years ago

So why not put a dump_bytes into upstream simplejson first, then pull in a modern simplejson?

There might be some default flag values pertaining to new features that need changing for stdlib backwards-compatible behavior, but otherwise I expect it's a good idea.

01e27b45-90f2-4c74-9e5e-7e7e54c3d78e commented 8 years ago

One of the problems is that decoding JSON is an FSM whose input is one symbol rather than one byte. AFAIK, Python still does not have an FSM for decoding a UTF-8 sequence, so iterative decoding of JSON will require more changes than expected.

01e27b45-90f2-4c74-9e5e-7e7e54c3d78e commented 8 years ago

In real life, I can confirm that porting from Python 2 to Python 3 is almost automatic, except for JSON-related fixes.

ncoghlan commented 8 years ago

I'm currently migrating a project that predates requests, and ended up needing to replace several "json.loads" calls with a "_load_json" helper that is just an alias for json.loads in Python 2, and is defined like this in Python 3:

    def _load_json(data):
        return json.loads(data.decode())

To get that case to "just work", all I would have needed is for json.loads to accept bytes input, and assume it is UTF-8 encoded, the same way simplejson does. Since there aren't any type ambiguities associated with that, I think it would make sense for us to go ahead and implement at least that much for Python 3.6.
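
Spelled out as a complete cross-version shim (my reconstruction of the arrangement described above):

```python
import json
import sys

if sys.version_info[0] >= 3:
    def _load_json(data):
        # Python 3: json.loads() did not accept bytes at the time,
        # so decode explicitly (JSON's default encoding is UTF-8).
        return json.loads(data.decode('utf-8'))
else:
    # Python 2: json.loads() already tolerates UTF-8 encoded input.
    _load_json = json.loads
```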

By contrast, if I'd been doing *encoding*, I don't think there's anything the Python 3 standard library could have changed *on its own* to make things just work - I would have needed to change my code somehow.

However, a new "dump_bytes" API could still be beneficial on that front as long as it was also added to simplejson: code that needed to run in the common Python 2/3 subset could use "simplejson.dump_bytes", while 3.6+ only code could just use the standard library version.

Having dump_bytes() next to dumps() in the documentation would also provide a better hook for explaining the difference between JSON-as-text-encoding (with "str" output) and JSON-as-wire-encoding (with "bytes" output after encoding the str representation as UTF-8).

In both cases, I think it would make sense to leave the non-UTF-8 support to simplejson and have the stdlib version be UTF-8 only.

serhiy-storchaka commented 8 years ago

Does dump_bytes() return bytes (similar to dumps()) or write to binary stream (similar to dump())?

ncoghlan commented 8 years ago

dump_bytes() would be a binary counterpart to dumps().

The dump() case is already handled more gracefully, as the implicit encoding to UTF-8 can live on the file-like object, rather than needing to be handled by the JSON encoder.

I'm still not 100% sure on its utility though - it's only "json.loads assuming binary input is UTF-8 encoded text would be way more helpful than the current behaviour" that I'm confident about. If the assumption is wrong, you'll likely fail JSON deserialisation anyway, and when it's right, the common subset of Python 2 & 3 has been expanded in a useful way.

So perhaps we should split the question into two issues? A new one for accepting binary data as an input to json.loads, and make this one purely about whether or not to offer a combined serialise-and-encode operation for the wire protocol use case?

ncoghlan commented 8 years ago

After hitting this problem again in another nominally single-source compatible Python 2/3 project, I created bpo-27765 to specifically cover accepting UTF-8 encoded bytes in json.loads().

ncoghlan commented 8 years ago

For 3.6, the decoding case has been handled via Serhiy's autodetection patch in bpo-17909.
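
For reference (my example of the 3.6 behaviour): json.loads() now accepts bytes and bytearray, autodetecting UTF-8, UTF-16, or UTF-32 per the RFC:

>>> import json
>>> json.loads(b'{"hello": "world"}')
{'hello': 'world'}
>>> json.loads('{"hello": "world"}'.encode('utf-16'))
{'hello': 'world'}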

6c3ac34f-0820-49a0-a9f4-42255a5731e4 commented 7 years ago

And for the *encoding* case? Can you just add the encoding argument back to json.dumps? Have it default to None because of backwards compatibility in Python 3, and continue to return strings by default...

... and then *everyone* that ever wants to *serialize* an object to JSON because they want to put it on a wire or w/e will hopefully someday learn that when you call json.dumps you *always* set encoding='utf-8', and it will always return UTF-8 encoded bytes (which is the same thing it would have done in py2 regardless)?

Is it confusing for the py3 encoding argument to mean something different than in py2? Probably. The encoding argument in py2 was there to tell the Encoder how to decode keys and values whose strings were actually UTF-8 encoded bytes. But w/e, py3 doesn't have that problem - so py3 can unambiguously hijack dumps' encoding param to mean bytes! Then, sure, maybe the fact I can write:

    sock.send(json.dumps(obj, encoding='utf-8'))

... in either language is just a happy coincidence - but it'd be useful nevertheless.

Or I could be wrong. I've not been thinking about this for 3 years. But I have bumped into this a couple of times in the years since starting to dream of Python 3.2^H4^H5^H6^H7 support - but until then I do seem to frequently forget json.dumps(obj).encode('utf-8'), so maybe my suggestion isn't really any better!?

stalkerg commented 1 year ago

Because orjson became popular and it uses only bytes for loads and dumps, I believe we should solve this somehow. Integration with frameworks and libraries has become a nightmare. In most cases, with the web and files, we do not need str; we are working with bytes. I agree we can't break backward compatibility, but why can't we just add one more function, dumpb? In that case the semantics will be complete and libraries can implement proper integration. (For orjson, we can add an encoding argument for dumps, and return the output as-is for dumpb.)
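
A minimal sketch of the suggested dumpb (the name comes from the comment above; the implementation is my assumption):

```python
import json

def dumpb(obj, **kwargs):
    # Hypothetical bytes counterpart to json.dumps(): serialise and
    # UTF-8 encode in one step, matching orjson.dumps()'s return type.
    return json.dumps(obj, **kwargs).encode('utf-8')

assert dumpb({"a": 1}) == b'{"a": 1}'
```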