matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.82k stars 2.13k forks source link

switch back to stdlib json #7674

Closed richvdh closed 4 years ago

richvdh commented 4 years ago

I found my benchmark data and it looks like stdlib json is faster than simplejson for dumping data nowadays.

Using python 3.7.5 (seconds per loop: smaller is better):

Results
=======
                             loads (large obj)   loads (small objs)    dumps (large obj)   dumps (small objs)
json 2.0.9                               0.012                0.011                0.013                0.015
simplejson 3.17.0                        0.011                0.011                0.019                0.027
richvdh commented 4 years ago

see also: https://github.com/matrix-org/synapse/issues/3009, https://github.com/matrix-org/python-canonicaljson/issues/23

auscompgeek commented 4 years ago

Did a little experimenting, and it turns out that:

So if canonicaljson is changed to use the stdlib json, either it'll have to decode bytestrings it gets, or synapse needs a patch to stop passing bytes to canonicaljson.

auscompgeek commented 4 years ago

Actually it's even worse than that: some of the C-S handlers also return bytes, I just found /voipServer does.

richvdh commented 4 years ago

to demonstrate the problem:

>>> import simplejson
>>> simplejson.dumps(b"abc")
'"abc"'
>>> import json
>>> json.dumps(b"abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'bytes' is not JSON serializable
clokep commented 4 years ago

I think it was mentioned in #synapse-dev:matrix.org, but I believe the first step of this was to fix the places where Synapse passes bytes into canonicaljson.

richvdh commented 4 years ago

Next steps on this:

While we're working on canonicaljson, assuming we decide it is safe to switch to stock json, it would be good to revert the non-test parts of https://github.com/matrix-org/python-canonicaljson/pull/9, which, as per the comments there, are not required with stock json.

clokep commented 4 years ago

Looks like we also directly use simplejson in a bunch of places in Synapse. I'll clean those up too.

clokep commented 4 years ago

The build in #7803 passed, which means that current develop can run with the standard library JSON module.

I don't think we can just merge matrix-org/python-canonicaljson#24 and release canonicaljson though since older Synapse's might upgrade to this newer version of canonicaljson, which would break them.

We could release a Synapse which sets a maximum version of canonicaljson to the current version, then release canonicaljson and remove the max limit in Synapse? Not sure if there's a better way to do that.

clokep commented 4 years ago

A few more options from Rich (see conversation):

  • make canonicaljson import one of two libraries depending on what you set some global variable to

This seems plausible, although is a bit ugly.

  • flip around which level picks the json impl. So rather than doing from canonicaljson import json everywhere, have synapse pass in a json impl when it invokes CJ

This would work, but means we'll have to constantly recreate JSON encoders, which would probably have a performance impact. (Or I guess we could have an initialization function which would get called to initialize the encoders or something?)

  • have a whole new library called canonicaljson2

This would work fine, but will require changing the package and module names, updating Synapse to use canonicaljson2, and all of the imports.

  • vendor the 10 lines of python that we actually need into Synapse

I assume there was a good reason to have this be a separate library in the first place, not sure if those reasons are still valid.

I think the easiest way forward is the first option:

The major downside for any of the options that have us keeping support for older Synapses in canonicaljson is that matrix-org/python-canonicaljson#24 cannot be merged, we can tweak something so that most of the benefits are used though. The only way I see around that is releasing it as a canonicaljson2 library.

clokep commented 4 years ago

My current plan for this is:

I think this is tenable and isn't too much of a faff.

clokep commented 4 years ago

While workin gon #7670 I realized that this isn't fully done. We do use the standard library JSON for all all canonicaljson operations, but not for other operations.

There's a handful of places that we from canonicaljson import json, which actually ends up with the simplejson implementation. Two ways to fix this:

Any thoughts on which of these is the "right" way to go?

richvdh commented 4 years ago

I'm inclined towards the former. Seems simpler?

clokep commented 4 years ago

That was my thought too. I think importing from canonicaljson is actually just a bit confusing. Especially now that we explicitly declare we want to use the json from the standard library.