matrix-org / python-canonicaljson

Canonical JSON
Apache License 2.0
31 stars 15 forks source link

Remove simplejson dependency #23

Open palango opened 5 years ago

palango commented 5 years ago

Currently the package pulls simplejson as a dependency.

This caused some unexpected trouble in https://github.com/raiden-network/raiden/issues/4174 . It also seems that since python 2.6 simplejson is no longer a big performance gain: https://github.com/kennethreitz/requests/issues/3052#issuecomment-197021335

So my idea would be to remove the dependency. If you agree I can create a PR.

richvdh commented 5 years ago

Hrm.

python 2.7 improved the performance of the stdlib json serialiser, but it did not integrate all of the optimisations from simplejson, and as the comment says, I still found it to be substantially slower than simplejson.

I'd therefore be reluctant to simply discard simplejson without further benchmarking. I'd be open to dropping the dep on simplejson if it could be shown that on recent (say, python 3.5) versions of the stdlib the performance was comparable with simplejson.

I'd also consider making the dependency optional, but am wary of introducing strange effects like that seen in https://github.com/kennethreitz/requests/issues/3052.

I'm not quite clear on what problem you're trying to solve: is simplejson failing to parse valid json?

richvdh commented 5 years ago

I guess a final option would be to use json by default but allow the caller to pass in an alternative json implementation as a keyword param.

konradkonrad commented 5 years ago

I'm not quite clear on what problem you're trying to solve: is simplejson failing to parse valid json?

No, the problem we experience is ultimately to blame on https://github.com/kennethreitz/requests/issues/3052 -- when trying to handle parsing exceptions, we mismatch on the exception type, because requests uses simplejson if available. Since pulling matrix-synapse as a dependency, this is the case now.

Unfortunately it doesn't look like https://github.com/kennethreitz/requests/pull/3056 will ever make it into the requests 2.x series, or be released before 2020 (see yellow note here), so politely asking to get rid of the dependency is our way of avoiding monkey patching or code arounds.

I guess a final option would be to use json by default but allow the caller to pass in an alternative json implementation as a keyword param.

Allowing simplejson as an extra could be a possibility, too, no?

palango commented 5 years ago

I'd therefore be reluctant to simply discard simplejson without further benchmarking. I'd be open to dropping the dep on simplejson if it could be shown that on recent (say, python 3.5) versions of the stdlib the performance was comparable with simplejson.

According to https://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/ json in python 3.5 is slightly faster than simplejson, but slower in python 2.7. SO I guess it depends on how important python 2.7 is for the user base.

Allowing simplejson as an extra could be a possibility, too, no?

That sounds like a good solution, keeping the option to use simplejsonn for python 2.7 users, but defaulting to json.

richvdh commented 5 years ago

According to https://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/ json in python 3.5 is slightly faster than simplejson, but slower in python 2.7. SO I guess it depends on how important python 2.7 is for the user base.

I looked at those metrics when I did the benchmarking I referred to earlier. What I found is that the performance of the various json libraries varied dramatically according to other options. In particular, stock json was much slower than simplejson with sort_keys=True (as used by canonicaljson). So I'm afraid those metrics aren't much use here.

Allowing simplejson as an extra could be a possibility, too, no?

That sounds like a good solution, keeping the option to use simplejsonn for python 2.7 users, but defaulting to json.

Isn't this what requests did, and which caused you all sorts of problems? I'm not enthusiastic about repeating that mistake. I'd prefer the idea of letting the caller actively choose to pass in an alternative implementation.

richvdh commented 4 years ago

@palango any further thoughts here?

richvdh commented 4 years ago

hrm, I did some benchmarking (based https://github.com/richvdh/json_benchmark).

using python 3.7.5 (seconds per loop):

                             dumps (large obj)   dumps (small objs)
json 2.0.9                               0.015                0.020
simplejson 3.17.0                        0.030                0.046

So yeah, we should probably switch back to stock json.

richvdh commented 4 years ago

it seems like there was a substantial change to the performance of the stdlib json somewhere between python 2.7 and 3.5. Now that 2.7 is unsupported, I'm not too worried about any performance regressions for it.

richvdh commented 4 years ago

as per the benchmarks on #9, this would also allow us to remove the _unascii magic.

richvdh commented 4 years ago

An update on the situation here: as of https://github.com/matrix-org/python-canonicaljson/pull/25, it is possible to use stdlib json rather than simplejson for encoding; however, we still pull it in with setup.py, so this wouldn't solve the original reporter's problem.

Properly dropping the simplejson dependency requires thought about how to do so without breaking compatibility with older applications that might pass bytes values into canonicaljson.encode.