swaroopch / edn_format

EDN reader and writer implementation in Python, using PLY (lex, yacc)
https://swaroopch.com/2012/12/24/edn-format-python/
Other
131 stars 31 forks source link

Add indent parameter in edn_format.dumps() #64

Closed thiagokokada closed 4 years ago

thiagokokada commented 4 years ago

This adds a pretty printer similar to json.dumps(indent=<int>). However, it does not follow Clojure formatting guidelines, instead formatting in a way more common to users from other languages like Python.

So it will convert this:

{"a": 1, "b": (1, 2, 3), "c": {"d": [1, 2, 3]}, "e": {1, 2, 3}}

To this:

{
  :a 1
  :b (
    1
    2
    3
  )
  :c {
    :d [
      1
      2
      3
    ]
  }
  :e #{
    1
    2
    3
  }
}

Instead of this:

{
  :a 1
  :b (1 2 3)
  :c {
    :d [1 2 3]
  }
  :e #{1 2 3}
}

This should be already better than the current status quo (that is, no pretty printer at all).

Should fix issue https://github.com/swaroopch/edn_format/issues/39 (unless the author of the issue wanted a more Clojure-like pretty printer).

And sorry about the messy code, I didn't found a way to implement it cleaner.

thiagokokada commented 4 years ago

I discovered a problem when using strings and any other data structure that is not a map (since my use case is mostly maps I didn't find this issue until now). If someone has a good solution for this I am all ears, since the solution I found is mostly duplicating the udump() logic.

Fixed, the solution is quite ugly though.

thiagokokada commented 4 years ago

I think this PR is ready to review.

thiagokokada commented 4 years ago

Since we have mutual recursion going on here, I'm curious if you know the performance impact?

Sorry, I don't. However I think it should be fine for small EDNs (that is my use case).

I can do some performance tests if you want.

swaroopch commented 4 years ago

Since we have mutual recursion going on here, I'm curious if you know the performance impact?

Sorry, I don't. However I think it should be fine for small EDNs (that is my use case).

I can do some performance tests if you want.

This is an optional step. Would appreciate if you can take this additional request :)

bfontaine commented 4 years ago

I have a strange bug on this EDN:

>>> edn = open("bench.edn").read()
>>> from edn_format import loads, dumps
>>> dumps(loads(edn)) == edn
True
>>> dumps(loads(dumps(loads(edn)))) == edn
True
>>> dumps(loads(dumps(loads(edn), indent=2))) == edn
False

dumping it after loading it works fine but when I add indent=2 all I get is a vector of keys.

Without indent=2 (same file): https://gist.githubusercontent.com/bfontaine/c3463de9f1ef3a802e8fdce84bfb3adb/raw/6214e56f4772483dd5b326b31c2ec6301816a65a/without-indent.edn With indent=2: https://gist.githubusercontent.com/bfontaine/c3463de9f1ef3a802e8fdce84bfb3adb/raw/6214e56f4772483dd5b326b31c2ec6301816a65a/with-indent.edn


I compared the performances of dumping this file 10 × 1000 times on both the master branch and this one (without indent):

# master
         5596006 function calls (5026006 primitive calls) in 4.095 seconds
         5596006 function calls (5026006 primitive calls) in 3.922 seconds
         5596006 function calls (5026006 primitive calls) in 3.918 seconds
         5596006 function calls (5026006 primitive calls) in 4.055 seconds
         5596006 function calls (5026006 primitive calls) in 4.188 seconds
         5596006 function calls (5026006 primitive calls) in 4.303 seconds
         5596006 function calls (5026006 primitive calls) in 4.044 seconds
         5596006 function calls (5026006 primitive calls) in 4.058 seconds
         5596006 function calls (5026006 primitive calls) in 3.940 seconds
         5596006 function calls (5026006 primitive calls) in 3.997 seconds
# this branch
         6071006 function calls (5501006 primitive calls) in 4.907 seconds
         6071006 function calls (5501006 primitive calls) in 4.670 seconds
         6071006 function calls (5501006 primitive calls) in 5.659 seconds
         6071006 function calls (5501006 primitive calls) in 5.580 seconds
         6071006 function calls (5501006 primitive calls) in 5.186 seconds
         6071006 function calls (5501006 primitive calls) in 4.431 seconds
         6071006 function calls (5501006 primitive calls) in 4.289 seconds
         6071006 function calls (5501006 primitive calls) in 4.438 seconds
         6071006 function calls (5501006 primitive calls) in 4.344 seconds
         6071006 function calls (5501006 primitive calls) in 4.361 seconds

On another, larger file (200 dumps):

# master
         6601806 function calls (5506206 primitive calls) in 4.298 seconds
         6601806 function calls (5506206 primitive calls) in 4.049 seconds
         6601806 function calls (5506206 primitive calls) in 4.004 seconds
         6601806 function calls (5506206 primitive calls) in 3.977 seconds
         6601806 function calls (5506206 primitive calls) in 4.483 seconds
         6601806 function calls (5506206 primitive calls) in 4.453 seconds
         6601806 function calls (5506206 primitive calls) in 3.814 seconds
         6601806 function calls (5506206 primitive calls) in 3.768 seconds
         6601806 function calls (5506206 primitive calls) in 3.846 seconds
         6601806 function calls (5506206 primitive calls) in 5.382 seconds
# this branch
         7682806 function calls (6587206 primitive calls) in 5.802 seconds
         7682806 function calls (6587206 primitive calls) in 4.989 seconds
         7682806 function calls (6587206 primitive calls) in 4.673 seconds
         7682806 function calls (6587206 primitive calls) in 4.730 seconds
         7682806 function calls (6587206 primitive calls) in 4.650 seconds
         7682806 function calls (6587206 primitive calls) in 4.689 seconds
         7682806 function calls (6587206 primitive calls) in 4.816 seconds
         7682806 function calls (6587206 primitive calls) in 4.785 seconds
         7682806 function calls (6587206 primitive calls) in 5.059 seconds
         7682806 function calls (6587206 primitive calls) in 4.717 seconds

All this (small) overhead is due to kwargs. If you add indent_step=0 to the arguments the overhead completely disappears:

diff --git a/edn_format/edn_dump.py b/edn_format/edn_dump.py
index cbd8669..1828ec5 100644
--- a/edn_format/edn_dump.py
+++ b/edn_format/edn_dump.py
@@ -92,7 +92,7 @@ def udump(obj,
           sort_keys=False,
           sort_sets=False,
           indent=None,
-          **kwargs):
+          indent_step=0):

     kwargs = {
         "string_encoding": string_encoding,
@@ -100,7 +100,7 @@ def udump(obj,
         "sort_keys": sort_keys,
         "sort_sets": sort_sets,
         "indent": indent,
-        "indent_step": kwargs.get("indent_step", 0)
+        "indent_step": indent_step,
     }

     if obj is None:
# on this branch, with the change above
        6601806 function calls (5506206 primitive calls) in 4.249 seconds
(etc)

My bench script, if you want to play with it: https://gist.githubusercontent.com/bfontaine/c3463de9f1ef3a802e8fdce84bfb3adb/raw/297050dc4b790fd5a2b0328397e67907afdcde57/bench.py

thiagokokada commented 4 years ago

I have a strange bug on this EDN:

Well, I will try to investigate however I don't know if I can fix. Even I don't really like this double recursion, and I will admit that is kind difficult to understand what is happening.

All this (small) overhead is due to kwargs. If you add indent_step=0 to the arguments the overhead completely disappears:

Well, back to a function argument I guess. Since the _indent_objs is now private this maybe fine, however now indent_step is an argument to udump (this may be fine though, since the actual API is dumps AFAIK).

thiagokokada commented 4 years ago

This is what it transformed the EDN:

[
  :_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
  :_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
  :_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
  :_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
  :_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
  :_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
]

So instead of parsing the array with maps it simply get the keys instead. I will try to investigate what is happening here.

bfontaine commented 4 years ago

Looks like it’s the seq call in if isinstance(obj, INDENTABLE_STRUCTURES): it’s meant for iterables, so when called on a dict it iterates over its keys rather than the pair of keys. You should add an indented version to the loads(dumps(…)) tests to catch that sort of thing: maybe modify check_roundtrip so it checks both loads(dumps(…)) and loads(dumps(…, indent=2)).

I also think it’s important to have code that’s easy to understand so one can easily extend it and/or fix it if there’s a bug. I can’t promess anything, but I should be able to look at this before the end of the week.

thiagokokada commented 4 years ago

I have a strange bug on this EDN:

Fixed. I also add the bench.edn file and created a test to ensure that this works as it should.

However, I don't like my solution (since it adds even another recursion), so if @bfontaine wants to try if there is a better way to fix it you're welcome.

thiagokokada commented 4 years ago

In commit dcb75e3, I add support for indent=0 case. Before, indent=0 and indent=None where the same, however now indent=0 pretty prints without any indentation.

This is the same behavior as json.dumps.

bfontaine commented 4 years ago

While reviewing I found another issue:

import edn_format
d = [[[1,[2,3],[[4,[5,[6],7]],8],9,[[10]],11],12],[13]]
print(edn_format.dumps(d, indent=2))
[
  [
    1
    2 3
    [
      4
      5 [
        6
      ] 7
    ] 8
    9
    [
      10
    ]
    11
  ] 12
  13
]

Also:

>>> print(edn_format.dumps([1,2,3],indent=2))
[
  1
  2
  3
]
>>> print(edn_format.dumps([[1,2,3]],indent=2))
[
  1 2 3
]

Edit: also:

>>> print(edn_format.dumps([{"a": 42}], keyword_keys=True))
[{:a 42}]
>>> print(edn_format.dumps([{"a": 42}], keyword_keys=True, indent=2))
[
  {
    "a" 42
  }
]
bfontaine commented 4 years ago

Looking into this I think it’d be a bit simpler if we first tidy up the way we pass around the dump config. Right now all these sort_keys/sort_sets/keyword_keys/etc are passed around in kwargs every time we call another function. It’s probably the time to stop using plain functions and use an EDNEncoder object, mirroring json.JSONEncoder:

# pseudo-code so you understand what I mean
class EDNEncoder(object):
    def __init__(self, sort_keys=False, sort_sets=False):
        self.sort_keys = sort_keys
        self.sort_sets = sort_sets

    def encode(self, obj):
        if is_collection(obj):
            # because all options are on the encoder we don't need to pass them
            # as arguments
            return '[{}]'.format(" ".join(self._encode_seq(obj)))
        else:
            return str(obj)

    def _encode_seq(self, objs):
        return [self.encode(o) for o in objs]

Then:

def dump(obj, **kwargs):
    return EDNEncoder(**kwargs).encode(obj)

It would make it easier to add more internal attributes such as indent_step.

@swaroopch Thoughts? If this is fine for you I can move forward and refactor the code using a class as above.

thiagokokada commented 4 years ago
>>> print(edn_format.dumps([{"a": 42}], keyword_keys=True, indent=2))
[
  {
    "a" 42
  }
]

This one is correct, the key and value inside a map shouldn't have a linebreak.

In all other examples, yeah, they're wrong.

thiagokokada commented 4 years ago
>>> print(edn_format.dumps([1,2,3],indent=2))
[
  1
  2
  3
]
>>> print(edn_format.dumps([[1,2,3]],indent=2))
[
  1 2 3
]

BTW, this same bug should also happen with tuples since the logic is the same.

bfontaine commented 4 years ago
>>> print(edn_format.dumps([{"a": 42}], keyword_keys=True, indent=2))
[
  {
    "a" 42
  }
]

This one is correct, the key and value inside a map shouldn't have a linebreak.

In all other examples, yeah, they're wrong.

No I mean the map key is not keywordized. It works if we dump a map but not when it’s inside a collection.

thiagokokada commented 4 years ago

No I mean the map key is not keywordized. It works if we dump a map but not when it’s inside a collection.

Ah ok, makes sense.

Well, the code kind works for my use case (I have mostly maps that seems to work well in the current iteration) and the algorithm is kind simple (just print the opening symbol at current indent level, indent all entries inside it and print closing symbol at previous current level), however what is making difficult to fix all those issues is this three functions that need coordination to do it (seq, _udump_indent_collection and udump).

I will probably wait for @bfontaine refactor to try again. Or if someone else wants to try and fix it, be my guest.

thiagokokada commented 4 years ago

Substitute PR: https://github.com/swaroopch/edn_format/pull/70